paper-spark

https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf

"retaining the scalability and fault tolerance of MapReduce."

"resilient distributed datasets (RDDs)". An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

Acyclic data flow graphs

"These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention."
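
The paper's running example is one such acyclic flow: a log file passes through filter and map operators into a reduce, and the system schedules the stages itself. Continuing the sketch above:

    val errs  = file.filter(_.contains("ERROR"))  // operator: keep only error lines
    val ones  = errs.map(_ => 1)                  // operator: one count per line
    val count = ones.reduce(_ + _)                // parallel operation: runs the whole flow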

Gradient descent

"Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent)."
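
A sketch in the spirit of the paper's logistic regression example, with the gradient computed via map/reduce rather than the paper's accumulators; parsePoint, the Vector type, D, and ITERATIONS are assumed helpers:

    val points = spark.textFile("hdfs://...").map(parsePoint).cache()  // the dataset reused every iteration
    var w = Vector.random(D)                                           // the parameter being optimized
    for (i <- 1 to ITERATIONS) {
      // each pass scans the same cached points to compute a gradient step
      val grad = points.map { p =>
        p.x * ((1 / (1 + math.exp(-p.y * (w dot p.x))) - 1) * p.y)
      }.reduce(_ + _)
      w -= grad
    }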

RDD lineage

"RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition."
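
A toy illustration of the idea (not Spark's actual implementation): each derived RDD keeps a reference to its parent and the function that derives it, so any single partition can be recomputed on demand:

    trait ToyRDD[T] { def compute(partition: Int): Iterator[T] }

    // Rebuilding partition i of the filtered dataset needs only partition i
    // of the parent plus the predicate; no global checkpoint is required.
    class FilteredToyRDD[T](parent: ToyRDD[T], pred: T => Boolean) extends ToyRDD[T] {
      def compute(partition: Int): Iterator[T] = parent.compute(partition).filter(pred)
    }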

Datasets can persist across operations

"Where Spark differs from other frameworks is that it can make some of the intermediate datasets persist across operations. For example, if we wanted to reuse the errs dataset, we could create a cached RDD from it."
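
The paper's example, lightly extended to show the effect; the cache is lazy, so errs is materialized in memory by the first parallel operation and reused afterwards:

    val cachedErrs = errs.cache()
    val count1 = cachedErrs.map(_ => 1).reduce(_ + _)  // first action computes errs and caches it
    val count2 = cachedErrs.map(_ => 1).reduce(_ + _)  // later actions reuse the cached partitions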

Finally, shipping tasks to workers requires shipping closures to them—both the closures used to define a distributed dataset, and closures passed to operations such as reduce. To achieve this, we rely on the fact that Scala closures are Java objects and can be serialized using Java serialization; this is a feature of Scala that makes it relatively straightforward to send a computation to another machine.
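
A minimal sketch of that mechanism in plain Scala, outside Spark: the compiler turns the closure into a serializable Java object, so standard Java serialization can write it out (any captured variables must themselves be serializable):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    val threshold = 10                                             // local value captured by the closure
    val pred: String => Boolean = line => line.length > threshold  // a Scala closure is a Java object

    val bytes = new ByteArrayOutputStream()
    new ObjectOutputStream(bytes).writeObject(pred)                // serialize it as a shipped task would be
    println(s"closure serialized to ${bytes.size} bytes")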

Interactive Spark

We used the Spark interpreter to load a 39 GB dump of Wikipedia in memory across 15 "m1.xlarge" EC2 machines and query it interactively. The first time the dataset is queried, it takes roughly 35 seconds, comparable to running a Hadoop job on it. However, subsequent queries take only 0.5 to 1 seconds, even if they scan all the data. This provides a qualitatively different experience, comparable to working with local data.
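
A hypothetical interpreter session in the spirit of that experiment (the path and search term are illustrative, not from the paper):

    val wiki = spark.textFile("hdfs://...").cache()
    wiki.count()                               // first query: reads from disk, then caches in memory
    wiki.filter(_.contains("Spark")).count()   // later full scans hit the in-memory copy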

Checkpointing vs. lineage

"While some DSM systems achieve fault tolerance through checkpointing [18], Spark reconstructs lost partitions of RDDs using lineage information captured in the RDD objects."

Lineage or provenance

"Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work."

Referring Pages

machine-learning-glossary
data-architecture-glossary
talk-advanced-spark-training