Read the Spark paper
If you are interested in Spark, you may have heard of the Spark paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. All the cool programmers reference these papers (GFS, MapReduce, BigTable), and while I certainly would not want anyone to confuse me with them, I thought I might give this Spark paper a read. Since Spark represents such a seemingly radical departure from MapReduce, I am curious how that is possible. Plus, I can't code, but I should be able to read...
Actually reading the Spark paper
Wow. I have no idea what any of these words mean (I may have been a bit optimistic about my reading ability). And worse, it is easy to see that the authors worked hard to keep things as simple as possible. Thankfully, Google comes to the rescue. Below are some terms I was not familiar with:
Fault-Tolerant - According to Wikipedia, fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. Okay, this makes sense. Hadoop has always been about handling node failure without losing data. So Spark will do this as well. Got it.
In-Memory Cluster Computing - This one we can discern as the sum of its parts. We understand in-memory to mean that data is loaded into RAM rather than read from disk (which makes it much, much faster), and we know cluster computing to mean lots of nodes set up to work together.
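I can't code, but I can at least squint at what "in-memory" looks like in practice. Here is a minimal sketch in Scala (the language the Spark paper uses) of caching a dataset in RAM so that later passes over it skip the disk. The file path and the SparkContext setup are placeholders I invented for illustration, not anything from the paper:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("in-memory-sketch").setMaster("local[*]"))

    // Load a (hypothetical) log file from disk and ask Spark to keep it in memory.
    val logs = sc.textFile("hdfs://example/logs.txt").cache()

    // The first pass reads from disk; the second reuses the in-memory copy.
    val errors = logs.filter(_.contains("ERROR")).count()
    val warnings = logs.filter(_.contains("WARN")).count()

    println(s"errors=$errors warnings=$warnings")
    sc.stop()
  }
}
```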
Iterative Algorithm - Have to go back to Wikipedia for this one: a "mathematical procedure that generates a sequence of improving approximate solutions for a class of problems". So rather than trying to solve a problem directly, an iterative algorithm makes some sort of guess and then continues to make guesses that are closer and closer to the correct answer. The guessing part is something I'm familiar with, although given my C in Calculus I can't claim to have consistently converged on the right answer.
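To make "guess, then guess better" concrete, here is a tiny Scala sketch of one classic iterative algorithm, Newton's method for square roots. This is my own toy example, nothing from the Spark paper:

```scala
// Newton's method: start with a guess and repeatedly refine it.
def sqrtIterative(x: Double, tolerance: Double = 1e-9): Double = {
  var guess = x / 2.0 // initial (probably bad) guess
  while (math.abs(guess * guess - x) > tolerance) {
    guess = (guess + x / guess) / 2.0 // each pass gets closer to sqrt(x)
  }
  guess
}

println(sqrtIterative(2.0)) // ~1.4142135623730951
```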
Interactive Data Mining Tools - Paraphrasing this paper, it appears that interactive refers to the fact that the user can guide the data mining algorithms. It is important to note for my buzzword-spewing business analyst friends that the data mining buzzword, used to describe any activity associated with analyzing data in Excel, is not what we are referring to here. By data mining, we mean the field of computer science associated with finding patterns in data. This work is done with algorithms. I'm not poo-pooing your Pivot Tables, you know I love them, but they're just not data mining tools.
Coarse-Grained Transformations - Thank you StackOverflow for this answer. It appears that coarse-grained transformations are the approach Spark uses (as opposed to fine-grained updates): Spark keeps track of your original data and the transformations you have applied to it, rather than saving each updated copy of the data itself. It sounds like in the Big Data world, saving each state of the data as it is updated is computationally expensive.
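Here is a rough Scala sketch of that idea as I understand it: each step below is a transformation applied to the entire dataset (the "coarse-grained" part), and Spark only records the recipe rather than materializing a new copy at every step. The file name and column layout are made up, and I'm assuming a SparkContext sc like the one above:

```scala
// Each step is a coarse-grained transformation over the whole dataset;
// Spark records the recipe (the lineage), not new copies of the data.
val raw     = sc.textFile("hdfs://example/sales.csv")           // base data in stable storage
val cleaned = raw.filter(line => line.nonEmpty)                 // transformation #1
val amounts = cleaned.map(line => line.split(",")(2).toDouble)  // transformation #2

// If a piece of 'amounts' is lost, Spark can replay the transformations
// on the original data to rebuild it. Nothing runs until an action:
val total = amounts.sum()
```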
Dryad - Dryad is mentioned several times throughout the Spark paper as a project Spark compares itself to. I had never heard of it (no surprise, of course), and it turns out Dryad was a Microsoft research project for working with Big Data. According to the Wikipedia page, Microsoft dropped Dryad in 2011 in favor of Hadoop. I think the interesting tidbit from that Wikipedia entry is the mention of another term I don't know - DAG.
DAG - Directed Acyclic Graph. A directed graph with no directed cycles (what? I think I was sick the day they covered this in Math class). DAGs can be used to efficiently model data. Okay, I'm cool with that. But do we see DAGs in Spark? Yes, we do! Spark uses a DAG engine to perform and track the transformations on the data. Whereas MapReduce has two specified steps (Map & Reduce), Spark's DAG engine allows any number of steps. According to the linked article, this helps make Spark faster, although admittedly I'm not sure why. Could you not write a MapReduce job that only included the Map phase? At least in Pig I think I've done that.
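I can at least sketch what "any number of steps" looks like. In this contrived Scala example (my own, with invented paths, again assuming a SparkContext sc), one dataset feeds two different downstream computations, so the plan Spark tracks is a graph rather than a fixed Map-then-Reduce pair:

```scala
// A small DAG: 'words' has two children, so the plan is a graph, not a line.
val lines     = sc.textFile("hdfs://example/book.txt")
val words     = lines.flatMap(_.split("\\s+"))              // step 1
val counts    = words.map(w => (w, 1)).reduceByKey(_ + _)   // branch A: word counts
val longWords = words.filter(_.length > 10)                 // branch B: long words only

counts.saveAsTextFile("hdfs://example/counts")              // hypothetical output paths
longWords.saveAsTextFile("hdfs://example/long-words")
```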
Resilient Distributed Datasets - We can't have a list of Spark terms without the eponymous RDD. According to the Spark paper, an RDD is a read-only, partitioned collection of records. At a higher level, this means RDDs are fault-tolerant and can be operated on in parallel. (At least I think I can make that jump...) You can create RDDs from data in stable storage or from other RDDs.
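As best I can tell, those creation routes map onto the Scala API like this (a sketch with invented data; sc.parallelize, which builds an RDD from an in-memory collection, is a third route I found in the Spark docs, not one the paper calls out):

```scala
// An RDD from data in stable storage.
val fromStorage = sc.textFile("hdfs://example/records.txt")

// An RDD from an existing in-memory collection.
val fromCollection = sc.parallelize(Seq("a", "b", "c"))

// An RDD derived from another RDD. Note that 'fromStorage' is never
// modified -- RDDs are read-only, so each transformation returns a new one.
val upperCased = fromStorage.map(_.toUpperCase)
```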
Okay. I think we have covered enough Spark terms to get through some of this paper (at least the title). I think I have a better understanding of Spark, but I'm unsure as yet whether that will improve my coding. We shall see.