“I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. — Bill Gates”
All transformations in Spark are
lazy. This means that when we tell Spark to create an RDD via transformations of an existing RDD, it won’t generate that dataset until a specific action is performed on it or one of it’s children. Spark will then perform the transformation and the action that triggered it. This allows Spark to run much more efficiently.
Let’s re-examine the function declarations from our earlier Spark example to identify which functions are actions and which are transformations:
weatherData = sc.textFile(inputPath);
Line 16 is neither an action or a transformation; it’s a function of sc, our JavaSparkContext.
tempsByCountry = weatherData.mapToPair(new Func.....
Line 17 is a transformation of the weatherData RDD, in it we map each line of weatherData to a pair comprised of (City, Temperature)
maxTempByCountry = tempsByCountry.reduce(new Func....
Line 26 is also a transformation because we are iterating over key-value pairs. its a transformation of tempsByCountry in which we reduce each city to its highest recorded temperature.
31: maxTempByCountry.saveAsHadoopFile(destPath, String.class, Integer.class, TextOutputFormat.class);
Finally on line 31 we trigger a Spark action: saving our RDD to our file system. Since Spark subscribes to the lazy execution model, it isn’t until this line that Spark generates weatherData, tempsByCountry, and maxTempsByCountry before finally saving our result.
Directed Acyclic Graph
Whenever an action is performed on an RDD, Spark creates a DAG, a finite direct graph with no directed cycles (otherwise our job would run forever). Remember that a graph is nothing more than a series of connected vertices and edges, and this graph is no different. Each vertex in the DAG is a Spark function, some operation performed on an RDD (map, mapToPair, reduceByKey, etc).
In MapReduce, the DAG consists of two vertices:
Map → Reduce.
In our above example of MaxTemperatureByCountry, the DAG is a little more involved:
parallelize → map → mapToPair → reduce → saveAsHadoopFile
The DAG allows Spark to optimize its execution plan and minimize shuffling. We’ll discuss the DAG in greater depth in later posts, as it’s outside the scope of this Spark overview.