Why We Need Apache Spark

The Problem

Calculating the max temperature for each country is a neat task in itself, but it's hardly groundbreaking analysis. Real-world data carries with it more cumbersome schemas and more complex analyses, pushing us toward tools that fit our specific niches.

What if, instead of max temperature, we were asked to find the max temperature by country and city, and then asked to break that down by day? What if we mixed it up and were asked to find the country with the highest average temperature? Or what if we wanted to find a habitat niche where the temperature is never below 58 or above 68 degrees (Antananarivo, Madagascar doesn't seem so bad)?

MapReduce excels at batch data processing, but it lags behind when it comes to repeated analysis and tight feedback loops. The only way to reuse data between computations is to write it to an external storage system (e.g., HDFS). MapReduce also writes out the contents of its maps to disk with each job, before the reduce step even begins. The upshot is that each MapReduce job completes exactly one task, defined at its onset.

If we wanted to perform all of the above analyses, it would require three separate MapReduce jobs (a sketch of the first mapper follows the list):

  1. MaxTemperatureMapper, MaxTemperatureReducer, MaxTemperatureRunner
  2. MaxTemperatureByCityMapper, MaxTemperatureByCityReducer, MaxTemperatureByCityRunner
  3. MaxTemperatureByCityByDayMapper, MaxTemperatureByCityByDayReducer, MaxTemperatureByCityByDayRunner
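
To make the boilerplate concrete, here is a minimal sketch of what the first job's mapper might look like. It is written in Scala for consistency with the Spark example later on (Hadoop mappers are more commonly written in Java), and it assumes a hypothetical comma-separated record format of country, city, date, temperature, which is not specified in the article:

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper

    // Sketch of the first job's mapper. Assumes each input line looks like
    // "country,city,date,temperature" (hypothetical format).
    class MaxTemperatureMapper
        extends Mapper[LongWritable, Text, Text, IntWritable] {

      override def map(
          key: LongWritable,
          value: Text,
          context: Mapper[LongWritable, Text, Text, IntWritable]#Context
      ): Unit = {
        val fields = value.toString.split(",")
        // Emit (country, temperature); the reducer keeps the max per country.
        context.write(new Text(fields(0)), new IntWritable(fields(3).trim.toInt))
      }
    }

And this is only a third of the first job: a matching MaxTemperatureReducer and MaxTemperatureRunner are still required, and the ByCity and ByCityByDay variants each repeat all three classes.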

It’s apparent how easily this can run amok.

Data sharing is slow in MapReduce due to the very mechanics that make distributed file systems reliable: replication, serialization, and, most importantly, disk I/O. Many MapReduce applications spend up to 90% of their time reading from and writing to disk.

Having recognized this problem, researchers set out to develop a specialized framework that could accomplish what MapReduce could not: in-memory computation across a cluster of connected machines.

Spark: The Solution

Spark solves these problems for us. It provides tight feedback loops and allows us to process multiple queries quickly, with little overhead.

All three of the above mappers can be embedded into the same Spark job, outputting multiple results if desired. The commented-out lines in the sketch below could easily be used to set the correct key, depending on our specific job requirements.

Spark Implementation of MaxTemperatureMapper using RDDs
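
The original snippet embedded here has not survived; what follows is a minimal Scala sketch of what such a job might look like, assuming the same hypothetical country, city, date, temperature records as above. The commented-out lines show how the key could be swapped for the by-city and by-day variants:

    import org.apache.spark.{SparkConf, SparkContext}

    object MaxTemperature {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MaxTemperature"))

        // Hypothetical input: lines of "country,city,date,temperature".
        val records = sc.textFile("temperatures.csv").map(_.split(","))

        // Pick the key for the analysis at hand; the rest of the job is unchanged.
        val keyed = records.map(f => (f(0), f(3).toInt))                  // by country
        // val keyed = records.map(f => ((f(0), f(1)), f(3).toInt))      // by country and city
        // val keyed = records.map(f => ((f(0), f(1), f(2)), f(3).toInt)) // by country, city, and day

        val maxTemps = keyed.reduceByKey((a, b) => math.max(a, b))
        maxTemps.collect().foreach(println)

        sc.stop()
      }
    }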

Spark will also iterate up to 10x faster than MapReduce for comparable tasks, because it can keep intermediate data in memory between computations instead of writing it back to disk, a generally slow and expensive operation.
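
The catch is that Spark does not hold everything in memory automatically: reuse across computations is explicit, via cache() or persist(). A small sketch, reusing the hypothetical records RDD from above, shows how one parsed dataset can feed several of the analyses posed earlier without touching disk again:

    // Parse once and cache the result in executor memory.
    val records = sc.textFile("temperatures.csv").map(_.split(",")).cache()

    // Max temperature per country.
    val maxByCountry = records
      .map(f => (f(0), f(3).toInt))
      .reduceByKey((a, b) => math.max(a, b))

    // Average temperature per country, via (sum, count) pairs.
    val avgByCountry = records
      .map(f => (f(0), (f(3).toDouble, 1)))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    // The country with the highest average temperature.
    val hottest = avgByCountry.max()(Ordering.by[(String, Double), Double](_._2))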
