Hadoop/MapReduce Vs Spark



Hadoop is a widely used framework for large-scale batch data processing. It is an open-source implementation of Google's MapReduce.
MapReduce was ground-breaking because it provided:
-> a simple API (just map and reduce steps)
-> fault tolerance
Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all.
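To show how small that API surface is, here is a minimal sketch of the model using plain Scala collections as a stand-in for the framework (this is not Hadoop's actual Java API): the map step emits key/value pairs, the framework groups them by key, and the reduce step combines each group.

```scala
// Word count expressed as MapReduce's two steps, using plain Scala
// collections to illustrate the shape of the model.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = List("spark and hadoop", "hadoop and mapreduce")

    // Map step: emit a (word, 1) pair for every word in every line.
    val mapped: List[(String, Int)] =
      lines.flatMap(line => line.split(" ").map(word => (word, 1)))

    // Shuffle step: group the pairs by key (the framework does this for you).
    val grouped: Map[String, List[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce step: combine the values for each key.
    val counts: Map[String, Int] =
      grouped.map { case (word, ones) => (word, ones.sum) }

    counts.foreach(println) // (spark,1), (and,2), (hadoop,2), (mapreduce,1)
  }
}
```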

Hadoop/MapReduce + Fault Tolerance
Why is this important?
On 100s or 1000s of old commodity machines, the likelihood of at least one node failing midway through a job is very high.
Thus, Hadoop/MapReduce’s ability to recover from node failure enabled:
-> computations on unthinkably large data sets to run to completion.
Fault tolerance + simple API
At Google, MapReduce made it possible for an average Google software engineer to craft a complex pipeline of map/reduce stages on extremely large data sets.

Why Spark?

Fault-tolerance in Hadoop/MapReduce comes at a cost.
Between each map and reduce step, in order to recover from potential failures, Hadoop/MapReduce shuffles its data and writes intermediate results to disk:
Reading/writing to disk: 1,000x slower than in-memory
Network communication: 1,000,000x slower than in-memory
Spark, by contrast:
-> retains fault tolerance
-> uses a different strategy for handling latency (latency significantly reduced! see the sketch below)
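To make that strategy concrete, here is a minimal sketch, assuming a local Spark setup; the input file name (logs.txt) and log format are hypothetical. The key point is that cache() keeps the filtered data in memory, so later actions reuse it instead of going back to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of Spark's strategy: do repeated work in memory.
object InMemorySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("in-memory-sketch").setMaster("local[*]"))

    val errors = sc.textFile("logs.txt")   // hypothetical input file
      .filter(_.contains("ERROR"))
      .cache()                             // keep the filtered data in memory

    // Both actions below reuse the cached data; a MapReduce-style job
    // would write intermediate results to disk and re-read them each pass.
    println(errors.count())
    errors.take(5).foreach(println)

    sc.stop()
  }
}
```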

Spark achieves this using ideas from functional programming!

Idea: Keep all data immutable and in-memory. All operations on data are just functional transformations, like regular Scala collections. Fault tolerance is achieved by replaying functional transformations over the original dataset.
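A minimal sketch of this idea in Spark (assuming a local SparkContext; the dataset and names are illustrative): each RDD is immutable, the transformations read like regular Scala collection operations, and Spark records the chain of transformations (the lineage) so it can replay them if a partition is lost.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Immutable RDDs + lazy functional transformations + lineage-based recovery.
object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000)   // the original dataset
    val squares = numbers.map(n => n * n)     // transformation 1 (lazy)
    val evens   = squares.filter(_ % 2 == 0)  // transformation 2 (lazy)

    // If a node holding partitions of `evens` fails, Spark replays
    // map and filter over the affected partitions of `numbers`.
    println(evens.count())                    // action triggers execution

    sc.stop()
  }
}
```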
Result: Spark has been shown to be up to 100x more performant than Hadoop, while adding even more expressive APIs.
