Hadoop is a widely used large-scale batch data processing framework. It’s an open-source implementation of Google’s MapReduce.
MapReduce was ground-breaking because it provided:
-> a simple API (just map and reduce steps)
-> fault tolerance
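That "simple API" can be illustrated with plain Scala collections: word count, the canonical MapReduce example, needs only a map-like step and a reduce-like step. (This is an illustrative sketch, not Hadoop's actual API; the object and method names are made up.)

```scala
// Word count in the MapReduce style, sketched with plain Scala
// collections (illustrative only; not Hadoop's actual API).
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // "map" step: split lines into words and emit (word, 1) pairs
    val pairs = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(w => (w, 1))
    // "reduce" step: group by word and sum the counts per key
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("the quick fox", "the lazy dog"))
    println(counts("the")) // prints 2
  }
}
```

In real Hadoop, the "group by key" in the middle is the shuffle phase the framework performs between the user's map and reduce functions.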
Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all.
Hadoop/MapReduce + Fault Tolerance
Why is this important?
For 100s or 1000s of old commodity machines, the likelihood of at least one node failing midway through a job is very high.
Thus, Hadoop/MapReduce’s ability to recover from node failure enabled:
-> computations on unthinkably large data sets to run to completion.
Fault tolerance + simple API
At Google, MapReduce made it possible for an average Google software engineer to craft a complex pipeline of map/reduce stages on extremely large data sets.
Fault-tolerance in Hadoop/MapReduce comes at a cost.
Between each map and reduce step, in order to recover from potential failures, Hadoop/MapReduce shuffles its data and writes intermediate results to disk:
Reading/writing to disk: 1000x slower than in-memory
Network communication: 1,000,000x slower than in-memory
-> Spark: different strategy for handling latency (latency significantly reduced!)
Achieves this using ideas from functional programming!
Idea: Keep all data immutable and in-memory. All operations on data are just functional transformations, like on regular Scala collections. Fault tolerance is achieved by replaying these functional transformations over the original dataset.
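This idea can be sketched with ordinary Scala collections: keep the original data immutable, represent the computation as a chain of functional transformations (its "lineage"), and recover from a failure by replaying that chain over the original data. (An illustrative sketch under those assumptions, not Spark's actual RDD machinery; all names are made up.)

```scala
// Fault tolerance by replaying functional transformations over the
// original data (illustrative sketch; not Spark's actual RDD code).
object LineageSketch {
  // An immutable dataset: the original data plus the "lineage" of
  // transformations needed to recompute the current result from it.
  final case class Dataset[A, B](original: Seq[A], lineage: Seq[A] => Seq[B]) {
    def map[C](f: B => C): Dataset[A, C] =
      Dataset(original, lineage.andThen(xs => xs.map(f)))
    def filter(p: B => Boolean): Dataset[A, B] =
      Dataset(original, lineage.andThen(xs => xs.filter(p)))
    // On failure, recover by replaying the lineage over the original data.
    def recompute(): Seq[B] = lineage(original)
  }
  object Dataset {
    def from[A](data: Seq[A]): Dataset[A, A] = Dataset(data, identity)
  }

  def main(args: Array[String]): Unit = {
    val ds = Dataset.from(Seq(1, 2, 3, 4)).map(_ * 10).filter(_ > 15)
    // Nothing is ever mutated; a lost result is rebuilt by replaying:
    println(ds.recompute()) // prints List(20, 30, 40)
  }
}
```

Because every transformation is pure (no mutation), replaying the lineage always reproduces exactly the same result, which is what makes this recovery strategy sound.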
Result: Spark has been shown to be 100x more performant than Hadoop, while adding even more expressive APIs.