Spark Hadoop Comparison:
The below the comparison between spark and Hadoop.
Hadoop and Apache Spark are both big-data frameworks, but they don’t really serve the same purposes. Hadoop is essentially a distributed data infrastructure: It distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don’t need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn’t do distributed storage.
You can use one without the other:
Hadoop includes not just a storage component, known as the Hadoop Distributed File System, but also a processing component called MapReduce, so you don’t need Spark to get your processing done. Conversely, you can also use Spark without Hadoop. Spark does not come with its own file management system, though, so it needs to be integrated with one — if not HDFS, then another cloud-based data platform. Spark was designed for Hadoop, however, so many agree they’re better together.
Spark is speed in processing:
Spark is generally a lot faster than Map Reduce because of the way it processes data. While Map Reduce operates in steps, Spark operates on the whole data set in one fell swoop. The Map Reduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster. Spark, on the other hand, completes the full data analytics operations in-memory and in near real-time: Read data from the cluster, perform all of the requisite analytic operations, write results to the cluster. Spark can be as much as 10 times faster than Map Reduce for batch processing and up to 100 times faster for in-memory analytics.
Choose depends on requirement:
MapReduce’s processing style can be just fine if your data operations and reporting requirements are mostly static and you can wait for batch-mode processing. But if you need to do analytics on streaming data, like from sensors on a factory floor, or have applications that require multiple operations, you probably want to go with Spark. Most machine-learning algorithms, for example, require multiple operations. Common applications for Spark include real-time marketing campaigns, online product recommendations, cybersecurity analytics and machine log monitoring.
Hadoop is naturally resilient to system faults or failures since data are written to disk after every operation, but Spark has similar built-in resiliency by virtue of the fact that its data objects are stored in something called resilient distributed datasets distributed across the data cluster. These data objects can be stored in memory or on disks, and RDD provides full recovery from faults or failures.
|Apache Spark||Apache Hadoop|
|Easy to program does not require any abstractions.||Difficult to program and requires abstractions.|
|Programmers and perform streaming, batch processing and machine learning all in the same cluster.||It is used for generating reports that help find answers to historical queries.|
|Has in-built interactive mode.||No in-build interactive mode excepts tools like pig and hive.|
|Executes jobs 10 to 100 times faster than Hadoop Map Reduce.||Hadoop MapReduce does not leverage the memory of the hadoop cluster to the maximum.|
|Programmers can modify the data in real-time through spark streaming.||Allows to just process a batch of stored data.|