The primary concept behind big data analysis is parallelism, defined in computing as the simultaneous execution of processes. Parallelism is used mainly to make analysis faster, but also because some data sets are too dynamic, too large, or simply too unwieldy to be placed efficiently in a single relational database. Parallelism is therefore a very important concept when it comes to data processing.
Scala achieves data parallelism on a single compute node, which is the shared-memory case, while Spark achieves data parallelism in a distributed fashion, spread across multiple nodes, which makes processing large data sets much faster.
Shared Memory Data Parallelism (Scala) –
->Split the data
->Workers/threads independently operate on the data in parallel.
->Combine when done.
Scala's parallel collections provide a collections abstraction over shared-memory data-parallel execution, as in the sketch below.
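A minimal sketch of the split/operate/combine pattern, assuming Scala 2.13+ with the scala-parallel-collections module on the classpath:

```scala
import scala.collection.parallel.CollectionConverters._

object SharedMemoryParallelism extends App {
  // Split the data: .par wraps the collection so operations on it
  // are divided into chunks behind the scenes.
  val data = (1 to 1000000).toVector.par

  // Workers/threads independently operate on the chunks in parallel,
  // and the partial results are combined when done.
  val sumOfSquares = data.map(x => x.toLong * x).sum

  println(s"Sum of squares: $sumOfSquares")
}
```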
Distributed Data Parallelism (Spark) –
->Split the data over several nodes.
->Nodes independently operate on the data in parallel.
->Combine when done.
Now we have to worry about network latency between workers. Latency is the delay incurred when data has to travel over the network to be stored or retrieved on another node.
However, as with parallel collections, we can keep the same collections abstraction over distributed data-parallel execution, which is what Spark's RDDs provide; a sketch follows below.
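A minimal sketch of the same computation expressed with Spark's distributed collections (RDDs), run here with a local master purely for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistributedParallelism extends App {
  val conf = new SparkConf().setAppName("sum-of-squares").setMaster("local[*]")
  val sc   = new SparkContext(conf)

  // Split the data over several nodes: parallelize partitions the range
  // across the workers of the cluster.
  val data = sc.parallelize(1 to 1000000)

  // Nodes independently operate on their partitions in parallel;
  // reduce combines the partial results when done.
  val sumOfSquares = data.map(x => x.toLong * x).reduce(_ + _)

  println(s"Sum of squares: $sumOfSquares")
  sc.stop()
}
```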
Distribution introduces important concerns beyond what we had to worry about when dealing with parallelism in the shared-memory case:
Partial failure: crash failures of a subset of the machines involved in a distributed computation.
Latency: certain operations have a much higher latency than other operations due to network communication.
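To make the latency concern concrete, here is a hedged sketch (the RDD pairs is a hypothetical input) of why reduceByKey is usually preferred over groupByKey: the former combines values locally on each node before anything crosses the high-latency network link.

```scala
import org.apache.spark.rdd.RDD

// Word counting over a hypothetical RDD of (word, 1) pairs.
def wordCounts(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  // reduceByKey pre-aggregates within each partition, so far less data
  // has to cross the network during the shuffle.
  pairs.reduceByKey(_ + _)

// An equivalent but slower formulation: groupByKey ships every single
// (word, 1) pair across the network before summing on the receiving node.
// def wordCountsSlow(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
//   pairs.groupByKey().mapValues(_.sum)
```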