why RDD are immutable? what is the reason behind it?

why RDD are immutable? what is the reason behind it?

Add Comment
3 Answer(s)

RDD are immutable, because they uses HDFS api .. and same for distributed also

Answered on February 3, 2017.
Add Comment

An RDD is an abstraction of a collection of data distributed on various node. An RDD is characterized by his lineage, the different operations needed to obtain this RDD from others. Also Spark is based on Scala data structures, and functional transformations of data. For these reasons spark uses immutable data structure.

RDDs composed of collection of records which are partitioned. Partition is basic unit of parallelism in a RDD, and each partition is one logical division of data which is immutable and created through some transformations on existing partitions.Immutability helps to achieve consistency in computations.
Users can define their own criteria for partitioning based on keys on which they want to join multiple datasets if needed.

 

When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as  Logistic Regression, K-means clustering, Page rank algorithms, it is fairly common to reuse or share the data among multiple jobs or it may involve multiple ad-hoc queries over a shared data set.This makes it very important to have a very good data sharing architecture so that we can perform fast computations.

 

 

 

Answered on February 3, 2017.
Add Comment

If RDD’s are mutable then it is not possible to maintain the fault tolerance.

I guess there is no separate mechanism for fault tolerance like data replication in spark, Due to benefit of immutability  we can re start from scratch if any failures.

Answered on February 15, 2017.
Add Comment

Your Answer

By posting your answer, you agree to the privacy policy and terms of service.