Why are RDDs immutable? What is the reason behind this design?
An RDD is an abstraction of a collection of data distributed across various nodes. An RDD is characterized by its lineage: the sequence of operations needed to derive it from other RDDs. Spark is also built on Scala's data structures and on functional transformations of data. For these reasons, Spark uses immutable data structures.
RDDs are composed of collections of records that are partitioned. A partition is the basic unit of parallelism in an RDD; each partition is one logical division of the data, which is immutable and created through transformations on existing partitions. Immutability helps achieve consistency in computations: a lost partition can always be recomputed by replaying its lineage, because the inputs are guaranteed not to have changed.
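As a rough illustration, here is a toy sketch in plain Python (not Spark's actual implementation; the class and method names are made up) of how an immutable lineage chain works: each node records only its parent and the transformation to apply, so data can always be recomputed deterministically from the source.

```python
# Toy model of lineage-based, immutable datasets; NOT Spark's real API.
class ToyRDD:
    """A node in a lineage chain: holds a parent and a transformation,
    never mutating existing data."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data = tuple(data) if data is not None else None  # immutable source
        self._parent = parent
        self._fn = fn

    def map(self, fn):
        # Transformations never modify this node; they return a new node
        # whose lineage points back here.
        return ToyRDD(parent=self, fn=lambda xs: [fn(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Materialize by replaying the lineage from the source.
        if self._parent is None:
            return list(self._data)
        return self._fn(self._parent.collect())

base = ToyRDD(data=[1, 2, 3, 4])
doubled = base.map(lambda x: x * 2)      # new node; base is untouched
evens = doubled.filter(lambda x: x > 4)  # another new node
print(base.collect())   # [1, 2, 3, 4]  (source unchanged)
print(evens.collect())  # [6, 8]
```

Because `base` can never change, recomputing `evens` at any later time (for instance after losing a partition) is guaranteed to produce the same result.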
Users can define their own partitioning criteria based on the keys on which they want to join multiple datasets, if needed.
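A minimal sketch of what key-based partitioning amounts to (plain Python with a hypothetical helper name; Spark's own HashPartitioner follows a similar idea): records are assigned to partitions by hashing their key, so two datasets partitioned the same way can be joined partition by partition.

```python
def partition_by_key(records, num_partitions, key_fn=lambda kv: kv[0]):
    """Assign each (key, value) record to a partition by hashing its key.
    Hypothetical helper, not a Spark API."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(key_fn(rec)) % num_partitions].append(rec)
    return parts

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_by_key(pairs, 2)
# All records sharing a key land in the same partition, so a join with
# another dataset partitioned by the same scheme needs no extra shuffle.
```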
When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as logistic regression, K-means clustering, or the PageRank algorithm, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared dataset. This makes a good data-sharing architecture very important for fast computation.
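Immutability is what makes this kind of sharing cheap and safe: a cached dataset can be handed to many jobs without defensive copies, because nobody can change it. A toy sketch of that idea (plain Python, invented names; in real Spark you would use `rdd.cache()` or `rdd.persist()`):

```python
# Toy sketch of caching a shared, immutable dataset for iterative jobs.
# Hypothetical class; Spark's real mechanism is cache()/persist().
class CachedDataset:
    def __init__(self, compute_fn):
        self._compute_fn = compute_fn  # expensive lineage replay
        self._cache = None
        self.compute_count = 0         # how many times we actually computed

    def get(self):
        # First access materializes the data; later jobs reuse the same
        # immutable copy, which is safe precisely because it cannot change.
        if self._cache is None:
            self._cache = tuple(self._compute_fn())
            self.compute_count += 1
        return self._cache

shared = CachedDataset(lambda: [x * x for x in range(5)])
for _iteration in range(3):   # e.g. three passes of an iterative algorithm
    data = shared.get()       # the expensive computation runs only once
print(shared.compute_count)   # 1
```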