Resilient Distributed Datasets (RDDs) – Spark


Spark implements a distributed data-parallel model called Resilient Distributed Datasets (RDDs).

Suppose you have a large dataset that can’t fit into memory on a single node.
->Chunk up the data.
->Distribute the chunks over a cluster of machines.
->From there, think of your distributed data as a single collection.
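
The three steps above can be sketched in plain Scala, without Spark. This is only an illustration of the idea: the dataset, the cluster size of 4, and the names `ChunkingSketch`, `numNodes`, and `nodes` are all assumptions made up for this example, and the "cluster" is modelled as an ordinary Map from node id to chunk.

```scala
// Plain-Scala sketch of chunk -> distribute -> treat as one collection.
// No Spark involved; every name here is hypothetical.
object ChunkingSketch {
  // Stand-in for a dataset too large for one node.
  val data: Vector[Int] = (1 to 100).toVector

  // Step 1: chunk up the data into roughly equal pieces.
  val numNodes = 4 // assumed cluster size
  val chunkSize: Int = math.ceil(data.size.toDouble / numNodes).toInt
  val chunks: Vector[Vector[Int]] = data.grouped(chunkSize).toVector

  // Step 2: "distribute" one chunk per node (modelled as node id -> chunk).
  val nodes: Map[Int, Vector[Int]] =
    chunks.zipWithIndex.map { case (chunk, id) => id -> chunk }.toMap

  // Step 3: think of the pieces as a single collection.
  // Each node sums its own chunk, then the partial results are combined.
  val partialSums: Iterable[Int] = nodes.values.map(_.sum)
  val total: Int = partialSums.sum

  def main(args: Array[String]): Unit =
    println(s"chunks=${chunks.size}, total=$total")
}
```

The caller only ever asks for `total`; how the data was split and where each piece lives stays hidden, which is roughly the view an RDD is meant to give you.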

RDDs are Spark’s distributed collections. They look a lot like immutable sequential or parallel Scala collections.

[code]// Simplified view of the RDD API; method bodies are elided.
abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = …
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = …
  def filter(f: T => Boolean): RDD[T] = …
  def reduce(f: (T, T) => T): T = …
}
[/code]

Most operations on RDDs, like those on Scala’s immutable List and Scala’s parallel collections, are higher-order functions.
That is, they are methods on RDDs that take a function as an argument and typically return another RDD.
While their signatures differ a bit, their semantics are the same:

Scala List | Spark RDD
map[B](f: A => B): List[B] | map[B](f: A => B): RDD[B]
flatMap[B](f: A => TraversableOnce[B]): List[B] | flatMap[B](f: A => TraversableOnce[B]): RDD[B]
filter(pred: A => Boolean): List[A] | filter(pred: A => Boolean): RDD[A]
reduce(op: (A, A) => A): A | reduce(op: (A, A) => A): A
fold(z: A)(op: (A, A) => A): A | fold(z: A)(op: (A, A) => A): A
aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B | aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B
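
To make these signatures concrete, here is a small sketch that runs the listed operations on a local List. The word list and the names `AggregateSketch`, `seqop`, `combop`, and `partitions` are assumptions for illustration. The aggregate step is modelled by hand with foldLeft over two made-up partitions, rather than by calling a library `aggregate`, to show why it needs two functions: `seqop` folds elements inside one partition and `combop` merges the per-partition results.

```scala
// Plain-Scala sketch of the operations in the table; no Spark involved.
object AggregateSketch {
  val words: List[String] = List("spark", "rdd", "scala", "list", "map", "fold")

  // The same higher-order calls the table lists, on a local List.
  val lengths: List[Int] = words.map(_.length)               // map
  val chars: List[Char] = words.flatMap(_.toList)            // flatMap
  val longWords: List[String] = words.filter(_.length > 3)   // filter
  val totalViaReduce: Int = lengths.reduce(_ + _)            // reduce
  val totalViaFold: Int = lengths.fold(0)(_ + _)             // fold

  // aggregate, modelled by hand: the result type (Int) differs from the
  // element type (String), which reduce and fold cannot express.
  val seqop: (Int, String) => Int = (acc, w) => acc + w.length // within a partition
  val combop: (Int, Int) => Int = _ + _                        // across partitions
  val partitions: List[List[String]] = words.grouped(3).toList // two assumed partitions
  val totalViaAggregate: Int =
    partitions.map(_.foldLeft(0)(seqop)).foldLeft(0)(combop)

  def main(args: Array[String]): Unit =
    println(s"reduce=$totalViaReduce fold=$totalViaFold aggregate=$totalViaAggregate")
}
```

All three totals agree here because addition is associative and commutative, so the order in which partitions are combined does not matter.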

Using RDDs in Spark feels a lot like using normal Scala sequential or parallel collections, with the added knowledge that your data is distributed across machines.
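
As a rough side-by-side, the sketch below writes the same pipeline once for a List and once for an RDD. It assumes the spark-core library is on the classpath and runs Spark in local mode; the object name `RddVsList`, the app name, and the choice of 4 partitions are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddVsList {
  // Sum of the squares of the even numbers, written for a local List ...
  def onList(xs: List[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).reduce(_ + _)

  // ... and for an RDD. Only the collection type in the signature changes.
  def onRdd(xs: RDD[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // Local mode: Spark runs inside this JVM instead of on a real cluster.
    val conf = new SparkConf().setAppName("rdd-vs-list").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val local = (1 to 10).toList
      // parallelize splits the local collection into partitions (4 assumed here).
      val distributed = sc.parallelize(local, numSlices = 4)
      println(onList(local))
      println(onRdd(distributed))
    } finally sc.stop()
  }
}
```

Both calls should print the same number; the difference is that the RDD version evaluates the filter and map per partition before the partial results are reduced together.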


