Resilient Distributed Datasets (RDDs) – Spark

By Sai Kumar on February 18, 2018

Spark implements a distributed data-parallel model called Resilient Distributed Datasets (RDDs).

Suppose you have a large dataset that can’t fit into memory on a single node. The idea is to:
-> Chunk up the data.
-> Distribute the chunks over a cluster of machines.
-> From there, think of your distributed data as a single collection.
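
The three steps above can be sketched in plain Scala, without Spark, to make the idea concrete. This is only a single-machine simulation: the object name `ChunkingSketch`, the sample data, and the chunk count are assumptions made for illustration, and each chunk merely stands in for a partition that Spark would place on a different machine.

```scala
object ChunkingSketch {
  // Hypothetical stand-in for a dataset too large for one node.
  val data: Vector[Int] = (1 to 1000).toVector

  // Step 1: chunk up the data. Each chunk plays the role of one partition.
  val numChunks: Int = 4
  val chunkSize: Int = math.ceil(data.size.toDouble / numChunks).toInt
  val chunks: Vector[Vector[Int]] = data.grouped(chunkSize).toVector

  // Step 2: "distribute" the work. Here every chunk is simply processed
  // independently; in a cluster each one would live on a different machine.
  val partialSums: Vector[Long] = chunks.map(chunk => chunk.map(_.toLong).sum)

  // Step 3: treat the pieces as one collection by combining the partial results.
  val total: Long = partialSums.sum
}
```

The caller only ever asks for `total`; how the data was split up is an internal detail, which is the view of distributed data that RDDs aim to give.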

RDDs are Spark’s distributed collections. They look a lot like immutable sequential or parallel Scala collections.

[code]abstract class RDD[T] {
  // Simplified signatures; the method bodies are elided (??? is a placeholder).
  def map[U](f: T => U): RDD[U] = ???
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ???
  def filter(f: T => Boolean): RDD[T] = ???
  def reduce(f: (T, T) => T): T = ???
}[/code]
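
As a rough usage sketch of these four methods, the snippet below chains them on a small RDD. It assumes Spark is on the classpath and runs in local mode. The object name `RddSketch`, the method `totalLongWordLength`, and the sample sentences are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Sums the lengths of all words longer than three characters.
  def totalLongWordLength(sc: SparkContext): Int = {
    // parallelize turns a local collection into an RDD[String].
    val lines = sc.parallelize(Seq(
      "spark makes rdds",
      "rdds are distributed collections"
    ))

    val words   = lines.flatMap(line => line.split(" ")) // RDD[String]
    val long    = words.filter(word => word.length > 3)  // RDD[String]
    val lengths = long.map(word => word.length)          // RDD[Int]

    // reduce returns a plain Int to the driver, not another RDD.
    lengths.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]") // assumption: local mode, no cluster required
      .getOrCreate()
    try println(totalLongWordLength(spark.sparkContext))
    finally spark.stop()
  }
}
```

Replacing `sc.parallelize(...)` with an ordinary `List` would leave the rest of the method body unchanged, which is the similarity the article is pointing at.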

Most operations on RDDs, like those on Scala’s immutable List and Scala’s parallel collections, are higher-order functions.
That is, they are methods that take a function as an argument and typically return another RDD.
While their signatures differ a bit, their semantics are the same:

Scala List	Spark RDD
map[B](f: A => B): List[B]	map[B](f: A => B): RDD[B]
flatMap[B](f: A => TraversableOnce[B]): List[B]	flatMap[B](f: A => TraversableOnce[B]): RDD[B]
filter(pred: A => Boolean): List[A]	filter(pred: A => Boolean): RDD[A]
reduce(op: (A, A) => A): A	reduce(op: (A, A) => A): A
fold(z: A)(op: (A, A) => A): A	fold(z: A)(op: (A, A) => A): A
aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B	aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B
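
The two function arguments of `aggregate` in the last row are easier to follow with a small simulation. On an RDD, `seqop` folds the elements inside each partition and `combop` merges the per-partition results. The sketch below imitates that in plain Scala, with nested sequences standing in for partitions. The object name `AggregateSketch` and the sample data are assumptions for illustration, and no Spark call is made.

```scala
object AggregateSketch {
  // Hypothetical partitions of one logical dataset.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))

  // Accumulator type B = (running sum, running count); element type A = Int.
  val zero: (Int, Int) = (0, 0)

  // seqop: (B, A) => B folds one element into an accumulator.
  val seqop: ((Int, Int), Int) => (Int, Int) =
    (acc, x) => (acc._1 + x, acc._2 + 1)

  // combop: (B, B) => B merges the accumulators of two partitions.
  val combop: ((Int, Int), (Int, Int)) => (Int, Int) =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  // Within each partition, apply seqop starting from the zero value.
  val perPartition: Seq[(Int, Int)] = partitions.map(_.foldLeft(zero)(seqop))

  // Across partitions, merge the partial results with combop.
  val result: (Int, Int) = perPartition.foldLeft(zero)(combop)

  val mean: Double = result._1.toDouble / result._2
}
```

Because the accumulator type `B` can differ from the element type `A`, `aggregate` can compute a sum and a count in a single pass, which `reduce` and `fold` cannot express directly.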

Using RDDs in Spark feels a lot like using normal Scala sequential or parallel collections, with the added knowledge that your data is distributed across machines.

Sai Kumar

An ambivert, music lover, enthusiast, artist, designer, coder, gamer, and content writer. He is a professional software developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, PHP, HTML, and CSS. Learn more about him at www.24tutorials.com/sai
