Spark

Transformation and Actions in Spark

transformations-and-actions-in-spark-24tutorials.jpg

Transformations and Actions –
Spark defines transformations and actions on RDDs.

Transformations – Return new RDDs as results.
They are lazy, Their result RDD is not immediately computed.
Actions – Compute a result based on an RDD and either returned or saved to an external storage system (e.g., HDFS).
They are eager, their result is immediately computed.
Laziness/eagerness is how we can limit network communication using the programming model.

Example –
Consider the following simple example:
val input: List[String] = List(“hi”,”this”,”is”,”example”)
val words = sc.parallelize(input)
val lengths = words.map(_.length)

Nothing happened on the cluster at this point, execution of map(a transformation) is deferred.
We need to add Action, to kick off the computation and get the results.

val input: List[String] = List(“hi”,”this”,”is”,”example”)
val words = sc.parallelize(input)
val lengths = words.map(_.length)
val totalChars = lengths.reduce(_+_)
println(totalChars)
Result is 15

Some Transformations –

map map[B](f: A -> B): RDD[B]
Apply function to each element in the RDD and return an RDD of the result.
flatMap flatMap[B](f: A -> TravarsableOnce[B]): RDD[B]
Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned.
filter filter(pred: A -> Boolean): RDD[A]
Apply predicate function to each element in the RDD and return an RDD of elements that have passed the predicate condition, pred.
distinct distinct(): RDD[B]
Return RDD with duplicates removed.

Some Actions-

collect collect(): Array[T]
Return all elements from RDD.
count count(): Long
Return the number of elements in the RDD.
take take(num: Int): Array[T]
Return the first num elements of the RDD.
reduce reduce(op: (A, A) -> A): A
Combine the elements in the RDD together using op function and return result.
foreach foreach(f: T -> Unit): Unit
Apply function to each element in the RDD.

Share This Post

Avatar
An Ambivert, music lover, enthusiast, artist, designer, coder, gamer, content writer. He is Professional Software Developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, php, html,css. Know more about him at www.saikumar.me

Lost Password

Register