
Caching and Persistence – Apache Spark


Caching and Persistence-
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.
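The recomputation can be made visible with a small sketch. It assumes a local SparkContext (the app name and master below are placeholders) and uses a hypothetical helper, countMapCalls, that counts map invocations with an accumulator. Accumulator updates inside transformations are only reliable when no task is retried, which is a reasonable assumption for a tiny local run like this one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; app name and master are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("recompute-sketch").setMaster("local[1]"))

// Hypothetical helper: runs two actions on a mapped RDD and reports how many
// times the map function was invoked in total.
def countMapCalls(sc: SparkContext, cacheFirst: Boolean): Long = {
  val calls = sc.longAccumulator("mapCalls")
  val mapped = sc.parallelize(1 to 5).map { x => calls.add(1); x * x }
  val squares = if (cacheFirst) mapped.cache() else mapped
  squares.count() // first action: evaluates the map
  squares.count() // second action: re-evaluates the map unless the RDD is cached
  calls.sum
}

println(s"map calls without cache: ${countMapCalls(sc, cacheFirst = false)}")
println(s"map calls with cache:    ${countMapCalls(sc, cacheFirst = true)}")
```

Without caching, the map function runs once per element for every action; with caching, the second action reads the stored results instead.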

Spark allows you to control what is cached in memory.

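import org.apache.spark.rdd.RDD
val logs: RDD[String] = sc.textFile("/log.txt")
// persist() only marks the RDD for caching; nothing is computed yet
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
// first action: evaluates the filter and caches the partitions it computes
val firstnrecords = logsWithErrors.take(10)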

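Here, we cache logsWithErrors in memory.
persist() is lazy, so nothing is stored until the first action runs. Once firstnrecords has been computed, Spark keeps the partitions of logsWithErrors that it evaluated in memory (not firstnrecords itself, which is just a local array), so later operations on logsWithErrors can reuse them.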

val numErrors = logsWithErrors.count() //faster result

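Now, computing the count on logsWithErrors is faster, because partitions that are already cached are read from memory instead of being re-read from /log.txt and filtered again.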
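The same flow can be tried without a log file. The sketch below is an illustration under assumptions: it swaps /log.txt for a few hypothetical sample lines, creates a local SparkContext with a placeholder app name, and adds an unpersist() call to release the cache at the end.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Local context for illustration only; app name and master are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("persist-logs-sketch").setMaster("local[1]"))

// Hypothetical sample lines standing in for /log.txt.
val sampleLines = Seq(
  "INFO starting job",
  "ERROR disk full",
  "WARN slow response",
  "ERROR connection reset"
)

val logs = sc.parallelize(sampleLines)
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()

// persist() has recorded the storage level, but no job has run yet.
val levelBeforeAction = logsWithErrors.getStorageLevel

val firstRecords = logsWithErrors.take(10) // first action: computes and caches
val numErrors = logsWithErrors.count()     // can reuse the cached partitions

println(s"level=$levelBeforeAction, taken=${firstRecords.length}, errors=$numErrors")

// Release the cached blocks once the RDD is no longer needed.
logsWithErrors.unpersist()
```

Calling unpersist() when a cached RDD is no longer needed frees the memory for other data.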

There are many ways to configure how your data is persisted.
A dataset can be persisted:
-in memory as regular Java objects
-on disk as regular Java objects
-in memory as serialized Java objects (more compact)
-on disk as serialized Java objects (more compact)
-both in memory and on disk (spill over to disk to avoid re-computation)

cache-
Shorthand for using the default storage level, which is in memory only as regular Java objects.

persist-
Persistence can be customized with this method. Pass the storage level you’d like as a parameter to persist.
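As a minimal sketch of the two methods side by side, assuming a local SparkContext and made-up RDD names (cachedEvens, persistedOdds):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Local context for illustration only; app name and master are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("storage-level-sketch").setMaster("local[1]"))

val numbers = sc.parallelize(1 to 1000)

// cache() is persist() with the default level (MEMORY_ONLY for RDDs).
val cachedEvens = numbers.filter(_ % 2 == 0).cache()

// persist(level) selects a level explicitly: here, serialized objects in
// memory that spill to disk when they do not fit.
val persistedOdds = numbers.filter(_ % 2 != 0).persist(StorageLevel.MEMORY_AND_DISK_SER)

println(s"cachedEvens level:   ${cachedEvens.getStorageLevel}")
println(s"persistedOdds level: ${persistedOdds.getStorageLevel}")

// Actions trigger the actual computation and caching.
val evenCount = cachedEvens.count()
val oddCount = persistedOdds.count()
```

Once a storage level has been assigned to an RDD, Spark does not allow it to be changed to a different one; call unpersist() first if another level is needed.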

Level                 Space used   CPU time   In memory   On disk
MEMORY_ONLY           High         Low        Y           N
MEMORY_ONLY_SER       Low          High       Y           N
MEMORY_AND_DISK       High         Medium     Some        Some
MEMORY_AND_DISK_SER   Low          High       Some        Some
DISK_ONLY             Low          High       N           Y
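The memory and disk columns can be cross-checked against the flags that each StorageLevel constant exposes. This is a small sketch that needs only spark-core on the classpath; the names levels and rows are illustrative. For the MEMORY_AND_DISK levels both flags are true, which corresponds to "Some" in the table: partitions go to memory first and to disk only when they do not fit.

```scala
import org.apache.spark.storage.StorageLevel

// (name, level) pairs for the levels listed in the table.
val levels = Seq(
  "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
  "MEMORY_ONLY_SER"     -> StorageLevel.MEMORY_ONLY_SER,
  "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
  "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
  "DISK_ONLY"           -> StorageLevel.DISK_ONLY
)

// One row per level: may use memory, may use disk, stored in serialized form.
val rows = levels.map { case (name, level) =>
  (name, level.useMemory, level.useDisk, !level.deserialized)
}

rows.foreach { case (name, memory, disk, serialized) =>
  println(s"$name: memory=$memory, disk=$disk, serialized=$serialized")
}
```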

An ambivert, music lover, enthusiast, artist, designer, coder, gamer, and content writer. He is a professional software developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, PHP, HTML, and CSS. Know more about him at www.saikumar.me
