Caching and Persistence – Apache Spark


Caching and Persistence
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.

Spark allows you to control what is cached in memory.

[code lang="scala"]import org.apache.spark.rdd.RDD

// Keep only the lines containing "ERROR" and mark them for caching
val logs: RDD[String] = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstNRecords = logsWithErrors.take(10)[/code]

Here, we mark logsWithErrors to be cached in memory. Note that persist is lazy: nothing is stored until an action runs. Once firstNRecords is computed by take(10), Spark keeps the computed contents of logsWithErrors in memory for faster access in future operations, should we wish to reuse it.

[code lang="scala"]val numErrors = logsWithErrors.count() // faster result[/code]

Now, computing the count on logsWithErrors is much faster.

There are many ways to configure how your data is persisted. A data set can be persisted (each option maps to a StorageLevel constant, as sketched below):

- in memory as regular Java objects
- on disk as regular Java objects
- in memory as serialized Java objects (more compact)
- on disk as serialized Java objects (more compact)
- both in memory and on disk (spilling over to disk to avoid recomputation)
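
As a rough sketch, these options correspond to constants on org.apache.spark.storage.StorageLevel (these are the standard Spark level names; note that data written to disk is always stored in serialized form):

[code lang="scala"]import org.apache.spark.storage.StorageLevel

StorageLevel.MEMORY_ONLY         // regular Java objects in memory (the default)
StorageLevel.MEMORY_ONLY_SER     // serialized Java objects in memory (more compact)
StorageLevel.DISK_ONLY           // on disk only
StorageLevel.MEMORY_AND_DISK     // in memory, spilling to disk when it does not fit
StorageLevel.MEMORY_AND_DISK_SER // as above, but serialized while in memory[/code]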

The cache method is shorthand for using the default storage level, which is in memory only as regular Java objects.
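
A minimal sketch of that equivalence, reusing the logs RDD from above:

[code lang="scala"]import org.apache.spark.storage.StorageLevel

// These two requests are equivalent:
val viaCache   = logs.filter(_.contains("ERROR")).cache()
val viaPersist = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)[/code]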

Persistence can be customized with the persist method: pass the storage level you'd like as a parameter to persist.
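
For example, a sketch that keeps what fits in memory and spills the rest to disk, so those partitions are read back rather than recomputed:

[code lang="scala"]import org.apache.spark.storage.StorageLevel

// Partitions that do not fit in memory are written to disk
// and read back when needed, instead of being recomputed.
val spillable = logs.filter(_.contains("ERROR"))
                    .persist(StorageLevel.MEMORY_AND_DISK)[/code]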

The tradeoffs between the levels:

Level                 Space Used   CPU Time   In Memory   On Disk
MEMORY_ONLY           High         Low        Y           N
MEMORY_ONLY_SER       Low          High       Y           N
MEMORY_AND_DISK       High         Medium     Some        Some
MEMORY_AND_DISK_SER   Low          High       Some        Some
DISK_ONLY             Low          High       N           Y
