Caching and Persistence – Apache Spark


Caching and Persistence
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.
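To make the recomputation visible, here is a minimal sketch that counts how often a filter predicate runs across two actions. It assumes a local Spark installation on the classpath; the object name, the sample lines, and the accumulator name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecomputeDemo {
  // Returns how many times the filter predicate ran across two actions.
  def predicateCalls(): Long = {
    // Assumption: a local master with 2 threads is enough for this demo.
    val conf = new SparkConf().setMaster("local[2]").setAppName("recompute-demo")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("predicate calls")
      val lines = sc.parallelize(Seq("INFO start", "ERROR disk", "ERROR net", "INFO done"))
      // No persist() here, so every action re-runs the filter from scratch.
      val errors = lines.filter { line => calls.add(1L); line.contains("ERROR") }
      errors.count() // first action: the predicate runs on all 4 lines
      errors.count() // second action: the predicate runs on all 4 lines again
      calls.value
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"predicate calls: ${predicateCalls()}")
}
```

In a local run with no task retries, the counter should end at 8 (4 lines times 2 actions). Accumulator updates made inside transformations can be applied more than once when tasks are re-executed, so treat the number as a demonstration rather than a guarantee.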

Spark allows you to control what is cached in memory.

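[code lang="scala"]import org.apache.spark.rdd.RDD

val logs: RDD[String] = sc.textFile("/log.txt")
// persist() only marks the RDD for caching; nothing is computed until an action runs
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstnrecords = logsWithErrors.take(10)[/code]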

Here, we cache logsWithErrors in memory.
After firstnrecords is computed, Spark keeps the contents of logsWithErrors (more precisely, the partitions that the action had to compute) in memory, so later operations that reuse it get faster access.

[code lang="scala"]val numErrors = logsWithErrors.count() // faster result[/code]

Now, computing the count on logsWithErrors is much faster, because Spark can read the cached data instead of re-reading and re-filtering /log.txt.
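The effect of persist() can be checked with a small sketch that counts predicate calls across two actions on a persisted RDD. It assumes a local Spark installation; the object name, sample lines, and accumulator name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PersistDemo {
  // Returns how many times the filter predicate ran across two actions
  // when the filtered RDD is persisted.
  def predicateCalls(): Long = {
    // Assumption: local mode, with data small enough to fit in memory.
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-demo")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("predicate calls")
      val lines = sc.parallelize(Seq("INFO start", "ERROR disk", "ERROR net", "INFO done"))
      val errors = lines
        .filter { line => calls.add(1L); line.contains("ERROR") }
        .persist()
      errors.count() // first action: computes the filter and caches the result
      errors.count() // second action: served from the cache, predicate not re-run
      val seen: Long = calls.value
      errors.unpersist() // release the cached blocks once they are no longer needed
      seen
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(s"predicate calls: ${predicateCalls()}")
}
```

With the data cached after the first count(), the counter should stay at 4 in a local run. If the cached partitions did not fit in memory, Spark would recompute them and the number would be higher.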

There are many ways to configure how your data is persisted.
A data set can be persisted:
-in memory as regular Java objects
-on disk as regular Java objects
-in memory as serialized Java objects (more compact)
-on disk as serialized Java objects (more compact)
-both in memory and on disk (spill over to disk to avoid re-computation)
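As a rough guide to how these options map onto Spark's API, the sketch below prints the flags of several constants from org.apache.spark.storage.StorageLevel. The object and helper names are made up for illustration; the constants and the useMemory, useDisk, and deserialized fields are part of Spark's StorageLevel class.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlags {
  // A selection of predefined levels that roughly matches the list above.
  val levels: Seq[(String, StorageLevel)] = Seq(
    "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
    "MEMORY_ONLY_SER"     -> StorageLevel.MEMORY_ONLY_SER,
    "DISK_ONLY"           -> StorageLevel.DISK_ONLY,
    "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
    "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER
  )

  // Summarises where a level stores data and whether it keeps
  // deserialized (regular) Java objects.
  def describe(level: StorageLevel): String =
    s"memory=${level.useMemory} disk=${level.useDisk} deserialized=${level.deserialized}"

  def main(args: Array[String]): Unit =
    levels.foreach { case (name, level) => println(s"$name: ${describe(level)}") }
}
```

Note that in these constants DISK_ONLY is flagged as not deserialized, so data written to disk is stored in serialized form.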

cache(): shorthand for using the default storage level, which is in memory only as regular Java objects.

persist(): persistence can be customized with this method. Pass the storage level you’d like as a parameter to persist.
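A minimal sketch of both calls, assuming a local Spark installation: one RDD uses cache() and another passes an explicit level to persist(). The object name and the sample RDDs are hypothetical; getStorageLevel is used only to read back which level was assigned.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevelDemo {
  // Returns the storage levels assigned by cache() and by persist(level).
  def levels(): (StorageLevel, StorageLevel) = {
    // Assumption: local mode is enough to inspect the assigned levels.
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-level-demo")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(1 to 100)
      // cache() is equivalent to persist(StorageLevel.MEMORY_ONLY) for RDDs.
      val cached = nums.map(_ * 2).cache()
      // Serialized, in memory first, spilling to disk when memory runs out.
      val spilled = nums.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK_SER)
      (cached.getStorageLevel, spilled.getStorageLevel)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (viaCache, viaPersist) = levels()
    println(s"cache():   $viaCache")
    println(s"persist(): $viaPersist")
  }
}
```

Once a level has been assigned to an RDD, Spark does not let you switch it to a different one directly; call unpersist() first if another level is needed.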

Level | Space Used | CPU Time | In memory | On disk
