Caching and Persistence-
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
val logs: RDD[String] = sc.textFile("/log.txt") val logsWithErrors = logs.filter(_.contains("ERROR”)).persist() val firstnrecords = logsWithErrors.take(10)
Here, we cache logswithErrors in memory.
After firstnrecords is computed, Spark will store the contents of firstnrecords for faster access in future operations if we would like to reuse it.
val numErrors = logsWithErrors.count() //faster result
Now, computing the count on logsWithErrors is much faster.
There are many ways to configure how your data is persisted.
Possible to persist data set:-
-in memory as regular Java objects
-on disk as regular Java objects
-in memory as serialized Java objects (more compact)
-on disk as serialized Java objects (more compact)
-both in memory and on disk (spill over to disk to avoid re-computation)
Shorthand for using the default storage level, which is in memory only as regular Java objects.
Persistence can be customized with this method. Pass the storage level you’d like as a parameter to persist.
|Level||Space Used||CPU Time||In memory||On disk|