According to the Stack Overflow Developer Survey, Apache Spark is a hot, trending, and highly paid skill in the IT industry. Apache Spark is extremely popular in the big data analytics world. Here are the frequently asked Apache Spark interview questions to crack a Spark job in 2018.
- What is Apache Spark?
Apache Spark is a lightning-fast, in-memory (RAM) computation tool for processing big data stored in Hadoop's HDFS, in NoSQL stores, or on local file systems.
- What are the Spark Ecosystem components?
Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX
- Spark vs. MapReduce
a. Speed: Spark is ten to a hundred times faster than MapReduce.
b. Analytics: Spark supports streaming, machine learning, and complex analytics.
c. Spark is suitable for real-time processing, whereas MapReduce is suitable for batch processing.
d. Spark processes data in memory, whereas MapReduce persists intermediate results to local disk.
- What is RDD?
RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed in nature.
- How do RDDs provide fault tolerance?
Since RDDs are created through a series of transformations, Spark logs those transformations rather than the actual data. The graph of transformations that produces an RDD is called its lineage graph. If we lose a partition of an RDD, we can replay the transformations in the lineage on that partition to recompute it, rather than replicating data across multiple nodes.
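The recovery idea above can be sketched in plain Python (this is a conceptual toy, not Spark's actual API): log the transformations, and rebuild a lost partition by replaying them over the base data.

```python
# Toy sketch of lineage-based recovery: transformations are logged,
# not the computed data, so a lost partition can be recomputed.

base = list(range(10))                        # the "base dataset"
lineage = [lambda x: x * 2, lambda x: x + 1]  # logged transformations

def compute_partition(partition):
    """Replay the logged transformations over one partition's data."""
    for fn in lineage:
        partition = [fn(x) for x in partition]
    return partition

partitions = [base[0:5], base[5:10]]
result = [compute_partition(p) for p in partitions]

# Suppose partition 1 is lost: rebuild it from the base data and the
# lineage -- no replica of the computed data was ever needed.
recovered = compute_partition(base[5:10])
assert recovered == result[1]
```

The key point for interviews: recomputation trades a little CPU on failure for avoiding expensive data replication during normal operation.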
- Features of RDD / Why do we need RDDs?
a. In-memory computation: Spark RDDs support in-memory computation. Intermediate results are stored in distributed memory (RAM) instead of stable storage (disk).
b. Lazy evaluation: All transformations in Apache Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset.
c. Fault tolerance: Spark RDDs are fault tolerant, as they track lineage information to rebuild lost data automatically on failure.
d. Immutability: Immutable data is safe to share across processes. It can also be created or retrieved at any time, which makes caching, sharing, and replication easy.
e. Partitioning: The partition is the fundamental unit of parallelism in a Spark RDD. Each partition is one logical, immutable division of the data. We can create new partitions by applying transformations to existing ones.
f. Coarse-grained operations: Operations such as map, filter, or groupBy apply to all elements in the dataset.
- What are the Spark RDD operations?
RDD in Apache Spark Support two types of operations:
a. Transformations: Spark RDD transformations are functions that take an RDD as input and produce one or more RDDs as output. They do not change the input RDD, but always produce new RDDs by applying the computation, e.g. map(), filter(), reduceByKey(), etc. There are two kinds of transformations:
Narrow transformations: These result from operations such as map() and filter(), where the data needed to compute each output partition lives in a single partition of the parent RDD, i.e. each partition is self-sufficient. Spark groups narrow transformations into a single stage, an optimization known as pipelining.
Wide transformations: These result from functions such as groupByKey() and reduceByKey(). The data required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are also known as shuffle transformations because they require a shuffle of data across partitions.
b. Actions: An action in Spark returns the final result of the RDD computations. It triggers execution using the lineage graph to load the base data, carries out all intermediate transformations, and returns the final result to the driver program or writes it out to the file system.
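The lazy-transformation / eager-action split can be illustrated with plain Python generators (an analogy only, not Spark's API): the pipeline is merely recorded until something finally pulls results through it.

```python
# Analogy: generators record the computation ("transformations");
# nothing runs until an "action" (here, list()) consumes the pipeline.

data = range(1, 6)

# "Transformations": build the pipeline lazily; no element is processed yet.
mapped = (x * x for x in data)           # like rdd.map(lambda x: x * x)
filtered = (x for x in mapped if x > 5)  # like .filter(lambda x: x > 5)

# "Action": iterating the pipeline finally triggers the computation,
# analogous to collect() returning results to the driver.
result = list(filtered)   # [9, 16, 25]
```

This mirrors the interview answer: Spark does no work when you call map() or filter(); the whole chain executes only when an action such as collect() or count() is invoked.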
- How do we create RDDs in Spark?
There are 3 ways to create RDD in Apache Spark:
a. Using a parallelized collection (the parallelize() method).
b. From existing Apache Spark RDDs (by applying transformations).
c. From external datasets (e.g. the textFile() method).
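Conceptually, parallelizing a collection just means splitting it into roughly equal slices that can be processed on different nodes. A plain-Python sketch of that slicing idea (a hypothetical helper, not Spark's implementation):

```python
# Hypothetical sketch of what "parallelizing" a local collection means:
# cut it into num_slices roughly equal partitions.

def parallelize(data, num_slices):
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

parts = parallelize(list(range(10)), 3)
# parts -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

In real Spark, `sc.parallelize(data, numSlices)` performs this distribution for you and returns an RDD whose partitions live across the cluster.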
- What is Spark Partitioning?
Resilient Distributed Datasets are collections of data items so large that they cannot fit on a single node, so they are partitioned across various nodes. Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition is an atomic chunk of data stored on a node in the cluster.
- What is the use of partitioning in spark?
a. Apache Spark manages data through RDDs using partitions, which help parallelize distributed data processing with minimal network traffic for sending data between executors.
b. Communication is very expensive in distributed programming, so laying out data to minimize network traffic greatly improves performance. Just as a single-node program should choose the right data structure for a collection of records, a Spark program can control RDD partitioning to reduce communication. Partitioning is not helpful for every application: if an RDD is scanned only once, partitioning it gains little. But if a dataset is reused multiple times in key-oriented operations such as joins, partitioning the data will be helpful.
- What are the types of Partitioning in Apache Spark?
Spark has two types of partitioning techniques. One is HashPartitioner and the other is RangePartitioner.
a. HashPartitioner: HashPartitioner is based on Java's Object.hashCode(). The contract of hashCode() is that objects which are equal have the same hash code, so HashPartitioner assigns keys with the same hash code to the same partition. It is Spark's default partitioner: if we do not specify a partitioner, Spark uses HashPartitioner to partition the data.
b. RangePartitioner: If the records are sortable, a range partitioner divides them into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in. RangePartitioner first sorts the records by key and then divides them into the given number of partitions.
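The two strategies can be contrasted in a few lines of plain Python (a sketch of the ideas, not Spark's internals; the range boundaries here are hand-picked, whereas real Spark derives them by sampling):

```python
import bisect

def hash_partition(key, num_partitions):
    # HashPartitioner idea: partition = hash(key) % numPartitions,
    # so equal keys always land in the same partition.
    return hash(key) % num_partitions

def range_partition(key, bounds):
    # RangePartitioner idea: sorted boundary keys decide which
    # range (partition) a key falls into.
    return bisect.bisect_left(bounds, key)

# Hash: integer keys for a deterministic illustration.
assert hash_partition(10, 3) == hash_partition(10, 3)

# Range: hypothetical sampled boundaries splitting keys into 3 ranges.
bounds = ["c", "m"]
print(range_partition("apple", bounds))   # 0: sorts before "c"
print(range_partition("hello", bounds))   # 1: between "c" and "m"
print(range_partition("zebra", bounds))   # 2: after "m"
```

Note the trade-off this illustrates: hashing spreads keys evenly regardless of order, while range partitioning preserves sort order across partitions, which is what operations like sortByKey() need.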
- What is Spark Context?
SparkContext is the entry point to Spark functionality. The most important step of any Spark driver application is to create a SparkContext. It allows the Spark application to access the Spark cluster with the help of a resource manager.
- What is RDD persistence and caching in Spark?
Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation. Using it, we save intermediate results so that they can be reused if required.
- What is the difference between cache() and persist()?
The difference is that cache() always uses the default storage level, MEMORY_ONLY, while persist() lets us choose from various storage levels.
- What is the difference between Map and flatMap in Spark?
Map: map is a transformation operation in Apache Spark. It applies a function to each element of the RDD and returns the result as a new RDD. In a map operation, the developer can define custom business logic that is applied to every element, producing exactly one output element per input element.
FlatMap: flatMap is similar to map, but it allows returning 0, 1, or more elements from the map function. A flatMap function takes one element as input, processes it according to custom code, and returns 0 or more elements; the results are flattened into a single RDD.
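The difference is easiest to see with the classic word-splitting example, sketched here in plain Python (mirroring the semantics of Spark's map/flatMap, not calling Spark itself):

```python
# map: exactly one output element per input element -> a list of lists.
# flatMap: 0..n output elements per input, flattened into one sequence.
from itertools import chain

lines = ["hello world", "apache spark"]

mapped = [line.split(" ") for line in lines]
# -> [['hello', 'world'], ['apache', 'spark']]   (like rdd.map(...))

flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
# -> ['hello', 'world', 'apache', 'spark']        (like rdd.flatMap(...))
```

So map of a line-splitting function yields an RDD of lists, while flatMap yields an RDD of individual words, which is why word count examples always use flatMap.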
- What is In-memory Computing?
In in-memory computation, data is kept in random-access memory (RAM) instead of slow disk drives and is processed in parallel.