1. What is the version of Spark you are using?
Check the Spark version you are using before going to an interview. As of 2020, the latest version of Spark is 2.4.x.
2. Difference between RDD, DataFrame, Dataset?
RDD – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform in-memory computations on large clusters in a fault-tolerant manner.
Compared with DataFrames and Datasets, an RDD does not hold a schema; it holds only the data. If users want to impose a schema on an RDD, they have to create a case class and apply it over the data.
We use RDDs in the following cases:
-When our data is unstructured, such as streams of text or media.
-When we don't want to impose any schema.
-When we don't care about column name attributes while processing or accessing the data.
-When we want to manipulate the data with functional programming constructs rather than domain-specific expressions.
-When we want low-level transformations, actions, and control over the dataset.
DataFrame –
-Like RDDs, DataFrames are immutable collections of data.
-Unlike RDDs, DataFrames have a schema for their data, making it easy for users to access and process large datasets distributed among the nodes of a cluster.
-DataFrames provide a domain-specific-language API to manipulate distributed data, making Spark accessible to a wider audience beyond specialized data engineers.
-From Spark 2.x, a DataFrame is nothing but Dataset[Row], i.e. an alias (the untyped API). Consider a DataFrame as a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object.
Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java. To apply a case class to an RDD and use it as a Dataset[T], the following is required:
Since we are using Spark 2.x, use import spark.implicits._
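The steps above can be sketched as follows (a minimal sketch, assuming an existing SparkSession named spark; Person is a hypothetical case class):

```scala
// Hypothetical case class that defines the schema for the typed Dataset
case class Person(name: String, age: Int)

// Brings in the encoders needed for Dataset[T]
import spark.implicits._

// An untyped RDD of tuples
val rdd = spark.sparkContext.parallelize(Seq(("Alice", 29), ("Bob", 31)))

// Map each record into the case class, then convert to a typed Dataset
val ds: org.apache.spark.sql.Dataset[Person] =
  rdd.map { case (name, age) => Person(name, age) }.toDS()
```

Note that in spark-shell the case class should be defined at the top level so the encoder can be derived for it.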
3. When to use DataFrame and when to use Dataset?
– From Spark 2.x, DataFrames are nothing but Dataset[Row], an alias (the untyped API), while Dataset[T] is the strongly-typed API.
– Each row in a Dataset is represented by a user-defined object, so you can refer to an individual column as a member variable of the object. This gives us compile-time type safety.
– A Dataset has helpers called encoders, which are smart and efficient encoding utilities that convert the data inside each user-defined object into a compact binary format.
– This translates into a reduction of memory usage if and when a Dataset is cached in memory, and a reduction in the number of bytes Spark needs to transfer over the network during shuffles.
4. Optimization techniques in spark?
• Making use of Spark's built-in transformation functions wherever possible.
• Making use of partitioning methods such as HashPartitioner and RangePartitioner.
• Tuning the spark-submit parameters (e.g. enabling dynamic executor allocation, increasing the cores and memory of the driver and executors).
• Choosing an appropriate storage level – persist or cache.
• Using coalesce and repartition to control the number of partitions.
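A few of these techniques in code (an illustrative sketch, assuming an existing DataFrame named df):

```scala
import org.apache.spark.storage.StorageLevel

// Persist: keep a frequently reused DataFrame in memory, spilling to disk if needed
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)

// Repartition: increase parallelism for a wide stage (triggers a full shuffle)
val wide = cached.repartition(200)

// Coalesce: reduce partitions without a full shuffle, e.g. before writing output
val narrow = wide.coalesce(10)
```

repartition is preferred when increasing the partition count; coalesce avoids a shuffle and is the cheaper choice when only reducing it.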
5. Ways to create dataframe?
• We create an RDD, apply a case class or struct to it, import spark.implicits._, and use the toDF method to create a DataFrame.
• We can also make use of spark.read.json, spark.read.csv, and other sources to create a DataFrame.
• We can also create a DataFrame on top of RDBMS tables and Hive tables.
• We can also create a DataFrame on top of an HBase table by making use of newAPIHadoopRDD and a case class.
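The first three approaches can be sketched as follows (paths, table names, and connection details are placeholders, and a SparkSession named spark is assumed):

```scala
import spark.implicits._

// 1. From an RDD with a case class and toDF
case class Employee(id: Int, name: String)
val df1 = spark.sparkContext
  .parallelize(Seq(Employee(1, "Alice"), Employee(2, "Bob")))
  .toDF()

// 2. From file sources (placeholder paths)
val df2 = spark.read.json("/path/to/data.json")
val df3 = spark.read.option("header", "true").csv("/path/to/data.csv")

// 3a. From an RDBMS table over JDBC (placeholder connection details)
val df4 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "employees")
  .option("user", "user")
  .option("password", "password")
  .load()

// 3b. From a Hive table (requires enableHiveSupport() on the SparkSession)
val df5 = spark.table("default.employees")
```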
6. How to create dataset?
Create a DataFrame and use the as method with a case class to convert it to a typed Dataset.
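For example (a sketch, assuming a SparkSession named spark and a hypothetical Person case class):

```scala
import spark.implicits._

// Hypothetical case class whose fields match the DataFrame's columns
case class Person(name: String, age: Int)

// Create a DataFrame, then convert it to a typed Dataset with as[T]
val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
val ds: org.apache.spark.sql.Dataset[Person] = df.as[Person]
```

The column names and types of the DataFrame must line up with the case class fields, otherwise as[Person] fails at analysis time.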
7. Why to use Spark over MapReduce?
– An advanced DAG (Directed Acyclic Graph) execution engine that supports acyclic data flow.
– Intermediate results are kept in memory. (In cases where memory is not enough, part of the intermediate result is written to disk; even then, Spark is faster compared to MapReduce.)
– Spark can run in Standalone, YARN, Mesos, or Kubernetes mode.
– Spark is easy to learn and program in.
– In MapReduce there are a lot of disk I/O operations to store intermediate results, which makes MapReduce much slower compared to Spark.
– MapReduce can only run in YARN mode because it is part of Hadoop.
– In MapReduce, SQL, streaming, or machine learning processing cannot be done without the help of the Hive, Storm, or Mahout frameworks.
8. Differences between yarn client mode and cluster mode?
Client Mode:
-In client mode the Spark driver is launched on the node where the spark-submit command is issued, which can impact the performance of that node if any other task is running, or scheduled to run, there.
-Mostly we will not launch our jobs in client mode. If we want to access a file from the local file system, we launch Spark in client mode (testing).
-For testing and debugging an application, we launch it in client mode.
-For interactive shells such as spark-shell, pyspark, and Jupyter notebooks, the driver runs on the client machine.
Cluster Mode:
-In cluster mode, when the spark-submit command is issued, Spark chooses one of the nodes of the cluster as the driver node.
-In production, all jobs run in cluster mode.
9. Explain Spark Architecture?
Check out this link – Spark Architecture
10. What is spark lineage?
• Basically, in Spark all the dependencies between RDDs are logged in a graph, rather than the actual data. This is what we call the lineage graph in Spark.
• RDD – Resilient Distributed Dataset, where Resilient means fault-tolerant. By using the RDD lineage graph we can re-compute a missing or damaged partition after a node failure.
• Whenever we create new RDDs on the basis of existing RDDs, Spark manages these dependencies using the lineage graph. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with the parent RDD.
• Use the toDebugString method to get the RDD lineage graph in Spark.
Basically, the logical execution plan starts with the earliest RDDs – those which are not dependent on any other RDDs.
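For example, toDebugString can be used like this (a sketch, assuming spark-shell with its built-in SparkContext sc):

```scala
// Build a small chain of RDD transformations
val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 4 == 0)

// Print the lineage graph; each indented level is a parent dependency,
// ending at the earliest RDD (the ParallelCollectionRDD from parallelize)
println(rdd.toDebugString)
```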