Difference between DataFrame and Dataset in Apache Spark

Feature: DataFrame vs. Dataset

Spark Release
DataFrame: Introduced in Spark 1.3.
Dataset: Introduced in Spark 1.6.

Data Representation
DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database.
Dataset: It is an extension of the DataFrame API that combines the type-safe, object-oriented programming interface of the RDD API with the performance benefits of the Catalyst query optimizer and the off-heap storage mechanism of the DataFrame API.

Data Formats
DataFrame: It can process structured and semi-structured data efficiently. It organizes the data into named columns, which allows Spark to manage the schema.
Dataset: It also processes structured and semi-structured data efficiently. It represents data as JVM objects of a row, or as a collection of row objects, which are mapped to a tabular form through encoders.

Data Sources API
DataFrame: The Data Source API allows data to be processed in different formats (Avro, CSV, JSON) and from different storage systems (HDFS, Hive tables, MySQL). It can read from and write to all of these data sources.
Dataset: The Dataset API also supports data from different sources.

Immutability and Interoperability
DataFrame: After data has been transformed into a DataFrame, the domain object cannot be regenerated. For example, if you generate testDF from testRDD, you won't be able to recover the original RDD of the test class.
Dataset: It overcomes this limitation of the DataFrame: the typed RDD can be regenerated from a Dataset. Datasets also allow you to convert your existing RDDs and DataFrames into Datasets.

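The conversions described in this row can be sketched in Scala as follows. This is a minimal illustration, assuming Spark 2.x or later running in local mode; the `Test` case class and the `InteropSketch` object are hypothetical names chosen to mirror the testRDD and testDF example in the text.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical domain class standing in for the "test class" in the text.
case class Test(id: Int, name: String)

object InteropSketch {
  // Converts RDD[Test] -> DataFrame -> Dataset[Test] -> RDD[Test].
  def roundTrip(spark: SparkSession): Seq[Test] = {
    import spark.implicits._

    val testRDD: RDD[Test] =
      spark.sparkContext.parallelize(Seq(Test(1, "a"), Test(2, "b")))

    // RDD -> DataFrame: the element type becomes the generic Row.
    val testDF: DataFrame = testRDD.toDF()
    val untyped: RDD[Row] = testDF.rdd // RDD[Row], not RDD[Test]

    // DataFrame -> Dataset: an encoder maps the columns back onto Test.
    val testDS: Dataset[Test] = testDF.as[Test]
    val recovered: RDD[Test] = testDS.rdd // a typed RDD again

    recovered.collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("interop-sketch").getOrCreate()
    roundTrip(spark).foreach(println)
    spark.stop()
  }
}
```

The point of the sketch is the type of each intermediate value: `testDF.rdd` yields generic `Row` objects, while `testDS.rdd` yields `Test` objects.
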
Compile-time type safety
DataFrame: If you try to access a column that does not exist in the table, the DataFrame API does not raise a compile-time error. It detects the attribute error only at runtime.
Dataset: It provides compile-time type safety.

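A small Scala sketch may make the difference concrete. It assumes Spark 2.x or later in local mode; the `Person` case class, the misspelled column name `agee`, and the `TypeSafetySketch` object are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Hypothetical domain class used only for this illustration.
case class Person(name: String, age: Long)

object TypeSafetySketch {
  // Returns true when selecting a misspelled column fails at runtime.
  def missingColumnFailsAtRuntime(spark: SparkSession): Boolean = {
    import spark.implicits._
    val ds = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()
    val df = ds.toDF()

    // DataFrame: column names are plain strings, so this line compiles
    // and the mistake surfaces only when Spark analyzes the query.
    val attempt = Try(df.select("agee"))

    // Dataset: field access is checked by the Scala compiler.
    // ds.map(_.agee) // would not compile: agee is not a member of Person
    ds.map(_.age).collect()

    attempt.failed.toOption.exists(_.isInstanceOf[AnalysisException])
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("type-safety-sketch").getOrCreate()
    println(missingColumnFailsAtRuntime(spark))
    spark.stop()
  }
}
```

The commented-out line is left in place on purpose: uncommenting it should stop the build, which is the compile-time check the Dataset column refers to.
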
Optimization
DataFrame: Optimization takes place using the Catalyst optimizer. DataFrames use the Catalyst tree transformation framework in four phases:
–  Analyzing a logical plan to resolve references.
–  Logical plan optimization.
–  Physical planning.
–  Code generation to compile parts of the query to Java bytecode.
Dataset: It uses the same Catalyst optimizer as the DataFrame to optimize the query plan.

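The plans produced by the first three phases can be inspected from user code. The following Scala sketch assumes Spark 2.x or later in local mode; the sample data and the `CatalystSketch` object are hypothetical. It does not show the generated bytecode of the fourth phase.

```scala
import org.apache.spark.sql.SparkSession

object CatalystSketch {
  // Returns the analyzed, optimized, and physical plans as strings.
  def plans(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq(("Ann", 31), ("Bob", 17)).toDF("name", "age")
    val query = df.filter($"age" > 18).select($"name")

    val qe = query.queryExecution
    Seq(
      qe.analyzed.toString,      // logical plan with references resolved
      qe.optimizedPlan.toString, // logical plan after optimization
      qe.executedPlan.toString   // physical plan chosen for execution
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("catalyst-sketch").getOrCreate()
    plans(spark).foreach(println)
    spark.stop()
  }
}
```

Calling `explain(true)` on the query prints the same plans in one step, which is usually enough when exploring how a query was optimized.
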
Serialization
DataFrame: A Spark DataFrame can serialize data into off-heap storage (in memory) in binary format and then perform many transformations directly on this off-heap memory, because Spark understands the schema. There is no need to use Java serialization to encode the data. It provides the Tungsten physical execution backend, which explicitly manages memory and dynamically generates bytecode for expression evaluation.
Dataset: For serialization, the Dataset API has the concept of an encoder, which handles conversion between JVM objects and the tabular representation. It stores the tabular representation in Spark's internal Tungsten binary format. A Dataset allows operations to be performed on serialized data, which improves memory use, and it allows on-demand access to an individual attribute without deserializing the entire object.

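The encoder idea can be sketched in Scala as follows, assuming Spark 2.x or later in local mode. The `Person` case class and the `EncoderSketch` object are hypothetical names; the sketch shows only the schema an encoder derives and a column-level read, not the Tungsten binary layout itself.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical domain class used only for this illustration.
case class Person(name: String, age: Long)

object EncoderSketch {
  // The encoder derives a tabular schema from the case class fields.
  def schemaFields: Seq[String] = {
    val encoder: Encoder[Person] = Encoders.product[Person]
    encoder.schema.fieldNames.toSeq
  }

  // Reads a single attribute as a typed column of the encoded data.
  def names(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val people: Dataset[Person] =
      Seq(Person("Ann", 31), Person("Bob", 17)).toDS()
    people.select($"name".as[String]).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("encoder-sketch").getOrCreate()
    println(schemaFields)
    println(names(spark))
    spark.stop()
  }
}
```

Selecting `$"name".as[String]` addresses one column of the tabular representation, in contrast to `people.map(_.name)`, which works on `Person` objects.
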
Efficiency/Memory use
DataFrame: Using off-heap memory for serialization reduces overhead. Bytecode is generated dynamically so that many operations can be performed on the serialized data, and small operations need no deserialization.
Dataset: It allows operations to be performed on serialized data, which improves memory use. It also allows on-demand access to an individual attribute without deserializing the entire object.

Garbage Collection
DataFrame: It avoids the garbage collection costs of constructing an individual object for each row in the data set.
Dataset: There is also no need for the garbage collector to destroy objects, because serialization takes place through Tungsten, which uses off-heap data serialization.

Lazy Evaluation
DataFrame: Spark evaluates a DataFrame lazily, which means computation happens only when an action is called (for example, displaying a result or saving output).
Dataset: It is also evaluated lazily, like an RDD and a DataFrame.

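Lazy evaluation can be observed with an accumulator that counts how many rows a transformation has touched. The Scala sketch below assumes Spark 2.x or later in local mode; the accumulator name and the `LazySketch` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LazySketch {
  // Returns the accumulator value before and after an action runs.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val seen = spark.sparkContext.longAccumulator("rows seen")

    // Transformation only: this builds a plan and processes no rows.
    val doubled = Seq(1, 2, 3).toDS().map { x => seen.add(1); x * 2 }
    val before: Long = seen.value // still 0, nothing has executed yet

    // Action: collect() triggers the actual computation.
    doubled.collect()
    val after: Long = seen.value // greater than 0 once the job has run

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("lazy-sketch").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

The same behavior applies to a DataFrame: calls such as `filter` or `select` return immediately, and work starts only with an action such as `collect`, `count`, or `show`.
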
Programming Language Support
DataFrame: It has APIs in several languages: Java, Python, Scala, and R.
Dataset: The Dataset API is currently available only in Scala and Java. As of Spark 2.1.1, it does not support Python and R.

Schema Projection
DataFrame: It auto-discovers the schema from the files and exposes them as tables through the Hive metastore.
Dataset: It auto-discovers the schema of the files because it uses the Spark SQL engine.

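Schema discovery can be sketched in Scala as follows, assuming Spark 2.2 or later in local mode. To stay self-contained, the sketch reads JSON records from an in-memory Dataset instead of files and registers a temporary view instead of a Hive metastore table; the sample records, the view name `people`, and the `SchemaSketch` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SchemaSketch {
  // Infers a schema from JSON records and returns the column names.
  def inferredFields(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val json = Seq(
      """{"name":"Ann","age":31}""",
      """{"name":"Bob","age":17}"""
    ).toDS()

    // No schema is supplied: Spark SQL infers it from the data.
    val people = spark.read.json(json)
    people.printSchema()

    // Expose the inferred schema to SQL through a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 18").show()

    people.schema.fieldNames.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("schema-sketch").getOrCreate()
    println(inferredFields(spark))
    spark.stop()
  }
}
```

With files on disk, the same inference happens when a path is passed to `spark.read.json`.
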
Aggregation
DataFrame: The DataFrame API is very easy to use. It is faster for exploratory analysis and for creating aggregated statistics on large data sets.
Dataset: With a Dataset, aggregation operations on large data sets are fast as well.

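The two aggregation styles can be compared in a short Scala sketch, assuming Spark 2.x or later in local mode. The `Sale` case class, its sample values, and the `AggregationSketch` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical domain class used only for this illustration.
case class Sale(region: String, amount: Double)

object AggregationSketch {
  // Computes the average amount per region in both API styles.
  def averages(spark: SparkSession): (Map[String, Double], Map[String, Double]) = {
    import spark.implicits._
    val sales = Seq(Sale("east", 10.0), Sale("east", 30.0), Sale("west", 5.0)).toDS()

    // DataFrame style: columns are referenced by name.
    val viaDataFrame = sales.toDF()
      .groupBy("region")
      .agg(avg("amount").as("avg_amount"))
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap

    // Dataset style: typed grouping with ordinary Scala functions.
    val viaDataset = sales
      .groupByKey(_.region)
      .mapGroups { (region, rows) =>
        val amounts = rows.map(_.amount).toList
        (region, amounts.sum / amounts.size)
      }
      .collect()
      .toMap

    (viaDataFrame, viaDataset)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("aggregation-sketch").getOrCreate()
    println(averages(spark))
    spark.stop()
  }
}
```

The column-based form leaves the aggregation entirely to Spark SQL, while `mapGroups` runs a user function over the objects of each group; which one performs better depends on the workload.
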
Usage Area
DataFrame and Dataset: Use either the DataFrame or the Dataset API when you need a high level of abstraction, domain-specific APIs, or high-level expressions such as filters, maps, aggregations, sums, SQL queries, and columnar access. Choose the Dataset API in particular if you want a higher degree of type safety at compile time. By contrast, the lower-level RDD API is the better fit for unstructured data such as media streams or streams of text, for manipulating data with functional programming constructs rather than domain-specific expressions, and for processing that does not need to impose a schema, such as a columnar format, or to access data attributes by name or column.

An ambivert, music lover, enthusiast, artist, designer, coder, gamer, and content writer. He is a professional software developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, PHP, HTML, and CSS. Know more about him at www.saikumar.me
