Spark SQL:
Spark SQL is a Spark module for structured data processing.
One use of Spark SQL is to execute SQL queries using basic SQL syntax.
There are several ways to interact with Spark SQL,
including SQL, the DataFrame API and the Dataset API.
The backbone of all these operations is DataFrames and SchemaRDDs.
DataFrames
A DataFrame is a distributed collection of data organised into named columns. It is conceptually
equivalent to a table in a relational database.
SchemaRDD
SchemaRDDs are made up of Row objects along with schema metadata.
Spark SQL needs an SQLContext object, which is created from an existing SparkContext.
Steps for creating DataFrames and SchemaRDDs and performing some operations using the sql method provided by
sqlContext:
Step-1
Start the Spark shell using the following command:
./bin/spark-shell
Step-2
Import the following packages.
SQLContext is the entry point for working with structured data:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
The import below is used to implicitly convert an RDD to a
DataFrame:
import sqlContext.implicits._
Import Spark SQL data types and Row:
import org.apache.spark.sql._
Step-3
Load the data into a new RDD:
val data = sc.textFile("your file name")
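For the steps that follow, the input is assumed to be a plain comma-separated text file with one product per line; the values below are purely illustrative:

```
1,Laptop,5
2,Mouse,3
3,Keyboard,4
```

This layout is what the split(",") in Step-5 expects.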
Step-4
Define the schema for your dataset using a case class.
For example, if your dataset contains a product id, product name and product rating, the schema should be
defined as:
case class Product(productid: Int, product_name: String, product_rating: Int)
If you want to check your schema, you can use printSchema(), which prints the schema to the console in a tree format.
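As a sketch, once the DataFrame has been built in Step-6, the schema check looks like this (assuming the Product case class above; the exact nullability flags depend on the field types, since primitive Int fields cannot be null):

```scala
// Inspect the schema Spark SQL inferred from the Product case class.
product.printSchema()
// Prints a tree similar to:
// root
//  |-- productid: integer (nullable = false)
//  |-- product_name: string (nullable = true)
//  |-- product_rating: integer (nullable = false)
```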
Step-5
Create an RDD of Product objects:
val productData = data.map(x => x.split(",")).
  map(p => Product(p(0).toInt, p(1), p(2).toInt))
Step-6
Convert the productData RDD of Product objects to a
DataFrame.
A DataFrame is a distributed collection of data
organised into named columns. Spark SQL supports
automatically converting an RDD containing case classes
to a DataFrame with the toDF() method:
val product = productData.toDF()
Step-7
Explore and query the product data with Spark DataFrames.
DataFrames provide a domain-specific language for
structured data manipulation in Scala, Java and Python.
Check the basic Spark SQL practical for examples.
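For instance, a few typical DataFrame operations on the product DataFrame from Step-6 might look like this (column names follow the Product case class above; this is a sketch, not output from a real run):

```scala
// Show the first rows of the DataFrame in tabular form.
product.show()

// Select a single column.
product.select("product_name").show()

// Filter rows using a column expression.
product.filter(product("product_rating") > 3).show()

// Group by a column and count the rows in each group.
product.groupBy("product_rating").count().show()
```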
Step-8
Register the DataFrame as a temporary table, e.g.:
product.registerTempTable("product")
Now execute SQL statements against it, e.g.:
val results = sqlContext.sql("select * from product")
results.show()
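The temp table can also be queried with more selective SQL. As a sketch, again assuming the product schema defined above:

```scala
// SQL with a WHERE clause against the registered temp table.
val topRated = sqlContext.sql("select product_name from product where product_rating >= 4")
topRated.show()

// Aggregate query: count products per rating.
val counts = sqlContext.sql("select product_rating, count(*) as cnt from product group by product_rating")
counts.show()

// Bring results back to the driver as an Array[Row].
val rows = topRated.collect()
```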