Spark SQL:
Spark SQL is a Spark module for structured data processing.
One use of Spark SQL is to execute SQL queries using basic SQL syntax.
There are several ways to interact with Spark SQL,
including SQL, the DataFrame API and the Dataset API.
The backbone of all these operations is DataFrames and SchemaRDDs.
DataFrames
A DataFrame is a distributed collection of data organised into named columns. It is conceptually
equivalent to a table in a relational database.
SchemaRDD
SchemaRDDs are made up of Row objects along with schema metadata.
Spark SQL needs an SQLContext object, which is created from an existing SparkContext.
Steps for creating DataFrames and SchemaRDDs and performing some operations using the sql method provided by
sqlContext:
Step-1
Start the Spark shell using the following command:
./bin/spark-shell
Step-2
Import the following packages.
SQLContext is the entry point for working with structured data:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
The import below is used to implicitly convert an RDD to a
DataFrame:
import sqlContext.implicits._
Import Spark SQL data types and Row:
import org.apache.spark.sql._
Step-3
Load the data into a new RDD:
val data = sc.textFile("your file name")
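For the steps that follow, the input is assumed to be a plain comma-separated text file with one product per line; the values below are purely illustrative:

```
1,Laptop,5
2,Mouse,3
3,Keyboard,4
```

This layout is what the split(",") in Step-5 expects.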
Step-4
Define the schema for your dataset using a case class.
For example, if your dataset contains a product id, product name and product rating, the schema should be
defined as:
case class Product(productid: Int, product_name: String, product_rating: Int)
If you want to check your schema, you can use printSchema(), which prints the schema to the console in a tree format.
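As a sketch, once the DataFrame has been built in Step-6, the schema check looks like this (assuming the Product case class above; the exact nullability flags depend on the field types, since primitive Int fields cannot be null):

```scala
// Inspect the schema Spark SQL inferred from the Product case class.
product.printSchema()
// Prints a tree similar to:
// root
//  |-- productid: integer (nullable = false)
//  |-- product_name: string (nullable = true)
//  |-- product_rating: integer (nullable = false)
```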
Step-5
Create an RDD of Product objects:
val productData = data.map(x => x.split(",")).
  map(p => Product(p(0).toInt, p(1), p(2).toInt))
Step-6
Convert the productData RDD of Product objects to a
DataFrame.
A DataFrame is a distributed collection of data
organised into named columns. Spark SQL supports
automatically converting an RDD containing case classes
to a DataFrame with the toDF() method:
val product = productData.toDF()
Step-7
Explore and query the product data with Spark DataFrames.
DataFrames provide a domain-specific language for
structured data manipulation in Scala, Java and Python.
Check the basic Spark SQL practical for examples.
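For instance, a few typical DataFrame operations on the product DataFrame from Step-6 might look like this (column names follow the Product case class above; this is a sketch, not output from a real run):

```scala
// Show the first rows of the DataFrame in tabular form.
product.show()

// Select a single column.
product.select("product_name").show()

// Filter rows using a column expression.
product.filter(product("product_rating") > 3).show()

// Group by a column and count the rows in each group.
product.groupBy("product_rating").count().show()
```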
Step-8
Register the DataFrame as a temporary table, e.g.:
product.registerTempTable("product")
Now execute SQL statements against it, e.g.:
val results = sqlContext.sql("select * from product")
results.show()
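The temp table can also be queried with more selective SQL. As a sketch, again assuming the product schema defined above:

```scala
// SQL with a WHERE clause against the registered temp table.
val topRated = sqlContext.sql("select product_name from product where product_rating >= 4")
topRated.show()

// Aggregate query: count products per rating.
val counts = sqlContext.sql("select product_rating, count(*) as cnt from product group by product_rating")
counts.show()

// Bring results back to the driver as an Array[Row].
val rows = topRated.collect()
```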