Creating DataFrame from Scala List or Sequence
In some cases, in order to test our business logic, we need a DataFrame, and in most cases we would have created it from a sample file. Instead of doing that, we can create a List of our sample data and convert it to a DataFrame, as in the sketch below. Note: spark.implicits._ is available in spark-shell by default. If we want to test in an IDE, we should import spark.implicits._ explicitly.
From CSV Source
From Parquet Source
From Avro Source
From JSON Source
Using Spark StructType schema to create DataFrame on File Sources
Using Spark StructType JSON Schema to create DataFrame on File Sources
In some cases we may require an external StructType schema. In such cases we can define the StructType as JSON, store it as a file, and during ru...
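A minimal sketch of the List-to-DataFrame approach, assuming a local SparkSession and made-up sample columns (empID, name, city):

```scala
import org.apache.spark.sql.SparkSession

object ListToDataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ListToDataFrameExample")
      .master("local[*]")          // local master assumed, for testing only
      .getOrCreate()

    // Needed outside spark-shell (e.g. in an IDE) for the toDF() syntax
    import spark.implicits._

    // Hypothetical sample data standing in for a file-based source
    val empData = Seq(
      (1, "Alice", "Bangalore"),
      (2, "Bob", "Chennai")
    )

    val empDF = empData.toDF("empID", "name", "city")
    empDF.show()
  }
}
```

The toDF() conversion only compiles when spark.implicits._ is in scope, which is why the explicit import matters when running from an IDE rather than spark-shell.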
The Apache Spark SQL component comes with the Catalyst optimizer, which optimizes jobs by re-arranging the order of transformations and by applying specialized joins suited to the datasets involved. Spark chooses these joins internally, or you can force it to use a particular one. It is worth knowing this topic so it can come to the rescue when you optimize jobs for your use case.
Shuffle Hash Join
A shuffle hash join shuffles the data based on the join keys and then performs the join. It ensures that each partition contains the same keys by partitioning the second dataset with the same default partitioner as the first, so that keys with the same hash value from both datasets end up in the same partition. It follows the classic map-reduce pattern: First ...
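A small sketch of forcing this join strategy, assuming Spark 3.x where the shuffle_hash join hint is available (the table and column names here are made up):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleHashJoinHintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShuffleHashJoinHintExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical datasets joined on a shared "id" key
    val orders    = Seq((1, 100.0), (2, 250.0)).toDF("id", "amount")
    val customers = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

    // Ask Catalyst to prefer a shuffle hash join over the default sort-merge join
    val joined = orders.join(customers.hint("shuffle_hash"), Seq("id"))

    joined.explain()  // the physical plan should show a ShuffledHashJoin node
    joined.show()
  }
}
```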
How to flatten a whole JSON containing ArrayType and StructType fields?
Spark has no predefined function to flatten a JSON document completely, so we can write our own. The function accepts a DataFrame and, for each field, inspects its DataType. If the field is of ArrayType, it creates a new column by exploding the array column with Spark's explode_outer function. If the field is of StructType, it creates a new parentfield_childfield column for each field inside the struct. The function is recursive: once it finds no remaining ArrayType or StructType fields, it returns the flattened DataFrame; otherwise, it iterates through the schema again to completely flatten out the JSON...
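A minimal sketch of such a recursive flatten function; the helper name flatten and its structure are illustrative, not the post's exact code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

object FlattenJson {

  // Recursively flattens all ArrayType and StructType columns of a DataFrame.
  def flatten(df: DataFrame): DataFrame = {
    // Find the first complex (array or struct) field in the current schema
    val complexField = df.schema.fields.find(f =>
      f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType])

    complexField match {
      case None => df  // no ArrayType or StructType left: return the flattened DataFrame
      case Some(field) =>
        val flattened = field.dataType match {
          // Arrays: explode_outer keeps rows even when the array is null or empty
          case _: ArrayType =>
            df.withColumn(field.name, explode_outer(col(field.name)))
          // Structs: promote each child to a parentfield_childfield column
          case st: StructType =>
            val childCols = st.fieldNames.map(child =>
              col(s"${field.name}.$child").alias(s"${field.name}_$child"))
            val otherCols = df.columns.filter(_ != field.name).map(col)
            df.select(otherCols ++ childCols: _*)
        }
        flatten(flattened)  // recurse until no complex fields remain
    }
  }
}
```

Exploded arrays of structs are handled on the next pass: the explode turns the array column into a struct column, which the struct branch then unpacks.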
There is no direct library for creating a DataFrame on an HBase table the way we read a Hive table with Spark SQL. This post shows how to create a DataFrame on top of an HBase table. You need to add the hbase-client dependency to achieve this; below is the link to get it: https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/2.1.0. Let's say the HBase table is 'emp' with rowKey 'empID' and the columns 'name' and 'city' under a column family named 'metadata'. A case class, EmpRow, is used to give structure to the DataFrame. newAPIHadoopRDD is the Spark API for creating an RDD on HBase; the configuration needs to be passed as shown below. The DataFrame is created when you map this RDD onto the case class. ...
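A condensed sketch of this approach, assuming a local ZooKeeper quorum and the emp/metadata layout described above; note that TableInputFormat may also require the hbase-mapreduce (or hbase-server) artifact in addition to hbase-client:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Case class matching the 'emp' table layout from the post
case class EmpRow(empID: String, name: String, city: String)

object HBaseToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HBaseToDataFrame")
      .getOrCreate()
    import spark.implicits._

    // HBase configuration: the ZooKeeper quorum value is an assumption
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set(TableInputFormat.INPUT_TABLE, "emp")

    // newAPIHadoopRDD yields (rowKey, Result) pairs, one per HBase row
    val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Map each Result onto the case class, reading from the 'metadata' column family
    val empDF = hbaseRDD.map { case (key, result) =>
      EmpRow(
        Bytes.toString(key.get()),
        Bytes.toString(result.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("name"))),
        Bytes.toString(result.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("city"))))
    }.toDF()

    empDF.show()
  }
}
```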