How to flatten JSON in Spark Dataframe


How to flatten whole JSON containing ArrayType and StructType in it?

  • In order to flatten a JSON completely we don’t have any predefined function in Spark. We can write our own function that will flatten out JSON completely.
  • We will write a function that will accept DataFrame. For each field in the DataFrame we will get the DataType.
    1. If the field is of ArrayType we will create new column with exploding the ArrayColumn using Spark explode_outer function.
    2. If the field is of StructType we will create new column with parentfield_childfield for each field in the StructType Field.
  • This is a recursive function. Once the function doesn’t find any ArrayType or StructType. It will return the flattened DataFrame. Otherwise, It will it iterate through the schema to completely flatten out the JSON.

Sample JSON:
“name”:”Bajaj”, “models”:[“Dominor”, “Pulsar”]
“cars”: [
{ “name”:”Ford”, “models”:[ “Fiesta”, “Focus”, “Mustang” ] },
{ “name”:”BMW”, “models”:[ “320”, “X3”, “X5” ] },
{ “name”:”Fiat”, “models”:[ “500”, “Panda” ] }

Schema of JSON DataFrame Before Flattening:

scala> jsonDF.printSchema
 |-- age: long (nullable = true)
 |-- bike: struct (nullable = true)
 |    |-- models: array (nullable = true)
 |    |    |--element:string(containsNull=true)
 |    |-- name: string (nullable = true)
 |-- cars: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- models: array (nullable = true)
 |    |    |    |--element:string(containsNull=true)
 |    |    |-- name: string (nullable = true)
 |-- name: string (nullable = true)
  • Here we can see that Bike Field is StructType with models and name as child fields in it. When we apply our function over this DataFrame. We will get new fields of bike_models and bike_name. Similary all the fields with StructType will create new fields.
  • models field inside the bike field is of ArrayType. This field will be exploded for the values in it. In our case we have two models for Bajaj i.e. Dominor, Pulsar. Exploded column will be like
    • Before Exploding
Bajaj[“Dominor”, “Pulsar”]


  • After Exploding


  • Similarly all the fields with ArrayType will be exploded.

Schema of JSON DataFrame After Flattening

scala> flattenDataframe(jsonDF).printSchema
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- bike_name: string (nullable = true)
 |-- bike_models: string (nullable = true)
 |-- cars_name: string (nullable = true)
 |-- cars_models: string (nullable = true)

DataFrame Before Flattening
DataFrame After Flattening

Below is the program which does the purpose –

Share This Post

He is an Enthusiastic, Music Lover, Gadget Freek. He loves to learn and explore new technologies. He is an Software Developer with hands on experience in Hadoop, Scala, Spark, Shell Scripting, Hive and Oracle PL-SQL.

Lost Password


24 Tutorials