
How to generate DDL (CREATE statement) with columns using Python [code snippets]

Data loading is the first step in the Big Data analytics world: you are supposed to push all of the data into Hadoop before you can start working on analytics. When loading data into a Hadoop environment, you will often receive it in the form of flat files.

Once the data is loaded, you need to create a Hive table on top of it before you can view or query it, so writing a DDL statement is unavoidable. Normally you would have to open the file, read out the column names, and then write the DDL by hand. This tutorial helps you get rid of that manual work so you can generate the DDL dynamically, in a single step, with Python.

Let’s say we have the incoming data file as shown below –

Name|ID|ContactInfo|Date_emp
Michael|100|547-968-091|2014-01-29
Will|101|527-948-090|2013-10-02
Steven|102|647-968-598|2012-11-03
Lucy|103|577-928-094|2010-01-03

The first line of this file is the header, i.e., the schema required to create the table.
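As a quick illustration (a minimal sketch of what the full script below does), splitting that header line on the pipe delimiter gives the list of column names; note that the last element still carries the trailing newline, which is why each name is stripped later:

import re

line = 'Name|ID|ContactInfo|Date_emp\n'  # first line of the sample file
header = re.split(r'\|', line)
print(header)  # ['Name', 'ID', 'ContactInfo', 'Date_emp\n']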

Below is the code to get the columns from the first line of this file and generate the DDL file with Python.

#Essential libraries
import re
import sys

#Create File method. Takes 2 parameters: 1) file name 2) data to write
def createFile(fName, data):
    with open(fName, 'w') as hql:
        hql.write(data)
    return "HQL Created"

inputFile = sys.argv[1]  #File from which the column names are to be extracted
tableName = sys.argv[2]  #Name of the Hive table to create

f = open(inputFile, 'r')
print(f.name)
line = f.readline()  #Retrieves the first line of the input file
header = re.split(r'\|', line)  #Splits the line on the delimiter into a list of column names
f.close()

statement_prefix = "CREATE TABLE dbName." + tableName + "("
for column in header:
    #Appends every column from the header with data type STRING
    statement_prefix = statement_prefix + '\n{} STRING,'.format(column.upper().strip())

statement_suffix = ("\n" + "COMMENT 'This is an example table'" + "\n"
                    + "ROW FORMAT DELIMITED" + "\n"
                    + "FIELDS TERMINATED BY '|'" + "\n"
                    + "STORED AS TEXTFILE" + "\n"
                    + "LOCATION '/etlhive';")

statement_hql = statement_prefix[:-1] + "\n)" + statement_suffix  #Drop the trailing comma and merge everything into one string

createFile(tableName + ".hql", statement_hql)  #Call the createFile method to write the .hql file

Usage of the above script –

Save this script to a file, here createDDL.py, and run it as follows:

python createDDL.py <input file name with path> <table name>

e.g., python createDDL.py /home/user/emp.txt EMPLOYEETABLE

This generates an .hql file named after the table name you passed, containing the CREATE TABLE statement.
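For the sample file and command above, the generated EMPLOYEETABLE.hql would look like this (the database name dbName and the location '/etlhive' are hard-coded in the script, so adjust them for your environment):

CREATE TABLE dbName.EMPLOYEETABLE(
NAME STRING,
ID STRING,
CONTACTINFO STRING,
DATE_EMP STRING
)
COMMENT 'This is an example table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/etlhive';

You can then run it against Hive with, for example, hive -f EMPLOYEETABLE.hql.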

Hope this helps!!
