Archives

How to generate DDL(create statement) with columns using Python[code snippets]

Data loading is the first step in Big Data analytics: you push all the data to Hadoop, and only then can you start working on analytics. When loading data into a Hadoop environment, you will sometimes receive it as flat files. Once the data is loaded, viewing or querying it requires a Hive table on top of that data, and creating the Hive table requires DDL. In practice, you have to check the file, read off the column names, and write the DDL manually. This tutorial removes that manual work by generating DDLs dynamically with Python. Let's say we have the incoming data file as shown below:

    Name|ID|ContactInfo|Date_emp
    Michael|100|547-968-091|2014...
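The excerpt above stops before the tutorial's own code, so the following is only a minimal sketch of the idea, not the author's script. It assumes the first line of the file holds the column names, types every column as STRING, and uses a made-up table name (`employee_raw`) and helper name (`build_hive_ddl`).

```python
def build_hive_ddl(header_line, table_name, delimiter="|"):
    """Build a Hive CREATE TABLE statement from a delimited header line."""
    # Column names come straight from the header; every type is assumed STRING.
    columns = [name.strip() for name in header_line.strip().split(delimiter)]
    column_defs = ",\n".join(f"  `{name}` STRING" for name in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n"
        f"{column_defs}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        f"FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE;"
    )


# Header taken from the sample file; "employee_raw" is a hypothetical table name.
ddl = build_hive_ddl("Name|ID|ContactInfo|Date_emp", "employee_raw")
print(ddl)
```

In a real load you would likely map some columns to types other than STRING; that mapping is left out here because the source does not describe it.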

All about Python Classes – Demo with examples

Python classes are all types. Topics covered: Class Definitions, Class Initialization, Class Methods.
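As a small illustration of the three topics listed above, here is a sketch with a hypothetical `Employee` class. The names and the `from_record` helper are assumptions made for this example and do not come from the tutorial itself.

```python
class Employee:
    """Class definition: a blueprint for employee records."""

    company = "Example Corp"  # class attribute shared by all instances

    def __init__(self, name, emp_id):
        # Class initialization: runs when Employee(...) is called.
        self.name = name
        self.emp_id = emp_id

    def describe(self):
        # Instance method: works on one object through self.
        return f"{self.name} ({self.emp_id}) at {self.company}"

    @classmethod
    def from_record(cls, record, delimiter="|"):
        # Class method: an alternative constructor that receives the class.
        name, emp_id = record.split(delimiter)[:2]
        return cls(name, int(emp_id))


emp = Employee.from_record("Michael|100")
print(emp.describe())
print(isinstance(Employee, type))  # a class is itself an instance of type
```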

Deep dive into Partitioning in Spark – Hash Partitioning and Range Partitioning

Partitions: The data within an RDD is split into several partitions.

Properties of partitions:
- Partitions never span multiple machines, i.e., tuples in the same partition are guaranteed to be on the same machine.
- Each machine in the cluster contains one or more partitions.
- The number of partitions to use is configurable. By default, it equals the total number of cores on all executor nodes.

Two kinds of partitioning are available in Spark:
- Hash partitioning
- Range partitioning

Customizing the partitioning is only possible on Pair RDDs.

Hash partitioning: Given a Pair RDD that should be grouped:

    val purchasesPerCust = purchasesRdd.map(p => (p.customerId, p.price)) // Pair RDD
                                       .groupByKey()

groupByKey first computes per tuple (k, v) its partition p: p = k....
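The excerpt is cut off before the formula is complete, so the sketch below only illustrates the general idea of the two schemes in plain Python and is not Spark's API. It assumes integer keys, a hash partitioner that takes the key's hash modulo the number of partitions, and a range partitioner driven by a hand-picked, sorted list of upper bounds (Spark chooses its own bounds from the data).

```python
from bisect import bisect_left


def hash_partition(key, num_partitions):
    # Python's % returns a non-negative result for a positive divisor.
    return hash(key) % num_partitions


def range_partition(key, upper_bounds):
    # upper_bounds is sorted; partition i holds keys <= upper_bounds[i],
    # and the last partition holds everything above the final bound.
    return bisect_left(upper_bounds, key)


customer_ids = [7, 12, 25, 100]  # hypothetical keys
print([hash_partition(k, 4) for k in customer_ids])
print([range_partition(k, [10, 50]) for k in customer_ids])
```

Hash partitioning spreads keys without regard to order, while range partitioning keeps neighbouring keys together, which is why it suits keys with a natural ordering.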

Spark runtime Architecture – How Spark Jobs are executed

How Spark Jobs are Executed: A Spark application is a set of processes running on a cluster. All these processes are coordinated by the driver program.

The driver is:
- the process where the main() method of your program runs.
- the process running the code that creates a SparkContext, creates RDDs, and stages up or sends off transformations and actions.

The processes that run computations and store data for your application are executors.

Executors:
- Run the tasks that represent the application.
- Return computed results to the driver.
- Provide in-memory storage for cached RDDs.

Execution of a Spark program:
1. The driver program runs the Spark application, which creates a SparkContext upon start-up.
2. The SparkContext connects to a cluster manager (e.g., Mesos/YARN) which allocates resour...
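The division of labour between driver and executors can be mimicked in a few lines of plain Python. This is only an analogy under stated assumptions, not Spark code: a thread pool stands in for the executors, and the `driver` and `run_task` functions are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    # An "executor" runs one task on one partition and returns the result.
    return sum(x * x for x in partition)


def driver(data, num_partitions=3, num_executors=2):
    # The "driver" splits the data, schedules tasks, and combines the results.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, partitions))
    return sum(partial_results)


total = driver(list(range(10)))
print(total)
```

As in the description above, the workers only compute and hand results back; the coordinating code decides how the work is split and what happens with the partial results.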

CUT command in Unix/Linux with examples

Cut Command:
- CUT is used to process data in a file.
- It works only on files with column-formatted data.

Command 1: Display the character at a particular position

    cut -c3 file.txt

Command 2: Range of characters

    cut -c3-8 file.txt
    cut -c3- file.txt
    cut -c-10 file.txt

Command 3: Display columns after separation

    cut -d "|" -f2 file.txt
    cut -d "|" -f2-3 file.txt
    cut -d "|" -f2- file.txt

Command 4: Display all columns other than the given ones [--complement]

    cut -d "|" -f2 file.txt
    cut -d "|" --complement -f2 file.txt
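To make the field and character numbering concrete, here is a simplified Python model of what the commands above select from a single line. It is a sketch, not a reimplementation of cut: the `cut_fields` helper is hypothetical and ignores edge cases such as lines that contain no delimiter.

```python
def cut_fields(line, delimiter, first, last=None, complement=False):
    """Select 1-based fields first..last from a line, like cut -d -f."""
    fields = line.split(delimiter)
    last = len(fields) if last is None else last
    selected = range(first - 1, min(last, len(fields)))
    if complement:
        keep = [f for i, f in enumerate(fields) if i not in selected]
    else:
        keep = [fields[i] for i in selected]
    return delimiter.join(keep)


row = "Michael|100|547-968-091"
print(cut_fields(row, "|", 2, 2))                   # like -f2
print(cut_fields(row, "|", 2))                      # like -f2-
print(cut_fields(row, "|", 2, 2, complement=True))  # like --complement -f2
print(row[2:8])                                     # like -c3-8
```

Note that cut counts from 1 while Python slices count from 0, which is why `-c3-8` corresponds to `row[2:8]`.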

GREP command in Unix/Linux with examples

grep – Global Regular Expression Print
It is used to search data in one or more files.

Command 1: Search pattern in file

    grep hello file.txt
    grep sai file.txt file2.txt

Command 2: Search pattern in all files with the txt extension in the current folder

    grep 1000 *.txt

Command 3: Search data in all files in current folder (*.* matches only names that contain a dot; use * to match every file)

    grep 1000 *.*

Command 4: Search ignoring case [-i]

    grep "Sai" file.txt       # case sensitive by default
    grep -i "Sai" file.txt

Command 5: Display line number [-n]

    grep -n "124" result.txt

Command 6: Get only the names of files in which the data exists [-l]

    grep -l "100" *.*

Command 7: Search exact word [-w]

    grep -w Sai file.txt

Command 8: Search for lines that do not contain the data (reverse of search) [-v]

    grep -v "1000" file.txt

Command 9: Get one record before the search

    grep -B 1 "Msd" fi...
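The effect of the -i, -n, -w, and -v flags can be modelled on a list of strings in Python. This is a rough sketch with a hypothetical `grep` function and made-up sample lines; Python's `re` syntax is not identical to grep's, so treat it as an illustration of the flags rather than a drop-in equivalent.

```python
import re


def grep(pattern, lines, ignore_case=False, invert=False, word=False, number=False):
    """Return matching lines, mimicking a few grep flags on a list of strings."""
    if word:
        pattern = rf"\b(?:{pattern})\b"        # roughly -w: whole words only
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    results = []
    for line_no, line in enumerate(lines, start=1):
        matched = regex.search(line) is not None
        if matched != invert:                   # -v keeps the non-matching lines
            results.append(f"{line_no}:{line}" if number else line)
    return results


sample = ["Sai 1000", "sai 2000", "Saiyan 1000", "Msd 124"]
print(grep("Sai", sample))                      # case sensitive by default
print(grep("Sai", sample, ignore_case=True))    # like -i
print(grep("Sai", sample, word=True))           # like -w
print(grep("1000", sample, invert=True))        # like -v
print(grep("124", sample, number=True))         # like -n
```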

SED command in Unix/Linux with examples

SED – Stream Editor
Used to display and edit data. Editing options are insertion, update, and deletion.

2 Types of Operations
- Line Addressing
- Context Addressing

Line Addressing:

Command 1: Display a line multiple times

    sed '2p' file.txt
    sed -n '3p' file.txt      # -n prints only the specified line
    sed -n '5p' file.txt

Command 2: Display last line [$]

    sed '$p' file.txt         # prints the last line again along with the original output
    sed -n '$p' file.txt      # prints only the last line

Command 3: Range of lines

    sed -n '2,4p' file.txt

Command 4: Do not display specific lines [!]

    sed -n '2!p' file.txt
    sed -n '2,4!p' file.txt   # do not display a specific range of lines

Context Addressing:

Command 1: Display lines having a specific word

    sed -n '/Amit/p' file.txt
    sed -n '/[Aa]mi...
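The difference between line addressing and context addressing can be sketched in Python as follows. The `sed_print` helper and the sample rows are assumptions for illustration; it models only `sed -n` combined with the `p` command and the `!` negation.

```python
import re


def sed_print(lines, first=None, last=None, pattern=None, negate=False):
    """Mimic sed -n with p: print lines selected by number or by regex."""
    selected = []
    for line_no, line in enumerate(lines, start=1):
        if pattern is not None:
            hit = re.search(pattern, line) is not None   # context addressing
        else:
            end = first if last is None else last
            hit = first <= line_no <= end                # line addressing
        if hit != negate:                                # '!' inverts the address
            selected.append(line)
    return selected


rows = ["Amit", "Sumit", "amit", "Ravi", "Kiran"]
print(sed_print(rows, first=3))               # like sed -n '3p'
print(sed_print(rows, 2, 4))                  # like sed -n '2,4p'
print(sed_print(rows, 2, 4, negate=True))     # like sed -n '2,4!p'
print(sed_print(rows, first=len(rows)))       # like sed -n '$p'
print(sed_print(rows, pattern="Amit"))        # like sed -n '/Amit/p'
```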

All about AWK command in Unix – Part 1

AWK – select column data
- Search data in a file and print it on the console
- Find data of specific columns
- Format output data
- Used on files with bulk data for searching, conditional execution, updating, and filtering

Command 1: Print specific columns

    awk '{print $1}' file.txt             # default separator is whitespace (spaces or tabs)
    awk '{print $1 "--" $2}' file.txt

Command 2: Select all data from table

    awk '{print $0}' tabfile.txt

Command 3: Select columns from CSV

    awk -F "," '{print $1}' commafile.txt

1. Separating data using -F

    awk -F "," '{print $1}' commafile.txt

2. Using variable (FS)

    awk '{print $2}' FS="," commafile.txt

Command 4: Display content without displaying header of file

    awk 'NR!=1{print $1 " " $2...
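For readers more familiar with Python, the column selection above corresponds roughly to splitting each line and indexing the pieces. The sketch below uses a hypothetical `awk_columns` helper and made-up CSV rows; it covers only field selection, the separator, and skipping the first record.

```python
def awk_columns(lines, columns, separator=None, skip_header=False, joiner=" "):
    """Pick 1-based columns from each line, loosely like awk -F sep '{print $n}'."""
    output = []
    for record_no, line in enumerate(lines, start=1):    # record_no plays NR
        if skip_header and record_no == 1:               # like NR!=1
            continue
        fields = line.split(separator)                   # None splits on runs of whitespace
        # A missing field becomes an empty string, as it does in awk.
        picked = [fields[c - 1] if c <= len(fields) else "" for c in columns]
        output.append(joiner.join(picked))
    return output


table = ["name,salary", "Amit,3000", "Ravi,1500"]
print(awk_columns(table, [1], separator=","))
print(awk_columns(table, [1, 2], separator=",", skip_header=True, joiner="--"))
print(awk_columns(["a  b\tc"], [3]))
```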

All about AWK command in Unix – Part 2

Command 11: Find text at the start of line [ ^ ]

    awk -F "|" '$2 ~ /^s/{print $0}' tabfile.txt

Command 12: Find text at the end of line [ $ ]

    awk -F "|" '$2 ~ /n$/{print $0}' file1.txt

Command 13: Perform condition check using if

    awk -F "|" '{if ($3>2000) print $0;}' file2.txt

Command 14: Perform condition check using if-else

    awk -F "|" '{if($3>=20000) print $2; else print "*****" ; }' file2.txt

Command 15: Perform condition check using else if

    awk -F "|" '{ if ($3>=3000) print $2 "your tax is 30%"; else if($3>=2000) print $2 "your tax is 20%"; else print $2 "your tax is 10%";}' file2.txt

Command 16: Begin B...
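The else-if chain of Command 15 and the field regex match of Commands 11 and 12 read as follows when written out in Python. The helper names and sample records are invented for this sketch, and it assumes the third field is always numeric.

```python
import re


def tax_message(line, separator="|"):
    """Mirror the else-if chain from Command 15 for one delimited record."""
    fields = line.split(separator)
    name, salary = fields[1], float(fields[2])   # $2 and $3 in awk terms
    if salary >= 3000:
        rate = 30
    elif salary >= 2000:
        rate = 20
    else:
        rate = 10
    # A space is added after the name for readability; the awk version has none.
    return f"{name} your tax is {rate}%"


def field_matches(line, field_no, pattern, separator="|"):
    # Like awk's  $n ~ /pattern/ : test one field against a regex.
    return re.search(pattern, line.split(separator)[field_no - 1]) is not None


records = ["1|sam|3500", "2|ravi|2500", "3|kiran|900"]
for record in records:
    print(tax_message(record))
print([r for r in records if field_matches(r, 2, r"^s")])   # field 2 starts with s
print([r for r in records if field_matches(r, 2, r"n$")])   # field 2 ends with n
```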

How to write Current method name to log in Scala[Code Snippet]

An application framework usually has many methods. If you want to trace and log the name of the current method, the code below will help.

    def getCurrentMethodName: String = Thread.currentThread.getStackTrace()(2).getMethodName

    def test(): Unit = {
      println("you are in - " + getCurrentMethodName)
      println("this is doing some functionality")
    }

    test()

Output:

    you are in - test
    this is doing some functionality
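The same trick carries over to Python, where the caller's frame can be inspected instead of the thread's stack trace. This is a Python analogue offered as a sketch, not part of the Scala snippet; the function names are hypothetical.

```python
import inspect


def get_current_function_name():
    # f_back is the frame of whoever called this helper.
    caller = inspect.currentframe().f_back
    return caller.f_code.co_name


def load_customers():
    name = get_current_function_name()
    print("you are in - " + name)
    print("this is doing some functionality")
    return name


logged_name = load_customers()
```

As with the stack index in the Scala version, the lookup depends on the helper being called directly from the function you want to name.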

How to Calculate total time taken for particular method in Spark[Code Snippet]

When a Spark application performs joins, you might want to know how long a particular join takes. The code snippet below can help.

    import java.util.Date
    val current = new Date().getTime
    println(current)
    Thread.sleep(30000)
    val end = new Date().getTime
    println(end)
    println("time taken " + (end - current).toFloat / 60000 + "mins")

Output:

    import java.util.Date
    current: Long = 1520502573995
    end: Long = 1520502603996
    time taken 0.5000167mins

All you need to do is record the current time before the method starts and again after it ends, then take the difference to get the total time taken by that method. Hope this code snippet helps!!
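The before-and-after pattern described above can be wrapped in a small reusable helper. The sketch below is a Python take on the same idea, using a hypothetical `timed` context manager and `time.perf_counter`; a short sleep stands in for the join being measured.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    # Record the time before the block starts and again after it ends.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = elapsed              # elapsed time in seconds
        print(f"{label} took {elapsed / 60:.4f} mins")


timings = {}
with timed("sleep", timings):
    time.sleep(0.1)                           # placeholder for the real work
```

Keeping the measurements in a dictionary makes it easy to time several steps and compare them afterwards.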


24 Tutorials