
Pig Quick notes

Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
A RELATION is the outermost structure of the Pig Latin data model. It is a bag, where:
- A bag is a collection of tuples
- A tuple is an ordered set of fields
- A field is a piece of data

Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
1. These statements work with relations. They include expressions and schemas.
2. Every statement ends with a semicolon (;).
3. We perform the various operations using the operators provided by Pig Latin, through statements.
4. Except for LOAD and STORE, all other operations take a relation as input and produce another relation as output.
5. As soon as you enter a LOAD statement in the Grunt shell, only its semantic check is carried out. To see the contents of the relation, you need to use the DUMP operator; the MapReduce job that loads the data from the file system runs only when the DUMP operation is performed.
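For instance (a minimal Grunt sketch; the HDFS path and file here are hypothetical), the LOAD line below is only checked semantically when entered, and the MapReduce job is launched only when DUMP runs:

```pig
grunt> student = LOAD '/pig_data/student.txt' USING PigStorage(',')
           AS (id:int, name:chararray, city:chararray);
-- nothing has executed yet; Pig has only validated the statement
grunt> DUMP student;
-- only now is the MapReduce job submitted and the tuples printed
```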

Load statement example:

student_data = LOAD '/home/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

DATA TYPES:

Data type    Description                                Example
INT          32-bit signed integer                      1
LONG         64-bit signed integer                      2L
FLOAT        32-bit floating point                      3.3F
DOUBLE       64-bit floating point                      10.5
CHARARRAY    character array (string) in UTF-8 format   'SAI'
BYTEARRAY    byte array (blob)
BOOLEAN      true/false                                 true
DATETIME     date and time                              1970-01-01T00:00:00.000+00:00
BIGINTEGER   Java BigInteger                            60708090709
BIGDECIMAL   Java BigDecimal                            185.98376256272893883

Complex data types

Tuple  An ordered set of fields. Example: (raja, 30)
Bag    A collection of tuples. Example: {(raju,30),(Mohammad,45)}
Map    A set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
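As a sketch of how these complex types look in practice, a LOAD schema can declare them directly (the file and field names below are hypothetical):

```pig
-- each line of emp.txt would hold a name, a (street, city) tuple,
-- a bag of skill tuples, and a map of extra properties
grunt> emp = LOAD '/pig_data/emp.txt'
           AS (name:chararray,
               address:tuple(street:chararray, city:chararray),
               skills:bag{t:(skill:chararray)},
               props:map[chararray]);
```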

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Filtering
FILTER To remove unwanted rows from a relation. Relation2_name = FILTER Relation1_name BY (condition); filter_data = FILTER student BY city == 'Bangalore';
DISTINCT To remove duplicate rows from a relation. Relation2_name = DISTINCT Relation1_name; distinct_data = DISTINCT student;
FOREACH…GENERATE To generate data transformations based on columns of data. grunt> Relation2_name = FOREACH Relation1_name GENERATE (required data); foreach_data = FOREACH student_details GENERATE id, age, city;
STREAM To transform a relation using an external program.
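A minimal STREAM sketch, assuming a Unix environment where the cut command is available; Pig pipes each tuple (tab-delimited by default) through the external command:

```pig
-- keep only the first two fields of every tuple
grunt> first_two = STREAM student THROUGH `cut -f 1,2`;
```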

 

Grouping and Joining      
JOIN To join two or more relations. Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key; grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
LEFT OUTER JOIN Returns all rows from the left relation, even if there are no matches in the right relation. Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id; grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id; (A FULL OUTER JOIN returns rows when there is a match in either relation.)
COGROUP To group the data in two or more relations. grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age; (21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{})

GROUP To group the data in a single relation. Group_data = GROUP Relation_name BY age; grunt> group_data = GROUP student_details by age; (21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)}) (22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
CROSS To create the cross product of two or more relations. Relation3_name = CROSS Relation1_name, Relation2_name; cross_data = CROSS customers, orders;
Sorting
ORDER To arrange a relation in sorted order based on one or more fields (ascending or descending). grunt> Relation2_name = ORDER Relation1_name BY column (ASC|DESC); order_by_data = ORDER student_details BY age DESC;
LIMIT To get a limited number of tuples from a relation. grunt> Result = LIMIT Relation_name number_of_tuples; limit_data = LIMIT student_details 4;

Combining and Splitting

UNION To combine two or more relations into a single relation. grunt> Relation_name3 = UNION Relation_name1, Relation_name2; student = UNION student1, student2;
SPLIT To split a single relation into two or more relations. grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2); SPLIT student_details INTO student_details1 IF age < 23, student_details2 IF (age > 22 AND age < 25);
Diagnostic Operators    
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
EXPLAIN To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE To view the step-by-step execution of a series of statements.

 

Load Operator:

You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator of Pig Latin.

Syntax

The load statement consists of two parts divided by the "=" operator. On the left-hand side, we mention the name of the relation in which we want to store the data, and on the right-hand side, we define how the data is loaded. Given below is the syntax of the LOAD operator.

Relation_name = LOAD 'Input file path' USING function AS schema;

Where,

relation_name – We have to mention the relation in which we want to store the data.

Input file path – We have to mention the HDFS directory where the file is stored. (In MapReduce mode)

function – We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader). Or, we can define our own load function.

Schema – We have to define the schema of the data. We can define the required schema as follows:

(column1 : data type, column2 : data type, column3 : data type);

 

Note: If we load the data without specifying a schema, the columns are addressed positionally as $0, $1, $2, and so on.
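For example (the file path here is hypothetical), when no schema is given, FOREACH can still project fields by position:

```pig
grunt> student = LOAD '/pig_data/student.txt' USING PigStorage(',');
-- $0 is the first column, $1 the second, and so on
grunt> names = FOREACH student GENERATE $1, $2;
```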

 

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

 

Storing Data:

STORE relation_name INTO 'required path' [USING function];

grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

 

Diagnostic Operators:

DUMP: used to print the contents of a relation on the console.

DESCRIBE: used to view the schema of a relation.

                                grunt> DESCRIBE student;
                                student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }


An Ambivert, music lover, enthusiast, artist, designer, coder, gamer, content writer. He is Professional Software Developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, php, html,css. Know more about him at www.saikumar.me
