PIG QUICK NOTES:
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
A RELATION is the outermost structure of the Pig Latin data model. It is a bag, where:
- A bag is a collection of tuples.
- A tuple is an ordered set of fields.
- A field is a piece of data.
Pig Latin Statements
While processing data using Pig Latin, statements are the basic constructs.
1. These statements work with relations. They include expressions and schemas.
2. Every statement ends with a semicolon (;).
3. We perform various operations through statements, using the operators provided by Pig Latin.
4. Except for LOAD and STORE, all other Pig Latin statements take a relation as input and produce another relation as output.
5. As soon as you enter a LOAD statement in the Grunt shell, its semantic checking is carried out. To see the contents of the relation, you need to use the DUMP operator. Only after the DUMP operation is performed will the MapReduce job that loads the data from the file system actually run.
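A minimal Grunt session illustrating point 5 (the file path and schema here are assumptions, not from a real cluster):

```
-- Semantic checking happens as soon as this statement is entered;
-- no MapReduce job runs yet.
student_data = LOAD '/home/student.txt' USING PigStorage(',')
               AS (id:int, name:chararray, city:chararray);

-- DUMP triggers the actual MapReduce job and prints the tuples.
DUMP student_data;
```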
LOAD statement example:
student_data = LOAD '/home/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:int, city:chararray);
DATA TYPES:
Data type  | Description                                      | Example
INT        | 32-bit signed integer                            | 1
LONG       | 64-bit signed integer                            | 2L
FLOAT      | 32-bit floating point                            | 3.3F
DOUBLE     | 64-bit floating point                            | 10.5
CHARARRAY  | Character array (string) in Unicode UTF-8 format | 'SAI'
BYTEARRAY  | Byte array (blob)                                |
BOOLEAN    | Boolean value                                    | true/false
DATETIME   | Date-time value                                  | 1970-01-01T00:00:00.000+00:00
BIGINTEGER | Java BigInteger                                  | 60708090709
BIGDECIMAL | Java BigDecimal                                  | 185.98376256272893883
Complex data types
Tuple | An ordered set of fields.   | Example: (raja, 30)
Bag   | A collection of tuples.     | Example: {(raju,30),(Mohhammad,45)}
Map   | A set of key-value pairs.   | Example: ['name'#'Raju', 'age'#30]
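A hedged sketch of loading and accessing a map field (the file path and field names are assumptions; PigStorage's default delimiter is tab, which keeps the commas inside the map literal intact):

```
-- Assume each tab-delimited line holds an id and a map literal,
-- e.g.: 1<TAB>[name#Raju,age#30]
data = LOAD '/home/info.txt' AS (id:int, details:map[]);

-- '#' pulls a value out of a map by key.
names = FOREACH data GENERATE id, details#'name' AS name;
```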
Loading and Storing
LOAD  | To load data from the file system (local/HDFS) into a relation. |
STORE | To save a relation to the file system (local/HDFS).             |

Filtering
FILTER           | To remove unwanted rows from a relation.                    | Relation2_name = FILTER Relation1_name BY (condition);               | filter_data = FILTER student BY city == 'Bangalore';
DISTINCT         | To remove duplicate rows from a relation.                   | Relation2_name = DISTINCT Relation1_name;                            | distinct_data = DISTINCT student;
FOREACH…GENERATE | To generate data transformations based on columns of data.  | grunt> Relation2_name = FOREACH Relation1_name GENERATE (required columns); | foreach_data = FOREACH student_details GENERATE id, age, city;
STREAM           | To transform a relation using an external program.          |                                                                      |
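FILTER and FOREACH are commonly chained. A sketch, assuming a student_details relation with id, age, and city fields as in the examples above:

```
-- Keep only the Bangalore students, then project three columns.
bangalore = FILTER student_details BY city == 'Bangalore';
projected = FOREACH bangalore GENERATE id, age, city;
DUMP projected;
```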
Grouping and Joining
JOIN            | To join two or more relations.                                                                | Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;                    | grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
LEFT OUTER JOIN | Returns all rows from the left relation, even if there are no matches in the right relation. | Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;  | grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
FULL OUTER JOIN | Returns rows when there is a match in either of the relations.                               |                                                                                        |
COGROUP         | To group the data in two or more relations.                                                  | grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age;         | (21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{})
GROUP           | To group the data in a single relation.                                                      | Group_data = GROUP Relation_name BY age; e.g. grunt> group_data = GROUP student_details BY age; | (21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)}) (22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
CROSS           | To create the cross product of two or more relations.                                        | Relation3_name = CROSS Relation1_name, Relation2_name;                                 | cross_data = CROSS customers, orders;
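GROUP is usually followed by a FOREACH that aggregates each group's bag. A sketch, assuming the same student_details relation grouped by age as above:

```
grouped = GROUP student_details BY age;

-- 'group' is the grouping key; COUNT aggregates each group's bag.
counts = FOREACH grouped GENERATE group AS age,
                                  COUNT(student_details) AS num_students;
```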
Sorting
ORDER | To arrange a relation in sorted order based on one or more fields (ascending or descending). | grunt> Relation2_name = ORDER Relation1_name BY field (ASC|DESC); | order_by_data = ORDER student_details BY age DESC;
LIMIT | To get a limited number of tuples from a relation.                                           | grunt> Result = LIMIT Relation_name required_number_of_tuples;    | limit_data = LIMIT student_details 4;
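ORDER and LIMIT together give the common "top-N" pattern. A sketch over the assumed student_details relation:

```
-- Sort by age descending, then keep the first 4 tuples:
-- the 4 oldest students.
sorted = ORDER student_details BY age DESC;
top4   = LIMIT sorted 4;
DUMP top4;
```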
Combining and Splitting
UNION | To combine two or more relations into a single relation. | grunt> Relation3_name = UNION Relation1_name, Relation2_name;                                     | student = UNION student1, student2;
SPLIT | To split a single relation into two or more relations.   | grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2); | SPLIT student_details INTO student_details1 IF age < 23, student_details2 IF (22 < age AND age < 25);
Diagnostic Operators | |||
DUMP | To print the contents of a relation on the console. | ||
DESCRIBE | To describe the schema of a relation. | ||
EXPLAIN | To view the logical, physical, or MapReduce execution plans to compute a relation. | ||
ILLUSTRATE | To view the step-by-step execution of a series of statements. |
Load Operator:
You can load data into Apache Pig from the file system (HDFS/local) using the LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the "=" operator. On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how the data is loaded. Given below is the syntax of the LOAD operator.
Relation_name = LOAD 'Input file path' USING function AS schema;
Where,
- Relation_name – the relation in which we want to store the data.
- Input file path – the HDFS directory where the file is stored (in MapReduce mode).
- function – a function chosen from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader), or a load function we define ourselves.
- schema – the schema of the data, which we can define as follows:
(column1 : data type, column2 : data type, column3 : data type);
Note: If we load the data without specifying a schema, the columns are addressed positionally as $0, $1, $2, and so on.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Storing Data:
STORE Relation_name INTO 'required path' [USING function];
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
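Putting LOAD, a transformation, and STORE together gives a complete end-to-end sketch (paths and schema follow the assumed examples above):

```
-- Load from HDFS with a comma delimiter and an explicit schema.
student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
          USING PigStorage(',')
          AS (id:int, firstname:chararray, lastname:chararray,
              phone:chararray, city:chararray);

-- Keep only one city's rows.
hyd = FILTER student BY city == 'Hyderabad';

-- Write the result back to HDFS; the STORE triggers the job.
STORE hyd INTO 'hdfs://localhost:9000/pig_output/hyd_students'
      USING PigStorage(',');
```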
Diagnostic Operators:
DUMP : used to print the contents of a relation on the console.
DESCRIBE : used to view the schema of a relation.
grunt> DESCRIBE student;
student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }