
Bucketing in Hive

• Bucketing decomposes data sets into more manageable parts
• Users can specify the number of buckets for their data set
• Specifying bucketing does not guarantee that the table is properly populated
• The number of buckets does not vary with the data
• Bucketing is best suited for sampling
• Map-side joins work well with bucketed tables

In the sample code below, a hash function is applied to ‘emplid’, and rows with the same hash value are placed in the same bucket:

SET hive.enforce.bucketing = true;
-- or set the reducer count to the number of buckets:
SET mapred.reduce.tasks = <<number of buckets>>;

CREATE TABLE empdata(emplid INT, fname STRING, lname STRING)
PARTITIONED BY (join_dt STRING)
CLUSTERED BY (emplid) INTO 64 BUCKETS;
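Since bucketing is best suited for sampling, a bucketed table can be queried with TABLESAMPLE. A minimal sketch, assuming the empdata table defined above:

```sql
-- read only the first of the 64 buckets, hashed on emplid
SELECT emplid, fname
FROM empdata
TABLESAMPLE(BUCKET 1 OUT OF 64 ON emplid);
```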

Partitioning in Hive

• Partitioning improves query performance
• Partitioning changes the way Hive structures data storage
• Partitions are stored as sub-directories in the table directory
• Over-partitioning should be avoided:
– Each partition creates an HDFS directory with many files in it
– It increases the number of small files in HDFS
– It eventually consumes the capacity of the NameNode, since Hadoop keeps the metadata in main memory
• Use a partition scheme that creates partitions with sizes in multiples of the HDFS block size
• Hive also supports dynamic partitions, where partitions are created from query parameters
• Static partitions are created by the ‘PARTITIONED BY’ clause
• Dynamic partitions are not enabled by default; when enabled, they work in ‘strict’ mode
• The maximum number of dynamic partitions is limite...
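Loading into dynamic partitions can be sketched as follows; the staging table name staging_emp is an assumption for illustration, and empdata is the partitioned table from the bucketing section:

```sql
-- enable dynamic partitioning (nonstrict allows all partition values to be dynamic)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- the join_dt partition value is taken from the last column of the SELECT
INSERT OVERWRITE TABLE empdata PARTITION (join_dt)
SELECT emplid, fname, lname, join_dt FROM staging_emp;
```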

Hive Tables

Hive supports two types of tables:
1. Managed / internal tables
2. External tables

Managed tables:
– The life cycle of data in the table is controlled by Hive
– Data is stored under the sub-directory defined by ‘hive.metastore.warehouse.dir’
– When the table is dropped, both data and metadata are deleted
– Not a good choice for sharing data with other tools

External tables:
– Use the keyword EXTERNAL with CREATE TABLE
– The life cycle of data in the table is NOT controlled by Hive
– Data is stored under the directory defined by the LOCATION clause in the CREATE TABLE command
– When the table is dropped, the data is not deleted but the metadata is deleted
– A better choice for sharing data with other tools
– A few HiveQL constructs are not allowed for external tables

DESCRIBE EXTENDED <<table name>> could be used...
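A minimal external-table sketch; the table name, columns and HDFS path are assumptions for illustration:

```sql
-- dropping this table removes only the metadata; /data/ext/emp stays on HDFS
CREATE EXTERNAL TABLE ext_emp(emplid INT, fname STRING, lname STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/ext/emp';
```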

Hive Databases

• Hive databases are like namespaces/catalogs
• If no database name is specified, the ‘default’ database is used
• The keyword SCHEMA can be used instead of DATABASE in all the database-related commands below
• Hive creates a directory for each database it creates
• The default directory for a database is created under the top-level directory specified by the property hive.metastore.warehouse.dir
• A different directory can be specified using the LOCATION option in the CREATE command

Creating a database:

CREATE DATABASE IF NOT EXISTS Mydb1 WITH DBPROPERTIES ('prop1'='value1');

Listing all databases (regular expressions are also allowed when listing):

SHOW DATABASES;
SHOW DATABASES LIKE 'M*';

Describing a database:

DESCRIBE DATABASE Mydb1;  -- shows DB name, comment & DB directory
DESCRIBE DATABASE EXT...
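To complement the commands above, switching to and dropping a database can be sketched as follows:

```sql
USE Mydb1;                             -- make Mydb1 the current database
DROP DATABASE IF EXISTS Mydb1 CASCADE; -- CASCADE also drops any contained tables
```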

File Formats in Hive

• A file format specifies how records are encoded in a file
• A record format specifies how the stream of bytes for a given record is encoded
• The default file format is TEXTFILE – each record is a line in the file
• Hive uses control characters as default delimiters in text files: ^A (octal 001), ^B (octal 002), ^C (octal 003), \n
• The term FIELDS is used when overriding the default delimiter: FIELDS TERMINATED BY '\001'
• Supports text files such as CSV and TSV
• A TEXTFILE can also contain JSON or XML documents

Commonly used file formats:
– TEXTFILE format
• Suitable for sharing data with other tools
• Can be viewed/edited manually
– SequenceFile
• Flat files that store binary key-value pairs
• Offers Reader, Writer, and Sorter classes for reading, writing, and sorting respectively
• Supports – Uncompr...
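Overriding the default ^A delimiter can be sketched as below; the table name and columns are assumptions for illustration:

```sql
-- store records as tab-separated text instead of the ^A default
CREATE TABLE weblogs(ts STRING, url STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```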

Data Model and Datatypes in Hive

Data in Hive is organised into:
– Databases – namespaces that separate tables and other data
– Tables – homogeneous collections of data having the same schema
– Partitions – divisions of table data based on a key value
– Buckets – divisions of partitions based on the hash value of a particular column

Hive Data Types:
Hive supports primitive data types and three collection types.
Primitive types – tinyint, smallint, int, bigint, boolean, string, timestamp, float, double, binary
Collection types:
1. Struct – address struct<city:STRING, state:STRING> – e.g. struct('Bengaluru', 'Karnataka') and address.city = 'Bengaluru'
2. Array – names array('Hari', 'Sai') – e.g. names[1] = 'Sai'
3. Map – name map('first', 'Mahendra', 'last...
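A table using the three collection types can be sketched as follows; the table and column names are assumptions for illustration:

```sql
CREATE TABLE people(
  fullname   STRING,
  nicknames  ARRAY<STRING>,
  name_parts MAP<STRING, STRING>,
  address    STRUCT<city:STRING, state:STRING>
);

-- collection elements are accessed by index, key and field name respectively
SELECT nicknames[0], name_parts['first'], address.city FROM people;
```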

Hive Metastore Configurations

To store metadata, Hive can use any of the three strategies below:
– Embedded
– Local
– Remote

Hive Metastore – Embedded:
• Mainly used for unit tests
• Only one process is allowed to connect to the metastore at a time
• Hive metadata is stored in an embedded Apache Derby database

Hive Metastore – Local:
• Metadata is stored in some other database, such as MySQL
• The Hive client opens a connection to the datastore and makes Hive queries against it

Hive Metastore – Remote:
• All Hive clients make a connection to the metastore server, and the server queries the datastore for metadata
• The metastore server and clients communicate using the Thrift protocol
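A remote metastore is typically pointed at through the hive.metastore.uris property in hive-site.xml; the host name below is an assumption for illustration:

```xml
<!-- hive-site.xml: clients connect to this Thrift metastore endpoint -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
```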

Hive Architecture

Major components of the Hive architecture:
– UI – the user interface for users to submit queries and other operations to the system. As of 2011 the system had a command line interface, and a web-based GUI was being developed.
– Driver – Hive queries are sent to the driver for compilation, optimization and execution
– Compiler – parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore
– Metastore – the system catalog, which contains metadata about table schemas and other system schemas; stored in a separate DB, such as MySQL
– Execution Engine – the component which executes the execution plan created by the comp...
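The execution plan produced by the compiler can be inspected with EXPLAIN; the table name here is an assumption for illustration:

```sql
-- prints the stage graph (map/reduce stages) the compiler generated
EXPLAIN SELECT join_dt, count(*) FROM empdata GROUP BY join_dt;
```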

Interacting with HIVE – CLI, GUI

Hive Command Line Interface (CLI):
• Interaction with Hive is commonly done through the CLI
• The Hive CLI is started with the $HIVE_HOME/bin/hive command, which is a bash script
• The Hive prompt is hive>
• Using the CLI, you can create tables, inspect schemas and query tables
• The CLI is a thick client for Hive – it needs a local copy of all Hive and Hadoop client components along with their configurations
• It can work as a JDBC client, a MapReduce client or an HDFS client

Hive Graphical User Interface (GUI) tools:
– Ambari – provided by the Hortonworks distribution
– HUE – provided by the Cloudera distribution
– Qubole
– Karmasphere
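Besides the interactive prompt, the CLI also accepts a query string or a script file; a brief sketch, assuming a HiveQL script named queries.hql exists:

```
# start the interactive shell
$HIVE_HOME/bin/hive

# run a single query non-interactively
$HIVE_HOME/bin/hive -e 'SHOW TABLES;'

# run a file of HiveQL statements
$HIVE_HOME/bin/hive -f queries.hql
```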

Introduction to Hive – When, What, Why

History:
• At Facebook, data grew from GBs (2006) to 1 TB/day (2007), and today it is 500+ TB per day
• Rapidly growing data made traditional warehousing expensive
• Scaling up vertically is very expensive
• Hadoop is an alternative for storing and processing large data
• But MapReduce is very low-level and requires custom code
• Facebook developed Hive as a solution
• September 2008 – Hive became a Hadoop subproject

What is Hive:
• Hive is a data warehouse solution built on Hadoop
• It is a system for querying, managing and storing structured data on Hadoop
• An infrastructure on Hadoop for summarization and analysis of data
• Provides an SQL dialect called HiveQL to process data on a Hadoop cluster
• Hive translates HiveQL queries to MapReduce via the Java APIs
• Hive is not a full database
• It does not provide record-level i...

Pig Quick notes

PIG QUICK NOTES:
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
A RELATION is the outermost structure of the Pig Latin data model, and it is a bag, where:
– A bag is a collection of tuples
– A tuple is an ordered set of fields
– A field is a piece of data

Pig Latin statements:
While processing data using Pig Latin, statements are the basic constructs.
1. These statements work with relations. They include expressions and schemas.
2. Every statement ends with a semicolon (;).
3. Various operations are performed using operators provided by Pig Latin, through statements.
4. Except for LOAD and STORE, all other Pig Latin statements take a relation as input and produce another relation as output.
5. As soon as you enter a LOAD statement in the Grunt...
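The LOAD → transform → STORE flow described above can be sketched in Pig Latin; the file path and schema are assumptions for illustration:

```pig
-- LOAD produces a relation (a bag of tuples) from an HDFS file
emp = LOAD '/data/emp.txt' USING PigStorage('\t')
      AS (id:int, name:chararray, salary:double);

-- every statement other than LOAD/STORE takes a relation and produces a new relation
high_paid = FILTER emp BY salary > 50000.0;

-- STORE writes the final relation back to HDFS
STORE high_paid INTO '/data/emp_high_paid';
```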
