History –
- At Facebook, data volume grew from gigabytes (2006) to 1 TB/day (2007), and today it is 500+ TB per day
- Rapidly growing data made traditional data warehousing expensive
- Scaling up vertically is very expensive
- Hadoop is an alternative for storing and processing large data sets
- But MapReduce is very low-level and requires custom code
- Facebook developed Hive as a solution
- Sept 2008 – Hive becomes a Hadoop subproject
What is Hive –
- Hive is a Data Warehouse solution built on Hadoop
- It is a system for querying, managing and storing structured data on Hadoop
- An infrastructure on Hadoop for summarization and analysis of data
- Provides an SQL dialect called HiveQL to process data on a Hadoop cluster
- Hive translates HiveQL queries into MapReduce jobs
- Hive is not a full database
- It does not provide record-level insert, delete or update
- Hive does not provide transactions
- Hive queries have high latency even for small data sets, because every query pays the start-up overhead of a MapReduce job
- Most suitable for traditional data warehouse applications being moved to Hadoop
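As a rough illustration of the translation step described above, the sketch below mimics in plain Python the map, shuffle and reduce phases that a simple HiveQL aggregate would conceptually become. The table `page_views`, its columns, the sample rows and all function names are hypothetical; this is not Hive's actual execution plan or API, only a minimal model of the idea.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, page); illustrative data only.
PAGE_VIEWS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/about"),
    ("u3", "/home"),
]

# The HiveQL query being modelled (conceptually):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which plays the role of COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_views_by_page(rows):
    """Chain the three phases for the hypothetical query above."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_views_by_page(PAGE_VIEWS))
```

With Hive, only the one-line query in the comment has to be written; the three phases are generated and scheduled by the system.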
Why Hive –
- Jobs written directly against the Hadoop Java APIs require many low-level details to be managed
- MapReduce programming is suitable mainly for experienced Java programmers
- Hive provides a familiar, SQL-like programming model
- Eliminates the need to write complex Java code
- Hive coding is simple; one need not be an experienced programmer to write Hive queries
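To give a sense of the low-level work that a single HiveQL `JOIN` hides, here is a minimal sketch of a hand-written reduce-side join in plain Python. The tables `users` and `orders`, their fields, the sample rows and the function names are all hypothetical. Real MapReduce code would be written in Java against the Hadoop APIs and would additionally handle serialization, partitioning and job configuration.

```python
from collections import defaultdict

# Hypothetical tables; illustrative data only.
USERS = [(1, "alice"), (2, "bob")]  # (user_id, name)
ORDERS = [(101, 1, 30.0), (102, 1, 12.5), (103, 2, 8.0)]  # (order_id, user_id, amount)

# Equivalent HiveQL, conceptually:
#   SELECT u.name, o.amount
#   FROM users u JOIN orders o ON u.user_id = o.user_id;


def map_users(rows):
    """Tag each user row with 'U' and key it by the join column."""
    for user_id, name in rows:
        yield user_id, ("U", name)


def map_orders(rows):
    """Tag each order row with 'O' and key it by the join column."""
    for _order_id, user_id, amount in rows:
        yield user_id, ("O", amount)


def reduce_join(tagged_pairs):
    """Group tagged values by key, then pair every user with every order."""
    grouped = defaultdict(lambda: {"U": [], "O": []})
    for key, (tag, value) in tagged_pairs:
        grouped[key][tag].append(value)

    result = []
    for key in sorted(grouped):
        for name in grouped[key]["U"]:
            for amount in grouped[key]["O"]:
                result.append((name, amount))
    return result


def join_users_orders(users, orders):
    """Run both mappers and feed their combined output to the reducer."""
    pairs = list(map_users(users)) + list(map_orders(orders))
    return reduce_join(pairs)


if __name__ == "__main__":
    for name, amount in join_users_orders(USERS, ORDERS):
        print(name, amount)
```

The tagging, grouping and pairing logic above is exactly the kind of detail a Hive user no longer writes by hand.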