Introduction to Hive – When, What, Why


  • At Facebook, data volume grew from GBs (2006) to 1 TB/day (2007), and today it is 500+ TB per day
  • Rapid data growth made traditional data warehousing expensive
  • Scaling up vertically is very expensive
  • Hadoop is an alternative for storing and processing large data sets
  • But MapReduce is very low-level and requires custom code
  • Facebook developed Hive as a solution
  • Sept 2008 – Hive becomes a Hadoop subproject

What is Hive –

  • Hive is a Data Warehouse solution built on Hadoop
  • It is a system for querying, managing and storing structured data on Hadoop
  • An infrastructure on Hadoop for summarization and analysis of data
  • Provides a SQL dialect called HiveQL to process data on a Hadoop cluster
  • Hive translates HiveQL queries into MapReduce jobs, which run through the Hadoop Java APIs
  • Hive is not a full database
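  • It was not designed for record-level insert, update or delete (releases from 0.14 onward support these only on transactional tables)
  • Hive does not provide general-purpose, OLTP-style transactions
  • Hive queries have high latency even for small data sets, because each query runs as a batch job
  • Most suitable for migrating traditional data warehouse applications to Hadoop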
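
The translation step above can be pictured with a small sketch. It is only a conceptual illustration in plain Python, not Hive's actual query planner: it shows how an aggregate such as SELECT page, COUNT(*) FROM page_views GROUP BY page breaks down into a map phase, a shuffle and a reduce phase. The page_views table, its columns and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a page_views table: (user_id, page).
PAGE_VIEWS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/about"),
    ("u3", "/home"),
    ("u2", "/about"),
    ("u3", "/contact"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Collect the emitted values under their key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into one aggregate, here COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_views_by_page(rows):
    """Conceptual equivalent of: SELECT page, COUNT(*) FROM page_views GROUP BY page."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_views_by_page(PAGE_VIEWS))
```

On a real cluster the map and reduce phases run in parallel across many machines, and the shuffle moves data between them over the network. That startup and data-movement cost is one reason even small queries are slow.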

Why Hive –

  • Jobs written against the Hadoop Java APIs require managing many low-level details
  • MapReduce programming is best suited to experienced Java programmers
  • Hive provides a familiar, SQL-like programming model
  • Eliminates the need to write complex Java code
  • HiveQL is simple; one need not be an experienced programmer to write Hive queries
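
To give a feel for the SQL-like model, here is a minimal sketch of a declarative query that filters, groups and sorts in a few lines. It runs against Python's built-in sqlite3 module purely as a stand-in, since no Hive cluster is assumed here. The page_views table, its columns and the sample rows are hypothetical. The query uses only basic SELECT, WHERE, GROUP BY and ORDER BY clauses, so it should look much the same in HiveQL.

```python
import sqlite3

# Hypothetical sample rows: (user_id, page, view_date).
ROWS = [
    ("u1", "/home", "2008-09-01"),
    ("u2", "/home", "2008-09-01"),
    ("u3", "/about", "2008-09-01"),
    ("u1", "/about", "2008-09-02"),
    ("u2", "/home", "2008-09-01"),
    ("u3", "/contact", "2008-09-01"),
]

# The whole analysis is one declarative statement; no map or reduce code is written.
QUERY = """
SELECT page, COUNT(*) AS views
FROM page_views
WHERE view_date = '2008-09-01'
GROUP BY page
ORDER BY views DESC, page
"""


def top_pages(rows):
    """Load rows into an in-memory SQLite table (stand-in for Hive) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE page_views (user_id TEXT, page TEXT, view_date TEXT)"
        )
        conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for page, views in top_pages(ROWS):
        print(page, views)
```

The query states what result is wanted and leaves the execution plan to the engine. The same division of labour applies when Hive turns a query into jobs on Hadoop.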

