Bucketing in Hive

By Sai Kumar on August 25, 2017

• Bucketing decomposes a data set into more manageable parts
• Users specify the number of buckets for a table when they create it
• Declaring buckets does not guarantee that the table is populated correctly; the data must be loaded in a way that honors the bucketing
• The number of buckets is fixed at table creation and does not vary with the data
• Bucketing is well suited to sampling
• Bucketing enables efficient map-side joins
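
The sampling and map-side join points can be sketched in HiveQL as follows. This is a minimal illustration, not a tuned setup: it assumes the empdata table defined later in this post (64 buckets on emplid) and a hypothetical second table, empsalary, with a salary column, bucketed on the same key. Depending on the Hive version, further settings or a MAPJOIN hint may be needed before the optimizer actually picks a bucket map join.

```sql
-- Sampling: read only the first of the 64 buckets (roughly 1/64 of the rows).
SELECT emplid, fname, lname
FROM empdata TABLESAMPLE(BUCKET 1 OUT OF 64 ON emplid);

-- Bucket map join: both tables should be bucketed on the join key,
-- with bucket counts that are multiples of each other.
SET hive.optimize.bucketmapjoin = true;

-- empsalary and its salary column are hypothetical, used only for illustration.
SELECT e.emplid, e.fname, s.salary
FROM empdata e
JOIN empsalary s ON e.emplid = s.emplid;
```

Because the sample is taken on the bucketing column, Hive can read a single bucket file instead of scanning the whole table.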

In the sample code below, a hash function is applied to ‘emplid’, and rows whose hash values map to the same bucket number (hash modulo the number of buckets) are placed in the same bucket.

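SET hive.enforce.bucketing = true;
-- Alternatively, set the number of reducers to the number of buckets
-- (64 for the table below) and use CLUSTER BY in the SELECT that loads the table:
-- SET mapred.reduce.tasks = 64;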

CREATE TABLE empdata(emplid INT, fname STRING, lname STRING)
PARTITIONED BY (join_dt STRING)
CLUSTERED BY (emplid) INTO 64 BUCKETS;
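
Since the table definition alone does not populate the buckets, here is a rough sketch of loading empdata so that rows land in the right bucket files. The source table empdata_staging is hypothetical (assumed to have the columns emplid, fname, lname, join_dt), and the dynamic-partition settings are one possible choice, not the only one. The last query only approximates the bucket assignment: it assumes the default hash-modulo scheme, which can differ between Hive versions.

```sql
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- empdata_staging is a hypothetical, non-bucketed source table.
-- The partition column (join_dt) must come last in the SELECT list.
INSERT OVERWRITE TABLE empdata PARTITION (join_dt)
SELECT emplid, fname, lname, join_dt
FROM empdata_staging;

-- Illustration only: estimate which of the 64 buckets each row belongs to,
-- using Hive's built-in hash() and pmod() functions.
SELECT emplid, pmod(hash(emplid), 64) AS expected_bucket
FROM empdata
LIMIT 10;
```

Loading through INSERT ... SELECT lets Hive do the hashing; copying files into the table directory directly would bypass it, which is why declaring buckets does not by itself guarantee a properly populated table.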

Sai Kumar

An ambivert, music lover, enthusiast, artist, designer, coder, gamer, and content writer. He is a professional software developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, PHP, HTML, and CSS. Know more about him at www.24tutorials.com/sai
