File Formats in Hive

  • File format specifies how records are encoded in files
  • Record format specifies how the stream of bytes for a given record is encoded
  • The default file format is TEXTFILE: each record is a line in the file
  • Hive uses control characters as the default delimiters in text files
    • ^A (octal 001) between fields, ^B (octal 002) between collection items, ^C (octal 003) between map keys and values, \n between records
  • The keyword FIELDS is used when overriding the default field delimiter, as in ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  • Supports delimited text files such as CSV and TSV
  • A TEXTFILE can also contain JSON or XML documents
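
To make the default delimiters concrete, here is a minimal Python sketch that parses text records laid out with ^A, ^B, ^C and newline. The three-column layout (a name, an array of subordinates, a map of deductions), the sample data, and the function names are assumptions made for illustration; this is not how Hive itself reads files.

```python
# Hive's default TEXTFILE delimiters (control characters).
FIELD_DELIM = "\x01"       # ^A separates fields (columns)
COLLECTION_DELIM = "\x02"  # ^B separates array items or map entries
MAP_KEY_DELIM = "\x03"     # ^C separates a map key from its value
LINE_DELIM = "\n"          # newline separates records


def parse_record(line):
    """Split one line into (name, subordinates, deductions).

    The three-column schema is hypothetical, chosen for this example.
    """
    name, subs, deds = line.split(FIELD_DELIM)
    subordinates = subs.split(COLLECTION_DELIM) if subs else []
    deductions = {}
    entries = deds.split(COLLECTION_DELIM) if deds else []
    for entry in entries:
        key, value = entry.split(MAP_KEY_DELIM)
        deductions[key] = float(value)
    return name, subordinates, deductions


def parse_file(text):
    """Parse every non-empty line of a text blob into a record."""
    return [parse_record(line) for line in text.split(LINE_DELIM) if line]


# Made-up sample data: two records, one per line.
sample = (
    "John Doe\x01Mary Smith\x02Todd Jones"
    "\x01Federal Taxes\x03.2\x02State Taxes\x03.05\n"
    "Mary Smith\x01Bill King\x01Federal Taxes\x03.2\n"
)
records = parse_file(sample)
for record in records:
    print(record)
```

Because the delimiters are non-printing characters, they rarely collide with real data, which is one reason they work as defaults.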

Commonly used File Formats:

  1. TextFile format
    • Suitable for sharing data with other tools
    • Can be viewed/edited manually
  2. SequenceFile
    • Flat files that store binary key-value pairs
    • SequenceFile offers Reader, Writer, and Sorter classes for reading, writing, and sorting respectively
    • Supports three formats: uncompressed, record compressed (only the values are compressed), and block compressed (both keys and values are compressed, in blocks of records)
  3. RCFile
    • RCFile (Record Columnar File) stores a table in a record-columnar way: rows are split into row groups, and each row group is stored column by column
  4. ORC
  5. AVRO
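
The three SequenceFile compression modes listed under item 2 can be compared with a rough Python sketch. It uses zlib on made-up in-memory records and leaves out all of the real on-disk framing (headers, lengths, sync markers), so it only illustrates what gets compressed in each mode. The record contents, the function names, and the block size of 25 are assumptions.

```python
import zlib

# Hypothetical records: (key, value) byte pairs with repetitive values.
records = [
    (b"k%d" % i, (b"value-%d " % i) * 20)
    for i in range(100)
]


def uncompressed(recs):
    """Keys and values are written as they are."""
    return b"".join(k + v for k, v in recs)


def record_compressed(recs):
    """Each value is compressed on its own; keys stay uncompressed."""
    return b"".join(k + zlib.compress(v) for k, v in recs)


def block_compressed(recs, block_size=25):
    """Keys and values of a batch of records are each compressed together."""
    out = []
    for start in range(0, len(recs), block_size):
        block = recs[start:start + block_size]
        out.append(zlib.compress(b"".join(k for k, _ in block)))
        out.append(zlib.compress(b"".join(v for _, v in block)))
    return b"".join(out)


sizes = {
    "uncompressed": len(uncompressed(records)),
    "record": len(record_compressed(records)),
    "block": len(block_compressed(records)),
}
print(sizes)
```

On repetitive data such as this, block compression usually comes out smallest, because the compressor sees many records at once instead of a single value.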

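The record-columnar idea behind RCFile (item 3) can be sketched in a few lines of Python: rows are first cut into row groups, and each group is then transposed so that the values of one column sit together. The sample rows, the group size of 2, and the function names are assumptions for illustration; the actual RCFile format also adds metadata and per-column compression, which this sketch leaves out.

```python
def to_row_groups(rows, group_size):
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        # zip(*group) transposes the rows of this group into columns.
        groups.append([list(column) for column in zip(*group)])
    return groups


def read_column(groups, index):
    """Read one column across all row groups without touching the others."""
    return [value for group in groups for value in group[index]]


# Made-up table with columns (id, name, age).
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
    (4, "dave", 37),
]
groups = to_row_groups(rows, group_size=2)
names = read_column(groups, 1)
print(groups)
print(names)
```

A query that needs only one column can skip the other columns inside every row group, which is the main benefit of this layout over a plain row-by-row file.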
