
Tuesday, September 29, 2020

File formats and compression in Apache Hive

The purpose of this article is to address the different file formats and compression codecs in Apache Hive that are available for different data sets. We will also explore how to use them properly and when to use them.

HiveQL handles structured data only, much like SQL. By default, Hive keeps its metadata in an embedded Derby database, while the table data itself is stored as files in the backend framework; Hive presents that data in a structured format when it is retrieved. Hive can handle several special file formats, such as:

  • Text File Format

  • Sequence File Format

  • RCFile (Row column file format)

  • Avro Files

  • ORC Files (Optimized Row Columnar file format)

  • Parquet

  • Custom INPUTFORMAT and OUTPUTFORMAT


1) TEXT FILE FORMAT:
 
Hive's text file format is the default storage format for loading data from comma-separated values (CSV), tab-delimited, space-delimited, or text files delimited by other special characters. You can use the text format to interchange data with other client applications. The text file format is very common across applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).

The text file format storage option is defined by specifying "STORED AS TEXTFILE" at the end of the table creation.
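For example, a table backed by comma-delimited text files could be created as follows (the table name, column names, and file path below are illustrative only):

    CREATE TABLE employees_text (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees_text;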

2) SEQUENCE FILE FORMAT: 

Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the necessary key-value pairs for a given record. A key advantage of the sequence file format is that it can combine two or more small files into a single file.

The sequence file format storage option is defined by specifying  "STORED AS SEQUENCEFILE" at the end of the table creation.
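As a minimal sketch (reusing the illustrative employees_text table from above), data is typically written into a sequence file table with an INSERT ... SELECT rather than a direct LOAD of a text file:

    CREATE TABLE employees_seq (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE employees_seq
    SELECT id, name, salary FROM employees_text;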

3) RCFILE FORMAT:

The row columnar file format is very similar to the sequence file format. It is a data placement structure designed for MapReduce-based data warehouse systems. It also stores the data as key-value pairs and achieves a high compression rate by grouping rows into row groups and storing them column-wise within each group. It is used when multiple rows need to be processed at a time. The RCFile format is supported by Hive version 0.6.0 and later.

The RC file format storage option is defined by specifying "STORED AS RCFILE" at the end of the table creation.
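A minimal sketch, again assuming the illustrative employees_text staging table from the text file example:

    CREATE TABLE employees_rc (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS RCFILE;

    INSERT OVERWRITE TABLE employees_rc
    SELECT id, name, salary FROM employees_text;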

4) AVRO FILE FORMAT:

Hive version 0.14.0 and later supports Avro files. Avro is a row-based storage format for Hadoop that is widely used as a serialization platform. It is a remote procedure call and data serialization framework that uses JSON to define data types and protocols and serializes data in a compact binary format, making it both compact and efficient. This file format can be used with any of Hadoop's tools, such as Pig and Hive.

Avro is one of the common file formats in Hadoop-based applications. The option to store the data in the Avro file format is defined by specifying "STORED AS AVRO" at the end of the table creation.
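With Hive 0.14.0 and later, the Avro schema can be derived from the table definition, so a minimal sketch (with illustrative names) looks like this:

    CREATE TABLE employees_avro (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS AVRO;

    INSERT OVERWRITE TABLE employees_avro
    SELECT id, name, salary FROM employees_text;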

5) ORC FILE FORMAT:

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data in Hive tables. This file format was designed to overcome limitations of the other Hive file formats. ORC reduces I/O overhead by accessing only the columns that are required for the current query. It requires significantly fewer seek operations because all columns within a single group of row data are stored together on disk.

The ORC file format storage option is defined by specifying "STORED AS ORC" at the end of the table creation.
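A minimal sketch that also sets the ORC compression codec through a table property (ZLIB is the default; SNAPPY and NONE are the other common values; the table and column names are illustrative):

    CREATE TABLE employees_orc (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY');

    INSERT OVERWRITE TABLE employees_orc
    SELECT id, name, salary FROM employees_text;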

6) PARQUET:

Parquet is a column-oriented binary file format. It is open source, available to any project in the Hadoop ecosystem, and designed for effective, efficient data storage in a flat columnar layout, in contrast to row-based formats such as CSV or TSV files. It reads only the necessary columns, which significantly reduces I/O and thus makes it highly efficient for large-scale queries. Parquet tables use Snappy, a fast data compression and decompression library, as the default compression.

The parquet file format storage option is defined by specifying "STORED AS PARQUET" at the end of the table creation.
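A minimal sketch with illustrative names; the compression codec can also be overridden per table with the parquet.compression table property:

    CREATE TABLE employees_parquet (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

    INSERT OVERWRITE TABLE employees_parquet
    SELECT id, name, salary FROM employees_text;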

7) CUSTOM INPUTFORMAT & OUTPUTFORMAT:

We can implement our own "inputformat" and "outputformat" in case the data comes in a different format. These "inputformat" and "outputformat" classes are similar to Hadoop MapReduce's input and output formats, as the sketch below shows.
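As a sketch, the built-in text classes below only demonstrate the syntax; in practice they would be replaced by your own InputFormat and OutputFormat implementations (and, if needed, a custom SerDe) placed on Hive's classpath:

    CREATE TABLE custom_format_table (
      line STRING
    )
    STORED AS
      INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';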

In upcoming posts, we will explore more with examples for each and every file format. 

Have a nice day..!!

