SQLBank: vectorization

Saturday, August 15, 2020

Skipping First and Footer Row - Hive Internal & External Tables

Most of the data-sets (CSV files, Text files, and so forth.) do have header row inside the data, and loading them in the Hive tables will reason null values. This article will assist you in how to deal with the header rows while creating Internal or external tables.

If you are creating an internal table -

CREATE TABLE IF NOT EXISTS Emp(

empID int,

Ename String,

Sal Decimal,

Dno Int)

row format delimited

fields terminated BY ','

tblproperties("skip.header.line.count"="1");

If you are creating an external table –

CREATE EXTERNAL TABLE IF NOT EXISTS Emp(

empID int,

Ename String,

Sal Decimal,

Dno Int)

row format delimited

fields terminated BY ','

LOCATION ‘/user/cloudera/empdirectory

tblproperties("skip.header.line.count"="1");

If in case you already created a table without specifying “skip.header.line.count” then you still can alter the table properties using the following command.

ALTER TABLE Emp SET TBLPROPERTIES ("skip.header.line.count"="1");

Similarly, if you want to skip the footer line you can use the below in the table properties.

"skip.footer.line.count"="1"

Even after this implementation if you are getting the header while querying your table, that is probably because vectorization is enabled. Disabling the feature will help in solving the issue.

set hive.vectorized.execution.enabled=false;

However, vectorization in Hive is a feature that streamlines the query execution operations by processing a block of 1024 rows at a time rather than reading one row at a time. If this feature is enabled it greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. But enabling or disabling this feature is depends on your data volume and type of data you are dealing with.

Saturday, August 15, 2020

Skipping First and Footer Row - Hive Internal & External Tables

Big Data & SQL