Most of the data-sets (CSV files, Text files, and so forth.) do have header row inside the data, and loading them in the Hive tables will reason null values. This article will assist you in how to deal with the header rows while creating Internal or external tables.
If you are creating an internal table -
CREATE TABLE IF NOT EXISTS Emp(
empID int,
Ename String,
Sal Decimal,
Dno Int)
row format delimited
fields terminated BY ','
tblproperties("skip.header.line.count"="1");
If you are creating an external table –
CREATE EXTERNAL TABLE IF NOT EXISTS Emp(
empID int,
Ename String,
Sal Decimal,
Dno Int)
row format delimited
fields terminated BY ','
LOCATION ‘/user/cloudera/empdirectory
tblproperties("skip.header.line.count"="1");
If in case you already created a table without specifying “skip.header.line.count” then you still can alter the table properties using the following command.
ALTER TABLE Emp SET TBLPROPERTIES
("skip.header.line.count"="1");
Similarly, if you want to skip the footer line you can use the below in the table properties.
"skip.footer.line.count"="1"
Even after this implementation if you are getting the header
while querying your table, that is probably because vectorization is enabled.
Disabling the feature will help in solving the issue.
set hive.vectorized.execution.enabled=false;
However, vectorization in Hive is a
feature that streamlines the query execution operations by processing a block
of 1024 rows at a time rather than reading one row at a time. If this feature
is enabled it greatly reduces the CPU usage for typical query operations like
scans, filters, aggregates, and joins. But enabling or disabling this feature
is depends on your data volume and type of data you are dealing with.