Sunday, August 16, 2020

Computing Running & Cumulative Average

This article helps you understand how to calculate the running average and the cumulative average on a given data-set.

 

As discussed earlier, windowing functions work the same way in SQL Server, Oracle & Hive: they perform computations across a set of table rows that are related to the current row, whether by department, area, category, or date/time.

 

Observe the sample data-set: it contains the monthly summarized sales amount by location, from which we are about to calculate the running average as well as the cumulative average.

Running Average / Moving Average / Rolling Average:

The reason to compute a rolling average is to smooth out the highs and lows of the data-set and make the patterns or trends in it easier to see.

Cumulative Average:

Cumulative, or span of time, is the most common way time is used in a measure. Traditionally, cumulative measures aggregate data across a span of time; the cumulative average here averages every row from the first month up to the current one.


Both the running and the cumulative average can be computed in a single SELECT statement, partitioning by location and ordering by month so that each location's timeline is averaged independently.

SELECT Location, YYYYMM, Amount,
    AVG(Amount) OVER(PARTITION BY Location ORDER BY YYYYMM
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS MovingAverage,
    AVG(Amount) OVER(PARTITION BY Location ORDER BY YYYYMM
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CumulativeAverage
FROM SalesData;
GO
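To see what the two window frames do, take one location with three monthly amounts; the numbers below are illustrative, not taken from the original data-set:

YYYYMM   Amount   MovingAverage   CumulativeAverage
202001   100      100.0           100.0
202002   200      150.0           150.0
202003   300      250.0           200.0

The moving average looks back one row (the average of 200 and 300 is 250), while the cumulative average spans everything from the first row onward (the average of 100, 200, and 300 is 200).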


Do let me know if you need more clarification.


Saturday, August 15, 2020

Displaying the column names in Hive

In my previous post, I discussed setting a property at the Hive prompt to print the name of the database we are currently working in.

Similarly, there is another Hive configuration property that needs to be enabled in order to see the column names during data retrieval.

set hive.cli.print.header=true;
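This setting lasts only for the current session. To make it the default, you can also place the same command in the .hiverc file in your home directory, which the Hive CLI executes at start-up (assuming a standard Hive CLI installation):

-- contents of ~/.hiverc
set hive.cli.print.header=true;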


Showing the database name in Hive Prompt

Unlike in traditional RDBMS clients, it is difficult to know which database we are currently working in with Hive's command-line interface: it always shows the bare hive> prompt.

However, there is a trick!

Set a property to print the current database as part of the prompt, so you always know which database you are in.

set hive.cli.print.current.db=true;
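Once the property is enabled, the current database appears in parentheses in the prompt. A sketch of the effect (the database names are illustrative):

hive (default)> use retail_db;
hive (retail_db)>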


Skipping First and Footer Row - Hive Internal & External Tables

Most data-sets (CSV files, text files, and so forth) carry a header row inside the data, and loading them into Hive tables as-is will cause null values. This article will show you how to deal with header rows while creating internal or external tables.

If you are creating an internal table -

CREATE TABLE IF NOT EXISTS Emp(
    empID Int,
    Ename String,
    Sal Decimal,
    Dno Int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES("skip.header.line.count"="1");

 

If you are creating an external table –

CREATE EXTERNAL TABLE IF NOT EXISTS Emp(
    empID Int,
    Ename String,
    Sal Decimal,
    Dno Int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/empdirectory'
TBLPROPERTIES("skip.header.line.count"="1");


If you have already created a table without specifying “skip.header.line.count”, you can still alter the table properties using the following command.

ALTER TABLE Emp SET TBLPROPERTIES ("skip.header.line.count"="1");

 

Similarly, if you want to skip the footer line, you can add the below to the table properties.

"skip.footer.line.count"="1"

If you still get the header while querying your table even after this, that is probably because vectorization is enabled. Disabling the feature should solve the issue.

set hive.vectorized.execution.enabled=false; 

That said, vectorization in Hive is a feature that speeds up query execution by processing blocks of 1,024 rows at a time rather than reading one row at a time. When enabled, it greatly reduces CPU usage for typical query operations like scans, filters, aggregates, and joins. Whether to enable or disable it depends on the volume and type of data you are dealing with.


Friday, August 14, 2020

Hive Internal Table - With External Data

Have you ever wondered what happens if you miss the "external" keyword while creating an external table?

Let's check it out.

Here is my sample data. It has three columns namely dno, dname, location.

11, marketing, hyd
12, hr, delhi
13, finance, bang
14, retail, madras
20, db, hyd
21, mg, madras

This file has been placed in /user/cloudera/NewEmpDir as dept.txt

Let us create the table on top of this data without using the "external" keyword.

 

CREATE TABLE newDept
(Dno INT, Dname String, loc String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/NewEmpDir';

Surprisingly, the table will be created and the data will also be loaded without any errors.


The result behaves much like an internal table, except that the data is not stored under the Hive warehouse. The following points are worth observing:

  • Unlike other tables, its data directory is not visible when you browse the files through Hue, though the table itself is visible in Metastore Manager.
  • You can query and retrieve the data from the table.
  • If you drop this table, the source file and directory will also be deleted (see the sketch below).
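A minimal way to verify the last point, assuming the same paths as in this example:

DROP TABLE newDept;

Afterwards, listing the directory from the shell with hdfs dfs -ls /user/cloudera/NewEmpDir should report that the path no longer exists, because Hive treated the table as a managed one.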


Regular Expressions - RLIKE in Hive

SQL developers and data analysts often use comparison operators in their queries, such as =, >, <, and !=.

Apart from these, LIKE, EXISTS, IN, NOT IN, NOT EXISTS, and so on are also used very frequently.

All of these help in fetching only the required data.

There is another relational operator, RLIKE, which may not be used as often as those referred to above, but it helps in searching a string for similar text the way the LIKE operator does in SQL, with the full power of regular expressions. Both MySQL and Hive provide the RLIKE operator for searching with advanced regular expressions.

REGEXP_LIKE in Oracle is equivalent to RLIKE.

Let us see what RLIKE does in Hive.

Below is my “test” table data:

Id     somecol
101    123xyz
102    1234
103    3456
104    abcde
105    Vxyz
106    Bbc


select * from test where somecol RLIKE '[0-9]+';

This will return every row whose column value contains a number; if the column value is “123xyz”, it will also be returned. If we want to eliminate values that contain text and return only numbers, use the following statement.

select * from test where somecol RLIKE '[0-9]+$';

Or, better, anchor both ends of the pattern:

select * from test where somecol RLIKE '^[0-9]+$';

The anchored form is the safer of the two: '[0-9]+$' only requires the value to end with digits, whereas '^[0-9]+$' matches values made up of digits alone. With the sample data above, it returns the rows with ids 102 and 103.

If you want to retrieve the rows that have no numbers, anchor the pattern so that values like “123xyz”, which merely contain letters, are not matched:

SELECT * FROM test WHERE somecol RLIKE '^[A-Za-z]+$';

If you want to retrieve the rows that have numbers with decimal points:

SELECT * FROM test WHERE somecol RLIKE '[0-9]+[.][0-9]+'
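To negate a pattern, wrap the test in NOT. For instance, the following keeps only the rows containing no digits at all (with the sample data above, ids 104, 105, and 106):

SELECT * FROM test WHERE NOT (somecol RLIKE '[0-9]');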


Please leave some comments if you need more clarification. 


Thursday, August 13, 2020

"Target-Dir" vs "Warehouse-Dir" in Sqoop

This article is about using “target-dir” and “warehouse-dir” during a Sqoop import.

Please refer to the below codes.

Code-1: Usage of “Target-Directory”

sqoop import \
--connect jdbc:mysql://localhost/empinfo \
--username root \
--password cloudera \
--table emp \
--target-dir /user/hive/warehouse/empinfo

Code-2: Usage of “Warehouse-Directory”

sqoop import \
--connect jdbc:mysql://localhost/empinfo \
--username root \
--password cloudera \
--table emp \
--warehouse-dir /user/hive/warehouse/empinfo

Both commands work the same way: in the examples above, both ‘target-dir’ and ‘warehouse-dir’ create the “empinfo” folder in the /user/hive/warehouse location.

The difference is that, when using “target-dir”, the emp data (part files) is stored directly inside “empinfo”.

The path of the data will be- /user/hive/warehouse/empinfo/part-m-00000.

“warehouse-dir”, by contrast, creates a folder named “emp” under “empinfo” and places the data in it.

The path of the data will be /user/hive/warehouse/empinfo/emp/part-m-00000.

 

Note the below points:

  • Target-dir works only when you import a single table. That implies it won’t work when you use “sqoop import-all-tables”.
  • Warehouse-dir creates the parent directory in which all your tables are stored, each in a folder named after the table.
  • If you are importing table by table, you need to provide a distinct target-dir location each time, since the same target directory can’t be reused across imports (see the sketch after this list).
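A minimal sketch of the table-by-table case, with a distinct target-dir per import (the second table name, dept, is hypothetical):

sqoop import \
--connect jdbc:mysql://localhost/empinfo \
--username root \
--password cloudera \
--table emp \
--target-dir /user/hive/warehouse/empinfo/emp

sqoop import \
--connect jdbc:mysql://localhost/empinfo \
--username root \
--password cloudera \
--table dept \
--target-dir /user/hive/warehouse/empinfo/dept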

Hope you like this article.


Sqoop Import All - Copying MySQL Database to Hive

When I started learning Sqoop, I ran into so many errors for which I could not get much help from the forums or websites. The one below is among them, and it cost me a lot of time.

sqoop import-all-tables \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--hive-import --hive-database retaildb \
--warehouse-dir=/user/hive/warehouse/retaildb.db \
-m 1

I wanted to import all the tables and their data into a Hive database all at once, i.e., a database-to-database import.

The execution looked fine: I could see tables being created and even the data being imported. However, after the run only a few tables had actually been created in Hive, and only some of the tables’ data had been imported. To confuse me further, hardly any data showed up in Impala even after refreshing the metadata, and Hue could not open some of the files.

I spent a lot of time figuring out what I was doing wrong.

Finally, I got the code right.

I had assigned a single mapper, and that caused the issue: the retail_db database is around 450 MB, for which a single mapper is inadequate.

sqoop import-all-tables \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--hive-import --hive-database retaildb \
--warehouse-dir=/user/hive/warehouse/retaildb.db

The above command creates the internal tables in Hive automatically, and the data is imported without any issue.
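For reference, when -m is omitted, Sqoop falls back to its default of four parallel mappers, which is what resolved the issue here. Parallel imports do require each table to have a primary key on which Sqoop can split the work.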

Import Table From Sqoop to Hive Without Data

The below command helps in creating a table in Hive based on the MySQL table’s definition. 

sqoop create-hive-table \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--table employees \
--hive-table emp

This just creates the table structure in Hive as “emp”, based on the structure of MySQL's “employees” table; no data is imported.



The data types are converted automatically: varchar becomes string and decimal becomes “double” in Hive.

Before implementing, make sure the table “employees” exists in the MySQL retail_db database and the table “emp” does not exist in Hive. By default, the table is created in the “default” database in Hive.
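If you later want to populate the empty table, a separate import can load the data into it. A sketch, reusing the connection details from the command above:

sqoop import \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--table employees \
--hive-import --hive-table emp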

COMPLETE DDL Commands in Hive

This article is to cover all the DDL commands in Hive.

 

CREATE

DATABASE LEVEL:

To create the database with properties

CREATE DATABASE TestDB
WITH DBPROPERTIES ('Creator' = 'Username', 'Created-Date' = '01-01-2020');

 Or we can just create the database without any properties.

CREATE DATABASE TestDB;

 

TABLE LEVEL:

Creating a table in Hive with different data types:

CREATE TABLE Emp(EmpID INT, Ename String, ESal FLOAT, DeptNo INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Or we can simply create a table without specifying a delimiter.

CREATE TABLE Sample(Line string);

Creating an external table

CREATE EXTERNAL TABLE TestTable
(test_id INT, col2 INT, col3 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/datadirectory';

 
ALTER

Database Level:

Database Properties:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);

This helps in modifying the database properties that the user provided at the time of database creation.

Example:

ALTER DATABASE TestDB SET DBPROPERTIES ('Edited-by' = 'UserName', 'Created-On' = '2020-08-01');


Database Access:

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

This command helps in modifying the access level to a different user or a role.

Example:

ALTER DATABASE TestDB SET OWNER USER cloudera;

 

Database Location:

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;

Although it is documented that the database location can be changed, Hive 1.1.0 on CDH 5.10.0 does not allow it.

 

Database Name:

ALTER DATABASE test_db RENAME TO test_db_new;

Although this is also documented, renaming a database is not allowed in Hive 1.1.0 on CDH 5.10.0.

Note: As per the link here, there is no built-in functionality to rename an existing database; however, there is a workaround to rename it without changing the database location, provided you have the necessary privileges on the Hive metastore.

 

TABLE LEVEL:

Renaming A Table:

ALTER TABLE [old_db_name.]old_table_name RENAME TO [new_db_name.]new_table_name;

Example: ALTER TABLE tbOutput RENAME TO tbResults

 

Adding Columns to an Existing Table:

ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...]);

Example: ALTER TABLE test ADD COLUMNS (Col2 string, Col3 int)

 

Change Column Type

ALTER TABLE name CHANGE column_name new_name new_type;

Example: ALTER TABLE test CHANGE col2 col2 int;

 

Change Column Type and Name

ALTER TABLE name CHANGE column_name new_name new_type;

Example: ALTER TABLE test CHANGE col2 col4 string;

 

Replace Columns in the Table:

ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]);

The above command removes all existing columns and adds the new set of columns. REPLACE is used when you want an altogether different set of columns on your table.

Example: ALTER TABLE test REPLACE COLUMNS (id int, col2 int, col3 string)

 

Change Table’s Location:

ALTER TABLE table_name SET LOCATION 'new_hdfs_path';

Example: ALTER TABLE dbtest SET LOCATION "hdfs:/DB/dbTest";

 

DROP

DATABASE LEVEL

DROP DATABASE TestDB;

By default, the mode is RESTRICT, which blocks the deletion of a database if it holds tables. Hence, use CASCADE to drop the database even if it has tables.

DROP DATABASE TestDB CASCADE;

To drop the database if it exists, even when it contains tables:

DROP DATABASE IF EXISTS TestDB CASCADE;

To refuse dropping a database that still has tables:

DROP DATABASE IF EXISTS TestDB RESTRICT;

 

TABLE LEVEL

DROP TABLE tbTest;

 

RENAME

Refer to ‘Alter’ commands.

 

TRUNCATE

TRUNCATE TABLE <tablename>;

Example: TRUNCATE TABLE tbTest;

 

LIST OUT THE DATABASES OR TABLES

SHOW DATABASES;

DESC DATABASE TestDB;

 

DESCRIBE DATABASE EXTENDED TestDB;

Or you can use

DESC DATABASE EXTENDED TestDB;

SHOW TABLES;


Hope you find this article helpful.
