Thursday, August 13, 2020

COMPLETE DDL Commands in Hive

This article covers all the DDL commands in Hive.

 

CREATE

DATABASE LEVEL:

To create the database with properties

CREATE DATABASE TestDB

WITH DBPROPERTIES ('Creator' = 'Username', 'Created-Date' = '01-01-2020');

 Or we can just create the database without any properties.

CREATE DATABASE TestDB;

 

TABLE LEVEL:

Creating a table in Hive with different data types:

CREATE TABLE Emp(EmpID INT, Ename string, ESal float, DeptNo INT)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ',';

Or we can simply create a table without a delimiter specification.

CREATE TABLE Sample(Line string);
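Data can then be loaded into these tables from delimited files. A minimal sketch (the file paths are illustrative):

```sql
-- Load a comma-delimited file from HDFS into the Emp table created above.
LOAD DATA INPATH '/user/hive/input/emp_data.csv'
INTO TABLE Emp;

-- Or load from the local file system instead:
LOAD DATA LOCAL INPATH '/home/user/emp_data.csv'
INTO TABLE Emp;
```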

Creating an external table

CREATE EXTERNAL TABLE TestTable

(test_id INT, col2 INT, col3 STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

LOCATION '/user/hive/warehouse/datadirectory';

 
ALTER

Database Level:

Database Properties:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);

This helps in modifying the database properties that the user provided at the time of database creation.

Example:

ALTER DATABASE TestDB SET DBPROPERTIES ('Edited-by' = 'UserName', 'Created-On' = '2020-08-01');


Database Access:

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

This command helps in modifying the access level to a different user or a role.

Example:

ALTER DATABASE TestDB SET OWNER USER cloudera;

 

Database Location:

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;

Though the documentation states that the database location can be changed, Hive 1.1.0 on CDH 5.10.0 does not allow it.

 

Database Name:

ALTER DATABASE test_db RENAME TO test_db_new;

Though this is also documented, renaming a database is not allowed in Hive 1.1.0 on CDH 5.10.0.

Note: There is no built-in functionality to rename an existing database; however, there is a workaround to rename it, without changing the database location, if you have the necessary privileges on the Hive metastore.

 

TABLE LEVEL:

Renaming A Table:

ALTER TABLE [old_db_name.]old_table_name RENAME TO [new_db_name.]new_table_name;

Example: ALTER TABLE tbOutput RENAME TO tbResults;

 

Adding Columns to an Existing Table:

ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...]);

Example: ALTER TABLE test ADD COLUMNS (Col2 string, Col3 int);

 

Change Column Type

ALTER TABLE name CHANGE column_name new_name new_type;

Example: ALTER TABLE test CHANGE col2 col2 int;

 

Change Column Type and Name

ALTER TABLE name CHANGE column_name new_name new_type;

Example: ALTER TABLE test CHANGE col2 col4 string;

 

Replace Columns in the Table:

ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]);

The above command removes all existing columns and adds the new set of columns. REPLACE is used when you want an altogether different set of columns in your table.

Example: ALTER TABLE test REPLACE COLUMNS (id int, col2 int, col3 string);

 

Change Table’s Location:

ALTER TABLE table_name SET LOCATION 'new_location';

Example: ALTER TABLE dbtest SET LOCATION "hdfs:/DB/dbTest";

 

DROP

DATABASE LEVEL

DROP DATABASE TestDB;

By default, the mode is RESTRICT, which blocks the deletion of a database if it holds tables. Use CASCADE to drop the database even if it contains tables.

DROP DATABASE TestDB CASCADE;

To drop the database if it exists, even if it contains tables:

DROP DATABASE IF EXISTS TestDB CASCADE;

To avoid dropping a database that contains tables:

DROP DATABASE IF EXISTS TestDB RESTRICT;

 

TABLE LEVEL

DROP TABLE tbTest;

 

RENAME

Refer to ‘Alter’ commands.

 

TRUNCATE

TRUNCATE TABLE <tablename>;

Example: TRUNCATE TABLE tbTest;

 

LIST OUT THE DATABASES OR TABLES

SHOW DATABASES;

DESCRIBE DATABASE TestDB;

 

DESCRIBE DATABASE EXTENDED TestDB;

Or you can use

DESC DATABASE EXTENDED TestDB;

SHOW TABLES;


Hope you find this article helpful.




Hive Analytical Functions

The below is a compiled list of aggregate, analytical & advanced functions in Apache Hive. Some of them are widely used and will be discussed in detail in upcoming articles.

• Standard aggregations: such as COUNT(), SUM(), MIN(), MAX(), or AVG().

• RANK: The RANK analytic function is used to assign a rank to rows based on the column values in the OVER clause. Rows with equal values are assigned the same rank, with the next rank value skipped.

• DENSE_RANK: The DENSE_RANK analytic function in Hive assigns a rank to each row. Rows with equal values receive the same rank, and ranks are assigned sequentially so that no rank values are skipped.

• ROW_NUMBER: It assigns a unique sequence number starting from 1 to each row according to the partition and order specification.

• CUME_DIST: Stands for cumulative distribution. For each row, it computes the number of rows whose value is smaller than or equal to the current row's value, divided by the total number of rows. It thus gives the relative position of a column value in a group; for example, we can calculate the cumulative distribution of salaries across all departments.

• PERCENT_RANK: Similar to CUME_DIST, but it uses rank values rather than row counts: (current rank - 1) divided by (total number of rows - 1). Therefore, it returns the percent rank of a value relative to a group of values.

• NTILE: It divides an ordered data set into a number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.

• LEAD: The LEAD function, lead(value_expr[, offset[, default]]), is used to return data from a following row. The number of rows to lead (offset) can optionally be specified; if it is not, the lead is one row by default. When the lead for the current row extends beyond the end of the window, the function returns the default value, or NULL if no default is specified.

• LAG: The LAG function, lag(value_expr[, offset[, default]]), is used to access data from a previous row. The number of rows to lag (offset) can optionally be specified; if it is not, the lag is one row by default. When the lag for the current row extends beyond the beginning of the window, the function returns the default value, or NULL if no default is specified.

• ROUND: Rounds a number to a specified number of decimal places.

• FLOOR: Rounds any positive or negative decimal value down to the largest integer less than or equal to it.

• CEIL: Returns the smallest integer which is greater than, or equal to, the specified numeric expression.

• RLIKE: A relational operator, similar to LIKE in SQL, but it matches the left-hand value against a regular expression. A REGEXP B is equivalent to A RLIKE B.

• REGEXP_EXTRACT: Returns the string extracted using the pattern.

• APPX_MEDIAN: An aggregate function that returns a value that is approximately the median (midpoint) of the set of input values.

• INITCAP: Returns the string with the first letter of each word capitalized (proper case).

• CONCAT: Joins strings/columns into a single string; in newer Hive versions the '||' operator can also be used for concatenation.

• STDDEV: Calculates the standard deviation.

• VARIANCE: An aggregate function that returns the variance of a set of numbers, a mathematical property that signifies how far the values spread apart from the mean. The return value is zero if the input is a single value or a set of identical values, and a positive number otherwise.

• VAR_POP: Returns the population variance.

• VAR_SAMP: Returns the unbiased sample variance.

• STDDEV_POP: Returns the population standard deviation.

• STDDEV_SAMP: Returns the sample standard deviation.

• COVAR_POP: Returns the population covariance of a pair of numeric columns in the group.

• CORR: Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.

• PERCENTILE: Returns the exact pth percentile of a column in the group.

• PERCENTILE_APPROX: Returns an approximate pth percentile of a numeric column (including floating point types) in the group.

• HISTOGRAM_NUMERIC: Returns a histogram of a numeric column in the group using b non-uniformly spaced bins.

• COLLECT_SET: Returns a set of objects with duplicate elements eliminated.

• COLLECT_LIST: Returns a list of objects with duplicates retained.

• CONCAT_WS: Used for concatenation with a specified delimiter.
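To make the ranking functions concrete, here is a sketch against the Emp table defined earlier; the column aliases are illustrative:

```sql
-- Compare the three ranking functions over the same window.
-- For two employees with equal salary in a department:
--   RANK gives 1,1,3; DENSE_RANK gives 1,1,2; ROW_NUMBER gives 1,2,3.
SELECT EmpID, DeptNo, ESal,
       RANK()       OVER (PARTITION BY DeptNo ORDER BY ESal DESC) AS SalRank,
       DENSE_RANK() OVER (PARTITION BY DeptNo ORDER BY ESal DESC) AS SalDenseRank,
       ROW_NUMBER() OVER (PARTITION BY DeptNo ORDER BY ESal DESC) AS SalRowNum
FROM Emp;
```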


Wednesday, August 12, 2020

Oracle Analytical Functions

The below is a compiled list of analytical & advanced functions in Oracle SQL*Plus. Some of them are widely used and will be discussed in detail in upcoming articles.

CORR: Returns the coefficient of correlation of a set of number pairs. You can use it as an aggregate or analytic function.

COVAR_POP: Returns the population covariance of a set of number pairs. You can use it as an aggregate or analytic function.

COVAR_SAMP: Returns the sample covariance of a set of number pairs. You can use it as an aggregate or analytic function.

CUME_DIST: Calculates the cumulative distribution of a value in a set of values.

DENSE_RANK: Calculates the rank of a row in an ordered set of rows with no gaps in rank values.

FIRST_VALUE: Gets the value of the first row in a specified window frame.

LAG: Provides access to a row at a given physical offset that comes before the current row without using a self-join.

LAST_VALUE: Gets the value of the last row in a specified window frame.

LEAD: Provides access to a row at a given physical offset that follows the current row without using a self-join.

NTH_VALUE: Gets the Nth value in a set of values.

NTILE: Divides an ordered set of rows into a number of buckets and assigns an appropriate bucket number to each row.

STDDEV: A built-in function which returns the standard deviation of a set of numbers, i.e. the square root of the variance for the input number set. It can be used as both an aggregate and an analytic function.

VARIANCE: Returns the variance of expr. You can use it as an aggregate or analytic function. Oracle Database calculates the variance of expr as 0 if the number of rows in expr = 1.

VAR_POP: Returns the population variance of a set of numbers after discarding the nulls in the set. You can use it as both an aggregate and analytic function.

VAR_SAMP: Returns the sample variance of a set of numbers after discarding the nulls in the set. You can use it as both an aggregate and analytic function.

PERCENT_RANK: Calculates the percent rank of a value in a set of values.

RANK: Calculates the rank of a value in a set of values.

ROW_NUMBER: Assigns a unique sequential integer starting from 1 to each row in a partition or in the whole result.

CLUSTER_DETAILS: Returns cluster details for each row in the selection. The return value is an XML string that describes the attributes of the highest probability cluster or the specified cluster_id.

FEATURE_DETAILS: Returns feature details for each row in the selection. The return value is an XML string that describes the attributes of the highest value feature or the specified feature_id.

PREDICTION_DETAILS: Returns prediction details for each row in the selection. The return value is an XML string that describes the attributes of the prediction.

LISTAGG: Concatenates values of the measure_column for each group based on the order_by_clause.

RATIO_TO_REPORT: Computes the ratio of a value to the sum of a set of values. If expr evaluates to null, the ratio-to-report value also evaluates to null.
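As a quick sketch of how several of these are used together, assuming a hypothetical emp table with columns empno, deptno and sal:

```sql
-- Rank salaries within each department, show the previous salary,
-- and each salary's share of the department total.
SELECT empno, deptno, sal,
       RANK() OVER (PARTITION BY deptno ORDER BY sal DESC)  AS sal_rank,
       LAG(sal) OVER (PARTITION BY deptno ORDER BY sal)     AS prev_sal,
       RATIO_TO_REPORT(sal) OVER (PARTITION BY deptno)      AS sal_ratio
FROM emp;
```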

 

Concatenate rows (group concatenation) in MySQL, Hive, SQL Server and Oracle

The CONCAT() function joins the input strings or different column values and returns them as a single string. However, if we need to concatenate rows of strings (rows from the same column) into a single string with a desired separator, this will not work.

There is a function that fulfills this requirement, and it is known by different names in various RDBMSs and big data technologies like Hive.

==========

MySQL:

==========

GROUP_CONCAT() is a function which merges the data from multiple rows into one field. It is a GROUP BY function that returns a string; by default a comma (,) separates the values in the string.

Example:

SELECT StudentName,GROUP_CONCAT(Subjects)

FROM tbStudentInfo

GROUP BY StudentName;

==========

Hive:

==========

Hive doesn't have the same built-in function as MySQL; however, the two functions collect_set() and CONCAT_WS() can be combined to get the desired output.

The separator has to be specified explicitly.

Example:

SELECT StudentName, CONCAT_WS(',', collect_set(Subjects)) AS Group_Concat

FROM tbStudentInfo

GROUP BY StudentName;
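Note that collect_set() eliminates duplicate values. If repeated values must be kept (say, a subject recorded twice for a student), collect_list() can be substituted, a sketch:

```sql
-- collect_list() preserves duplicates, unlike collect_set().
SELECT StudentName,
       CONCAT_WS(',', collect_list(Subjects)) AS Group_Concat
FROM tbStudentInfo
GROUP BY StudentName;
```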

==========

SQL Server:

==========

STRING_AGG() is a function introduced in SQL Server 2017 which is equivalent to MySQL's GROUP_CONCAT function. If you are using older versions, there is a workaround to achieve group concatenation using STUFF() along with FOR XML PATH.

SQL Server 2017: Example:

SELECT  StudentName,

              STRING_AGG(Subjects,';') Subjects

FROM tbStudentInfo

GROUP BY

    StudentName;

Prior to SQL 2017: Example:

SELECT DISTINCT StudentName,
       STUFF((SELECT ',' + a.Subjects FROM tbStudentInfo a
              WHERE a.StudentName = b.StudentName
              FOR XML PATH('')), 1, 1, '') AS Subjects
FROM tbStudentInfo b;

==========

Oracle 11g:

==========

LISTAGG orders data within each group specified in the ORDER BY clause for a specified measure, and then concatenates the values of the measure column.

SELECT

    StudentName,

    LISTAGG(Subjects, ', ') WITHIN GROUP (ORDER BY Subjects) "Subjects"

FROM tbStudentInfo

GROUP BY StudentName;



Tuesday, August 11, 2020

DISTINCT vs APPROXIMATE DISTINCT

As all (or most) of us know, SELECT DISTINCT eliminates duplicate records from the results and returns only unique values. But it's very expensive in terms of resource consumption, as it requires sorting or hashing all rows over all columns in the table.

Using DISTINCT may be fine when the column is a primary key or the data volume is small. The challenge is with huge data, where counting distinct values can scale up to tens of millions or billions of cardinalities, across terabytes and petabytes of information.

Beyond simple counting, performing aggregations across numerous dimensions consumes high resources and is a genuinely tough job for both traditional RDBMSs and big data technologies.

You may wonder why this is a problem for big data technologies like Hive or Impala, when these are the very technologies invented to handle such huge data.

Hive queries invoke MapReduce, which executes the query using multiple mappers and, for a global distinct count, a single reducer. Every mapper sends its results to that reducer, so the reducer always has more to process. Processing terabytes or petabytes of data is too expensive for the reduce step due to the high memory consumption, as it all happens in memory.

There comes “Approximate Count of Distinct Values” into the picture.

While managing large datasets, if (1) we do not need exact accuracy, or (2) approximate accuracy is acceptable, or (3) we are happy with figures that may vary +/- 2% from the accurate figures, and (4) we are looking for a method that is much less memory-intensive for columns with high cardinality, then we can produce an estimate of the distinct values for a column using this method.

Let’s see what this method is in both traditional and big data technologies.

Impala:

NDV (Number of Distinct Values) in Cloudera’s Impala is an aggregate function that returns an approximate value similar to SELECT DISTINCT.

This method is far quicker than COUNT(DISTINCT) and uses a constant amount of memory.

According to Impala's documentation, the outcome is an estimate which might not reflect the precise number of different values in the column, especially if the cardinality is very low or very high. If the estimated number is higher than the number of rows in the table, Impala adjusts the value internally during query planning.

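A minimal sketch comparing an exact distinct count with Impala's NDV; the table and column names are assumptions for illustration:

```sql
-- Exact distinct count (memory-intensive for high-cardinality columns):
SELECT COUNT(DISTINCT customer_id) FROM tbSales;

-- Approximate distinct count using Impala's NDV aggregate:
SELECT NDV(customer_id) FROM tbSales;
```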

 

Oracle 12c:

APPROX_COUNT_DISTINCT is the function which returns approximate distinct values in Oracle SQL*Plus. Based on the documentation this function will return "negligible deviation from the exact result".

1. SELECT COUNT(DISTINCT column) FROM Table1;

2. SELECT APPROX_COUNT_DISTINCT(column) FROM Table1;

The first query will return the exact or accurate figures and the second query returns an approximate figure. There is a difference in performance in both the queries; approx_count_distinct takes less time to return the results.


MongoDB:

As per the behavior specified in the MongoDB documentation, db.collection.estimatedDocumentCount() does not take a query filter and instead uses metadata to return the count for a collection. Thus, it is quicker than countDocuments(), which returns accurate information.

Examples:

1. db.collection.countDocuments()

2. db.collection.estimatedDocumentCount()

The names are self-explanatory. If we need accurate information, we need to use the first query and if we are okay with the estimation, we need to use the second one.

Update: The same functionality has been introduced in SQL Server 2019, for more information please click here.


Hope you find this article helpful.


Saturday, August 8, 2020

Creating Table From Existing Table in Hive, Impala, SQL Server, Oracle SQL*Plus, PostgreSQL and MySQL

In some scenarios we may need to create a table based on an existing one. Sometimes we need the table along with the data, and sometimes only the table structure. Most RDBMSs, Apache Hive and Cloudera's Impala support CREATE TABLE ... AS and CREATE TABLE ... LIKE.

In both the scenarios, we do not specify the columns at all; the column names and types are derived from the source table, query, or data file.

CREATE TABLE ... LIKE creates an empty table based on the definition of another table, including any column attributes and indexes defined in the original table. On the other hand, CREATE TABLE ... AS creates the table based on the columns specified in the SELECT statement, which helps in fetching only the columns that are required. The column definitions, however, will be the same as in the source table.

Let's see in which RDBMSs we have these functionalities. 

Most of the relational applications support 'Create Table..As' and 'Create Table..Like', except Microsoft SQL Server. However, there is another approach available in SQL Server to accomplish the same task. Before we discuss the SQL Server functionality, we will look into the CREATE TABLE statements which work in MySQL, Oracle SQL*Plus, PostgreSQL, Hive and Impala.

CREATE TABLE <newTableName> AS <SelectStatement>

CREATE TABLE <newTableName> LIKE <ExistingTableName>

Let's see some examples:

CREATE TABLE EmployeeData AS SELECT EmpID, Ename, Sal FROM Emp;

As you see, we are selecting only the required columns and not fetching DeptNo from the Emp table. This will create the table EmployeeData with three columns, along with the data. There is also a workaround to create the table with the desired column definitions without importing the data.

CREATE TABLE EmployeeData AS SELECT EmpID, Ename, Sal FROM Emp WHERE 1=2;

Since 1=2 is false, the data import will not happen, but the table will be created with the specified columns.

CREATE TABLE..LIKE does not give the scope to select a few columns from the existing table; it copies the source table's structure. Let's see an example -

CREATE TABLE EmployeeData LIKE Emp;

If you are working with PostgreSQL and if you are getting a syntax error then use 

CREATE TABLE EmployeeData (like Emp); 

Now let's check how this can be accomplished using Microsoft SQL Server.

SELECT * INTO EmployeeData FROM Emp;

This statement copies the Emp table's structure and data into the EmployeeData table. Here also we can fetch only the columns required; see the below example.

SELECT EmpID, Ename, Sal INTO EmployeeData FROM Emp;

If your requirement is to create the table without importing the data, you can use a WHERE clause that is always false.

SELECT EmpID, Ename, Sal INTO EmployeeData FROM Emp WHERE 1=2;

Hope you find this article helpful.



Best Usage of NTILE function in SQL Server and other RDBMSs

SQL Server's NTILE() is a window function that distributes the rows of an ordered partition into a specified number of approximately equal groups, or buckets. The use of the function is described with a real-time problem scenario.

Problem Scenario: An agency runs a different promotional event on the 1st of each month. Management wants to see which promotional event benefited the organization in each region.

To examine the behavior of the NTILE function, create a table holding the regions, promotional events and sales figures, and insert some sample data.

If you look at the results, the promotions with the maximum sales in every region are grouped into the top bucket. This function is available in most of the RDBMS applications.
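A sketch of such a setup in T-SQL; the table, column names and data values are assumptions for illustration:

```sql
-- Hypothetical sales table: one row per region and promotion.
CREATE TABLE tbPromoSales (Region VARCHAR(50), PromoEvent VARCHAR(50), Sales INT);

INSERT INTO tbPromoSales VALUES
('East', 'Promo-A', 1000), ('East', 'Promo-B', 2500),
('West', 'Promo-A', 1800), ('West', 'Promo-B', 1200);

-- NTILE(2) splits each region's promotions into two buckets by sales;
-- bucket 1 holds the better-performing half.
SELECT Region, PromoEvent, Sales,
       NTILE(2) OVER (PARTITION BY Region ORDER BY Sales DESC) AS Bucket
FROM tbPromoSales;
```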

Hope you like this article.
