Showing posts with label Hive. Show all posts
Showing posts with label Hive. Show all posts

Wednesday, August 12, 2020

Concatenate rows (group concatenation) in MySQL, Hive, SQL Server and Oracle

The CONCAT() function joins the input strings or different column values and returns as a single string. However, if we need to concatenate rows of strings (rows from the same column) into a single string with a desired separator, this will not work.

There comes a function that fulfills this requirement and this function known with different names in various RDBMS and Big Data Technologies like Hive.

==========

MySQL:

==========

GROUP_CONCAT() is a function which merges the data from multiple rows into one field. It is a GROUP BY function which returns a string. Comma (,) will be used to separate the values in the string.

Example:

SELECT StudentName,GROUP_CONCAT(Subjects)

FROM tbStudentInfo

GROUP BY StudentName;

==========

Hive:

==========

Hive doesn’t have the same functionality like in MySQL however there are two functions collect_set() and CONCAT_WS() to be used to get the desired output.

Separator has to be specified explicitly.

Example:

SELECT StudentName,CONCAT_WS(‘,’, collect_set(Subjects)) as Group_Concat

FROM tbStudentInfo

GROUP BY StudentName;

==========

SQL Server:

==========

STRING_AGG() is the function which is introduced in SQL Server 2017 which is equivalent to MySQL’s GROUP_CONCAT function. There is a workaround to achieve group concatenation using STUFFF() along with FOR XML and PATH() functions if you are using older versions.

SQL Server 2017: Example:

SELECT  StudentName,

              STRING_AGG(Subjects,';') Subjects

FROM tbStudentInfo

GROUP BY

    StudentName;

Prior to SQL 2017: Example:

SELECT  DISTINCT StudentName,

              STUFF((SELECT ','+ a.Subjects FROM tbStudentsInfo a

WHERE a.StudentName = b.StudentName

FOR XML PATH('')),1,1,'') AS Subjects

              FROM tbStudentsInfo b

==========

Oracle 11g:

==========

LISTAGG orders data within each group specified in the ORDER BY clause for a specified measure, and then concatenates the values of the measure column.

SELECT

    StudentName,

    LISTAGG(Subjects, ', ') WITHIN GROUP (ORDER BY Subjects) "Subjeccts"

FROM tbStudentInfo

GROUP BY StudentName



Thursday, July 30, 2020

Calculating "Approximate Median" in Cloudera Impala, Apache Hive, SQL Server, Oracle and MySQL

APPROX_MEDIAN is an approximate inverse distribution function that accept a nonstop/continuous dispersion model. It takes a numeric or datetime value and returns an estimated middle value or an approximate interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

In short, median is the middle value of a set of ordered data.

Median = {(n + 1) ÷ 2}th value

is the number of values in a set of data.

This function is available few RDBMSs like Oracle SQL*Plus and Cloudera Impala, also in Hive, we can achieve it using PERCENTILE function. If the function is not available in the RDBMS in which you work, we still can get the approximate median value in simple steps which we will discuss later.

Click here to get the “Emp” dataset from my previous post if the table and data not exists in your database.

Let's see how we implement it in Cloudera Impala first.

SELECT appx_median(sal) FROM emp;

Result: 
appx_median(Sal)
20000.00

SELECT DeptID, appx_median(sal) FROM emp GROUP BY DeptID;

 

Result: as shown in the picture.

The same can be accomplished in Hive with a different function.

SELECT DeptID, PERCENTILE(CAST(sal AS INT),0.5) FROM emp GROUP BY DeptID;


Let's try in Oracle SQL*Plus 12c

SELECT department_id "Department",

       APPROX_MEDIAN(salary DETERMINISTIC) "Median Salary"

  FROM employees

  GROUP BY department_id

  

In SQL Server:

The below will work only if the compact mode is 110 or higher

select

  percentile_cont(0.25) within group(order by sal) over () as percentile_cont_25,

  percentile_cont(0.50) within group(order by sal) over () as percentile_cont_50,

  percentile_cont(0.75) within group(order by sal) over () as percentile_cont_75,

  percentile_cont(0.95) within group(order by sal) over () as percentile_cont_95

from emp;

In case of MySQL, there are many ways to calculate the median value. The workarounds can be found here.

Hope you find this article useful in calculating approximate median in Big Data technologies like Cloudera Impala, Apache Hive and various traditional RDBMSs.

Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...