Thursday, July 30, 2020

Calculating "Approximate Median" in Cloudera Impala, Apache Hive, SQL Server, Oracle and MySQL

APPROX_MEDIAN is an approximate inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns an approximate middle value, or an approximate interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

In short, median is the middle value of a set of ordered data.

Median = {(n + 1) ÷ 2}th value

where n is the number of values in the data set. For example, in the ordered set {3, 5, 7, 9, 11}, n = 5, so the median is the (5 + 1) ÷ 2 = 3rd value, which is 7.

This function is available in a few RDBMSs such as Oracle Database and Cloudera Impala; in Apache Hive, we can achieve the same thing using the PERCENTILE function. If the function is not available in the RDBMS you work with, we can still get the approximate median value in a few simple steps, which we will discuss later.

Click here to get the “Emp” dataset from my previous post if the table and data do not already exist in your database.

Let's see how we implement it in Cloudera Impala first.

SELECT appx_median(sal) FROM emp;

Result: 
appx_median(Sal)
20000.00

SELECT DeptID, appx_median(sal) FROM emp GROUP BY DeptID;

 

Result: as shown in the picture.

The same can be accomplished in Hive with a different function.

SELECT DeptID, PERCENTILE(CAST(sal AS INT),0.5) FROM emp GROUP BY DeptID;
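Hive also ships with the percentile_approx function, which works on DOUBLE values and returns an approximate percentile, closer in spirit to Impala's APPX_MEDIAN on large data sets. A hedged example (not from the original post):

SELECT DeptID, percentile_approx(CAST(sal AS DOUBLE), 0.5) FROM emp GROUP BY DeptID;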


Let's try it in Oracle Database 12c:

SELECT department_id "Department",
       APPROX_MEDIAN(salary DETERMINISTIC) "Median Salary"
  FROM employees
 GROUP BY department_id;

In SQL Server:

The query below will work only if the database compatibility level is 110 or higher:

SELECT
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_25,
  PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_50,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_75,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_95
FROM emp;
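PERCENTILE_CONT(0.50) above is the interpolated median over the whole table. To get one median per department, similar to the Impala and Hive examples, a sketch like the following should work on SQL Server 2012 and later (it assumes the same emp table with DeptID and sal columns):

SELECT DISTINCT DeptID,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal)
                            OVER (PARTITION BY DeptID) AS median_sal
FROM emp;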

In MySQL, there are many ways to calculate the median value; the workarounds can be found here. One classic approach is sketched below.
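A minimal sketch of one classic self-join workaround, assuming the same emp table with a sal column; it returns the exact median and is not taken from the linked post:

SELECT AVG(sal) AS median_sal
FROM (
    SELECT e1.sal
    FROM emp e1
    CROSS JOIN emp e2
    GROUP BY e1.sal
    -- keep the value(s) with at least half the rows on each side
    HAVING SUM(e2.sal <= e1.sal) >= COUNT(*) / 2
       AND SUM(e2.sal >= e1.sal) >= COUNT(*) / 2
) AS middle_values;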

Hope you find this article useful for calculating the approximate median in Big Data technologies like Cloudera Impala and Apache Hive as well as in various traditional RDBMSs.

Tuesday, July 28, 2020

Multiple Ways to Find Missing Serial Numbers in SQL

In my previous blogs, I mentioned that there are often many ways to solve a problem. Below is one more example.

Often, tables with an identity column end up with missing sequence numbers because of data fixes. We may also have sequential numbers or numeric ranges in a table from which we need to find the missing numbers or ranges.

Let's create a temporary table with values before we look into the different methods to accomplish the goal:
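The original setup snippet is not reproduced here, so the sketch below is an assumption: a temp table named #SerialNumbers holding a sequence with a few gaps, which the solutions that follow refer to.

-- Hypothetical sample data with gaps at 4, 7-9 and 12-14
CREATE TABLE #SerialNumbers (SerialNo INT NOT NULL PRIMARY KEY);

INSERT INTO #SerialNumbers (SerialNo)
VALUES (1), (2), (3), (5), (6), (10), (11), (15);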


/* SOLUTION-1 - Identify only missed numbers   */
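The original query is not shown here; one possible sketch uses a recursive CTE to generate the full range and keeps only the numbers that are absent from #SerialNumbers:

;WITH AllNumbers AS (
    SELECT MIN(SerialNo) AS n, MAX(SerialNo) AS mx FROM #SerialNumbers
    UNION ALL
    SELECT n + 1, mx FROM AllNumbers WHERE n < mx
)
SELECT n AS MissingNumber
FROM AllNumbers
WHERE NOT EXISTS (SELECT 1 FROM #SerialNumbers s WHERE s.SerialNo = AllNumbers.n)
OPTION (MAXRECURSION 0);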


/* SOLUTION-2 - Identify the range that has missed values */
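Again a sketch rather than the original query: LEAD (SQL Server 2012+) exposes the next existing value, so every gap shows up as the pair of existing numbers that surround it:

SELECT SerialNo     AS GapStartsAfter,
       NextSerialNo AS GapEndsBefore
FROM (
    SELECT SerialNo,
           LEAD(SerialNo) OVER (ORDER BY SerialNo) AS NextSerialNo
    FROM #SerialNumbers
) gaps
WHERE NextSerialNo - SerialNo > 1;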


/* SOLUTION-3 - Identify the range of missed values */

Credits: Marc Gravell
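A sketch in the same spirit (not necessarily the exact query credited above), reporting each missing range as a start/end pair; it pairs each value with the next one via ROW_NUMBER, so it also runs on versions without LEAD:

;WITH Ordered AS (
    SELECT SerialNo, ROW_NUMBER() OVER (ORDER BY SerialNo) AS rn
    FROM #SerialNumbers
)
SELECT cur.SerialNo + 1 AS RangeStart,
       nxt.SerialNo - 1 AS RangeEnd
FROM Ordered cur
JOIN Ordered nxt ON nxt.rn = cur.rn + 1
WHERE nxt.SerialNo - cur.SerialNo > 1;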


/* SOLUTION-4 - Identify the missed values                 */

Credits: Suprotim Agarwal
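A sketch in the same spirit (not necessarily the exact query credited above): build a tally of numbers on the fly and subtract the existing serial numbers with EXCEPT. It assumes the sequence starts at 1:

WITH Tally AS (
    SELECT TOP (SELECT MAX(SerialNo) FROM #SerialNumbers)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS SerialNo
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT SerialNo FROM Tally
EXCEPT
SELECT SerialNo FROM #SerialNumbers;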

 

 If you find any other method, please do share in the comments section.

Monday, July 27, 2020

Handling "Json" and "Unstructured" Data in SQL

The below is to understand how we can handle JSON data in SQL Server 2016 and later versions.

Sample JSON data is:

  {"accountNumber": 2020112, "pin": 2525},

  {"accountNumber": 2567899, "pin": 1462}

  {"accountNumber": 6789925, "pin": 2614}

  {"accountNumber": 9925678, "pin": 6142}

This can be extracted into columns easily in SQL Server with the help of the following code:
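The original code block is not included here; below is a minimal sketch using OPENJSON (SQL Server 2016+, database compatibility level 130 or higher), under the assumption that the documents above are wrapped in a JSON array:

DECLARE @json NVARCHAR(MAX) = N'[
  {"accountNumber": 2020112, "pin": 2525},
  {"accountNumber": 2567899, "pin": 1462},
  {"accountNumber": 6789925, "pin": 2614},
  {"accountNumber": 9925678, "pin": 6142}
]';

SELECT accountNumber, pin
FROM OPENJSON(@json)
     WITH (accountNumber INT '$.accountNumber',
           pin           INT '$.pin');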

However, what if the data is not in proper JSON format, i.e. there are no curly braces, no colons (:) to indicate name-value pairs, no square brackets to hold arrays, and no commas (,) separating the values? And what if, in addition, it has to be handled on versions prior to SQL Server 2016?

Sample data:

(accountNumber=2020112)(accountPin=2525)(Phone=+12345678)(countryId=121) (DateOfBirth=19810726)(NumberOfCallsMade=381)

(accountNumber=202019)(accountPin=98291)(Phone=)(countryId=1881) (DateOfBirth=19860526)(NumberOfCallsMade=31)

If you look at the data, there are two rows, and they are not strings of the same length since the value lengths differ. No value is provided for the Phone attribute in the second row.

The reason for highlighting these points is that I recently had to work with this data, and the SUBSTRING function in SQL Server alone is not much help. In the end I managed to create the report by using PATINDEX along with SUBSTRING.

This is to show that, with the available resources, we can sort things out even when they look complicated and unsolvable.

Let me add these two rows into a temp table.
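The original snippet is not shown here; a minimal sketch of the setup, with the column name StringVal chosen to match the query that follows:

CREATE TABLE #Temp (StringVal VARCHAR(500));

INSERT INTO #Temp (StringVal) VALUES
('(accountNumber=2020112)(accountPin=2525)(Phone=+12345678)(countryId=121) (DateOfBirth=19810726)(NumberOfCallsMade=381)'),
('(accountNumber=202019)(accountPin=98291)(Phone=)(countryId=1881) (DateOfBirth=19860526)(NumberOfCallsMade=31)');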

 

Now let’s see how we can convert these strings into columns.

SELECT
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,16,7),  PATINDEX('%[0-9]%', SUBSTRING(StringVal,16,7)),  LEN(SUBSTRING(StringVal,16,7)))  Val) X) AccountNumber,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,35,5),  PATINDEX('%[0-9]%', SUBSTRING(StringVal,35,5)),  LEN(SUBSTRING(StringVal,35,5)))  Val) X) Pin,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,47,10), PATINDEX('%[0-9]%', SUBSTRING(StringVal,47,10)), LEN(SUBSTRING(StringVal,47,10))) Val) X) Phone,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,60,15), PATINDEX('%[0-9]%', SUBSTRING(StringVal,60,15)), LEN(SUBSTRING(StringVal,60,15))) Val) X) CountryID,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,78,20), PATINDEX('%[0-9]%', SUBSTRING(StringVal,78,20)), LEN(SUBSTRING(StringVal,78,20))) Val) X) DateOfBirth,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,105,20), PATINDEX('%[0-9]%', SUBSTRING(StringVal,105,20)), LEN(SUBSTRING(StringVal,105,20))) Val) X) NumberOfCallsMade
FROM #Temp

Let’s see what the substrings return.

SELECT SUBSTRING(StringVal,16,7) FROM #Temp
SELECT SUBSTRING(StringVal,35,5) FROM #Temp
SELECT SUBSTRING(StringVal,47,10) FROM #Temp
SELECT SUBSTRING(StringVal,60,15) FROM #Temp
SELECT SUBSTRING(StringVal,78,20) FROM #Temp
SELECT SUBSTRING(StringVal,105,20) FROM #Temp

 

The substrings return invalid or unnecessary characters along with the desired output; hence we used PATINDEX to fetch only the numbers.

This conversion can also be achieved with a user-defined function that reads the data row by row and character by character. As I said earlier, there are many ways to solve a problem.
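For example, a helper along these lines could pull a value out by its key instead of relying on fixed positions; this is an assumption for illustration, not the author's function:

-- Hypothetical helper: extract the value of a given key from a "(key=value)" string
CREATE FUNCTION dbo.fn_GetTokenValue (@input VARCHAR(500), @key VARCHAR(100))
RETURNS VARCHAR(100)
AS
BEGIN
    DECLARE @start INT = CHARINDEX('(' + @key + '=', @input);
    IF @start = 0 RETURN NULL;                        -- key not present
    SET @start = @start + LEN(@key) + 2;              -- skip past "(key="
    DECLARE @end INT = CHARINDEX(')', @input, @start);
    RETURN SUBSTRING(@input, @start, @end - @start);  -- the raw value, may be empty
END;
GO

-- Usage: SELECT dbo.fn_GetTokenValue(StringVal, 'accountNumber') FROM #Temp;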



Logical vs Physical Computed Columns

Many people have asked me about the difference between using and not using ‘PERSISTED’ in a computed column; and how do we know whether the column values are being stored logically or physically. 

I would like to share the answer to these questions and some more information here.

Let me start with the definitions first.

  • A computed column is basically a virtual column that is not physically stored in the table.
  • A computed column which is marked as PERSISTED will have its data stored physically.

Let’s see the difference between computed column and persisted computed column:


CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice );

 

If we expand the table in Object Explorer, we see the column marked as computed, with its datatype derived from the columns used in the expression that defines it.


If we verify the properties of the table, it shows the following information.


Also, if we look into the system tables, it appears as a computed column.
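For instance, a query such as the following (a sketch) lists the computed columns of the table together with their definitions and the is_persisted flag referred to later:

SELECT name, is_computed, is_persisted, definition
FROM sys.computed_columns
WHERE object_id = OBJECT_ID('dbo.Products');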


Now drop this table and recreate it with the keyword “PERSISTED” added at the end of the computed column definition.

 

DROP TABLE dbo.Products
GO

CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice PERSISTED)

 

Now, look at the properties, values and attributes shown in Picture-1, Picture-2 and Picture-3, along with the values in the “sys.columns” system table. You will notice that there is no difference except for the “is_persisted” column value in the “sys.computed_columns” system table.

As for identifying whether the data is stored physically or logically, please continue reading the article.

 

CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice );

--INSERTING 1000 ROWS
DECLARE @QtyAvailable SMALLINT, @UnitPrice MONEY
SELECT @QtyAvailable = 1
SELECT @UnitPrice = 10
WHILE @QtyAvailable >= 1 AND @QtyAvailable <= 1000
BEGIN
    INSERT INTO dbo.Products VALUES(@QtyAvailable, @UnitPrice)
    SELECT @QtyAvailable = @QtyAvailable + 1
    SELECT @UnitPrice = @UnitPrice + 1
END
GO

sp_spaceused 'dbo.Products'
GO

 

As you can see, the data space consumed is 24 KB out of the 72 KB reserved; after the index size is deducted, 40 KB remains unused.

Let us check the same for computed column with PERSISTED.

CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice PERSISTED)

--INSERTING 1000 ROWS
DECLARE @QtyAvailable SMALLINT, @UnitPrice MONEY
SELECT @QtyAvailable = 1
SELECT @UnitPrice = 10
WHILE @QtyAvailable >= 1 AND @QtyAvailable <= 1000
BEGIN
    INSERT INTO dbo.Products VALUES(@QtyAvailable, @UnitPrice)
    SELECT @QtyAvailable = @QtyAvailable + 1
    SELECT @UnitPrice = @UnitPrice + 1
END
GO

sp_spaceused 'dbo.Products'
GO


After using “PERSISTED”, the computed column values are stored physically on disk, as you can see from the difference in space consumption.

Hope you liked this article & found it informative. Please do not hesitate to ask me for any further clarification.
