Thursday, July 30, 2020

Calculating "Approximate Median" in Cloudera Impala, Apache Hive, SQL Server, Oracle and MySQL

APPROX_MEDIAN is an approximate inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns an approximate middle value, or an approximate interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

In short, median is the middle value of a set of ordered data.

Median = {(n + 1) ÷ 2}th value

where n is the number of values in the data set. For example, in the ordered set {3, 5, 7, 9, 11}, n = 5, so the median is the (5 + 1) ÷ 2 = 3rd value, which is 7.

This function is available in a few RDBMSs such as Oracle Database and Cloudera Impala; in Apache Hive, we can achieve the same thing using the PERCENTILE function. If the function is not available in the RDBMS you work with, we can still get the approximate median value in a few simple steps, which we will discuss later.

Click here to get the “Emp” dataset from my previous post if the table and data do not already exist in your database.

Let's see how we implement it in Cloudera Impala first.

SELECT appx_median(sal) FROM emp;

Result: 
appx_median(Sal)
20000.00

SELECT DeptID, appx_median(sal) FROM emp GROUP BY DeptID;

 

Result: as shown in the picture.

The same can be accomplished in Hive with a different function.

SELECT DeptID, PERCENTILE(CAST(sal AS INT),0.5) FROM emp GROUP BY DeptID;
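Hive also ships with the percentile_approx function, which works on DOUBLE values and returns an approximate percentile, closer in spirit to Impala's APPX_MEDIAN on large data sets. A hedged example (not from the original post):

SELECT DeptID, percentile_approx(CAST(sal AS DOUBLE), 0.5) FROM emp GROUP BY DeptID;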


Let's try it in Oracle Database 12c:

SELECT department_id "Department",
       APPROX_MEDIAN(salary DETERMINISTIC) "Median Salary"
  FROM employees
 GROUP BY department_id;

In SQL Server:

The query below will work only if the database compatibility level is 110 or higher:

SELECT
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_25,
  PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_50,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_75,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY sal) OVER () AS percentile_cont_95
FROM emp;
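PERCENTILE_CONT(0.50) above is the interpolated median over the whole table. To get one median per department, similar to the Impala and Hive examples, a sketch like the following should work on SQL Server 2012 and later (it assumes the same emp table with DeptID and sal columns):

SELECT DISTINCT DeptID,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal)
                            OVER (PARTITION BY DeptID) AS median_sal
FROM emp;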

In MySQL, there are many ways to calculate the median value; the workarounds can be found here. One classic approach is sketched below.
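A minimal sketch of one classic self-join workaround, assuming the same emp table with a sal column; it returns the exact median and is not taken from the linked post:

SELECT AVG(sal) AS median_sal
FROM (
    SELECT e1.sal
    FROM emp e1
    CROSS JOIN emp e2
    GROUP BY e1.sal
    -- keep the value(s) with at least half the rows on each side
    HAVING SUM(e2.sal <= e1.sal) >= COUNT(*) / 2
       AND SUM(e2.sal >= e1.sal) >= COUNT(*) / 2
) AS middle_values;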

Hope you find this article useful for calculating the approximate median in Big Data technologies like Cloudera Impala and Apache Hive as well as in various traditional RDBMSs.

Tuesday, July 28, 2020

Multiple Ways to Find Missing Serial Numbers in SQL

In my previous blogs, I mentioned that there are often many ways to solve a problem. Below is one more example.

Often, tables with an identity column end up with missing sequence numbers because of data fixes. We may also have sequential numbers or numeric ranges in a table from which we need to find the missing numbers or ranges.

Let's create a temporary table with values before we look into the different methods to accomplish the goal:
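The original setup snippet is not reproduced here, so the sketch below is an assumption: a temp table named #SerialNumbers holding a sequence with a few gaps, which the solutions that follow refer to.

-- Hypothetical sample data with gaps at 4, 7-9 and 12-14
CREATE TABLE #SerialNumbers (SerialNo INT NOT NULL PRIMARY KEY);

INSERT INTO #SerialNumbers (SerialNo)
VALUES (1), (2), (3), (5), (6), (10), (11), (15);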


/* SOLUTION-1 - Identify only missed numbers   */
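The original query is not shown here; one possible sketch uses a recursive CTE to generate the full range and keeps only the numbers that are absent from #SerialNumbers:

;WITH AllNumbers AS (
    SELECT MIN(SerialNo) AS n, MAX(SerialNo) AS mx FROM #SerialNumbers
    UNION ALL
    SELECT n + 1, mx FROM AllNumbers WHERE n < mx
)
SELECT n AS MissingNumber
FROM AllNumbers
WHERE NOT EXISTS (SELECT 1 FROM #SerialNumbers s WHERE s.SerialNo = AllNumbers.n)
OPTION (MAXRECURSION 0);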


/* SOLUTION-2 - Identify the range that has missed values */
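Again a sketch rather than the original query: LEAD (SQL Server 2012+) exposes the next existing value, so every gap shows up as the pair of existing numbers that surround it:

SELECT SerialNo     AS GapStartsAfter,
       NextSerialNo AS GapEndsBefore
FROM (
    SELECT SerialNo,
           LEAD(SerialNo) OVER (ORDER BY SerialNo) AS NextSerialNo
    FROM #SerialNumbers
) gaps
WHERE NextSerialNo - SerialNo > 1;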


/* SOLUTION-3 - Identify the range of missed values */

Credits: Marc Gravell
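A sketch in the same spirit (not necessarily the exact query credited above), reporting each missing range as a start/end pair; it pairs each value with the next one via ROW_NUMBER, so it also runs on versions without LEAD:

;WITH Ordered AS (
    SELECT SerialNo, ROW_NUMBER() OVER (ORDER BY SerialNo) AS rn
    FROM #SerialNumbers
)
SELECT cur.SerialNo + 1 AS RangeStart,
       nxt.SerialNo - 1 AS RangeEnd
FROM Ordered cur
JOIN Ordered nxt ON nxt.rn = cur.rn + 1
WHERE nxt.SerialNo - cur.SerialNo > 1;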


/* SOLUTION-4 - Identify the missed values                 */

Credits: Suprotim Agarwal
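A sketch in the same spirit (not necessarily the exact query credited above): build a tally of numbers on the fly and subtract the existing serial numbers with EXCEPT. It assumes the sequence starts at 1:

WITH Tally AS (
    SELECT TOP (SELECT MAX(SerialNo) FROM #SerialNumbers)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS SerialNo
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT SerialNo FROM Tally
EXCEPT
SELECT SerialNo FROM #SerialNumbers;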

 

 If you find any other method, please do share in the comments section.

Monday, July 27, 2020

Handling "Json" and "Unstructured" Data in SQL

The below is to understand how we can handle JSON data in SQL Server 2016 and later versions.

Sample JSON data is:

  {"accountNumber": 2020112, "pin": 2525},

  {"accountNumber": 2567899, "pin": 1462}

  {"accountNumber": 6789925, "pin": 2614}

  {"accountNumber": 9925678, "pin": 6142}

This can be extracted into columns easily in SQL Server with the help of the following code:
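The original code block is not included here; below is a minimal sketch using OPENJSON (SQL Server 2016+, database compatibility level 130 or higher), under the assumption that the documents above are wrapped in a JSON array:

DECLARE @json NVARCHAR(MAX) = N'[
  {"accountNumber": 2020112, "pin": 2525},
  {"accountNumber": 2567899, "pin": 1462},
  {"accountNumber": 6789925, "pin": 2614},
  {"accountNumber": 9925678, "pin": 6142}
]';

SELECT accountNumber, pin
FROM OPENJSON(@json)
     WITH (accountNumber INT '$.accountNumber',
           pin           INT '$.pin');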

However, what if the data is not in proper JSON format, i.e. there are no curly braces, no colons (:) to indicate name-value pairs, no square brackets to hold arrays, and no commas (,) separating the values? And what if, in addition, it has to be handled on versions prior to SQL Server 2016?

Sample data:

(accountNumber=2020112)(accountPin=2525)(Phone=+12345678)(countryId=121) (DateOfBirth=19810726)(NumberOfCallsMade=381)

(accountNumber=202019)(accountPin=98291)(Phone=)(countryId=1881) (DateOfBirth=19860526)(NumberOfCallsMade=31)

If you look at the data, there are two rows, and they are not strings of the same length since the value lengths differ. No value is provided for the Phone attribute in the second row.

The reason for highlighting these points is that I recently had to work with this data, and the SUBSTRING function in SQL Server alone is not much help. In the end I managed to create the report by using PATINDEX along with SUBSTRING.

This is to show that, with the available resources, we can sort things out even when they look complicated and unsolvable.

Let me add these two rows into a temp table.
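The original snippet is not shown here; a minimal sketch of the setup, with the column name StringVal chosen to match the query that follows:

CREATE TABLE #Temp (StringVal VARCHAR(500));

INSERT INTO #Temp (StringVal) VALUES
('(accountNumber=2020112)(accountPin=2525)(Phone=+12345678)(countryId=121) (DateOfBirth=19810726)(NumberOfCallsMade=381)'),
('(accountNumber=202019)(accountPin=98291)(Phone=)(countryId=1881) (DateOfBirth=19860526)(NumberOfCallsMade=31)');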

 

Now let’s see how we can convert these strings into columns.

SELECT
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,16,7),  PATINDEX('%[0-9]%', SUBSTRING(StringVal,16,7)),  LEN(SUBSTRING(StringVal,16,7)))  Val) X) AccountNumber,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,35,5),  PATINDEX('%[0-9]%', SUBSTRING(StringVal,35,5)),  LEN(SUBSTRING(StringVal,35,5)))  Val) X) Pin,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,47,10), PATINDEX('%[0-9]%', SUBSTRING(StringVal,47,10)), LEN(SUBSTRING(StringVal,47,10))) Val) X) Phone,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,60,15), PATINDEX('%[0-9]%', SUBSTRING(StringVal,60,15)), LEN(SUBSTRING(StringVal,60,15))) Val) X) CountryID,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,78,20), PATINDEX('%[0-9]%', SUBSTRING(StringVal,78,20)), LEN(SUBSTRING(StringVal,78,20))) Val) X) DateOfBirth,
  (SELECT LEFT(Val, PATINDEX('%[^0-9]%', Val + 'a') - 1)
     FROM (SELECT SUBSTRING(SUBSTRING(StringVal,105,20), PATINDEX('%[0-9]%', SUBSTRING(StringVal,105,20)), LEN(SUBSTRING(StringVal,105,20))) Val) X) NumberOfCallsMade
FROM #Temp

Let’s see what the substrings return.

SELECT SUBSTRING(StringVal,16,7) FROM #Temp
SELECT SUBSTRING(StringVal,35,5) FROM #Temp
SELECT SUBSTRING(StringVal,47,10) FROM #Temp
SELECT SUBSTRING(StringVal,60,15) FROM #Temp
SELECT SUBSTRING(StringVal,78,20) FROM #Temp
SELECT SUBSTRING(StringVal,105,20) FROM #Temp

 

The substrings return invalid or unnecessary characters along with the desired output; hence we used PATINDEX to fetch only the numbers.

This conversion can also be achieved with a user-defined function that reads the data row by row and character by character. As I said earlier, there are many ways to solve a problem.
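For example, a helper along these lines could pull a value out by its key instead of relying on fixed positions; this is an assumption for illustration, not the author's function:

-- Hypothetical helper: extract the value of a given key from a "(key=value)" string
CREATE FUNCTION dbo.fn_GetTokenValue (@input VARCHAR(500), @key VARCHAR(100))
RETURNS VARCHAR(100)
AS
BEGIN
    DECLARE @start INT = CHARINDEX('(' + @key + '=', @input);
    IF @start = 0 RETURN NULL;                        -- key not present
    SET @start = @start + LEN(@key) + 2;              -- skip past "(key="
    DECLARE @end INT = CHARINDEX(')', @input, @start);
    RETURN SUBSTRING(@input, @start, @end - @start);  -- the raw value, may be empty
END;
GO

-- Usage: SELECT dbo.fn_GetTokenValue(StringVal, 'accountNumber') FROM #Temp;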



Logical vs Physical Computed Columns

Many people have asked me about the difference between using and not using ‘PERSISTED’ in a computed column; and how do we know whether the column values are being stored logically or physically. 

I would like to share the answer to these questions and some more information here.

Let me start with the definitions first.

  • A computed column is basically a virtual column that is not physically stored in the table.
  • A computed column which is marked as PERSISTED will have its data stored physically.

Let’s see the difference between computed column and persisted computed column:


CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice );

 

If we expand the table in Object Explorer, we see the column marked as computed, with its datatype derived from the columns used in the expression that defines it.


If we verify the properties of the table, it shows the following information.


Also, if we look into the system tables, it appears as a computed column.
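For instance, a query such as the following (a sketch) lists the computed columns of the table together with their definitions and the is_persisted flag referred to later:

SELECT name, is_computed, is_persisted, definition
FROM sys.computed_columns
WHERE object_id = OBJECT_ID('dbo.Products');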


Now drop this table and recreate it with the keyword “PERSISTED” added at the end of the computed column definition.

 

DROP TABLE dbo.Products
GO

CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice PERSISTED)

 

Now, look at the properties, values and attributes shown in Picture-1, Picture-2 and Picture-3, along with the values in the “sys.columns” system table. You will notice that there is no difference except for the “is_persisted” column value in the “sys.computed_columns” system table.

As for identifying whether the data is stored physically or logically, please continue reading the article.

 

CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice );

--INSERTING 1000 ROWS
DECLARE @QtyAvailable SMALLINT, @UnitPrice MONEY
SELECT @QtyAvailable = 1
SELECT @UnitPrice = 10
WHILE @QtyAvailable >= 1 AND @QtyAvailable <= 1000
BEGIN
    INSERT INTO dbo.Products VALUES(@QtyAvailable, @UnitPrice)
    SELECT @QtyAvailable = @QtyAvailable + 1
    SELECT @UnitPrice = @UnitPrice + 1
END
GO

sp_spaceused 'dbo.Products'
GO

 

As you can see, the data space consumed is 24 KB out of the 72 KB reserved; after the index size is deducted, 40 KB remains unused.

Let us check the same for computed column with PERSISTED.

CREATE TABLE dbo.Products (
    ProductID int IDENTITY (1,1) NOT NULL,
    QtyAvailable smallint,
    UnitPrice money,
    InventoryValue AS QtyAvailable * UnitPrice PERSISTED)

--INSERTING 1000 ROWS
DECLARE @QtyAvailable SMALLINT, @UnitPrice MONEY
SELECT @QtyAvailable = 1
SELECT @UnitPrice = 10
WHILE @QtyAvailable >= 1 AND @QtyAvailable <= 1000
BEGIN
    INSERT INTO dbo.Products VALUES(@QtyAvailable, @UnitPrice)
    SELECT @QtyAvailable = @QtyAvailable + 1
    SELECT @UnitPrice = @UnitPrice + 1
END
GO

sp_spaceused 'dbo.Products'
GO


After using “PERSISTED”, the computed column values are stored physically on disk, as you can see from the difference in space consumption.

Hope you liked this article & found it informative. Please do not hesitate to ask me for any further clarification.
