Showing posts with label PATINDEX. Show all posts
Showing posts with label PATINDEX. Show all posts

Monday, July 27, 2020

Handling "Json" and "Unstructured" Data in SQL

The below is to understand how we can handle JSON data in SQL Server 2016 version on-wards.

Sample JSON data is:

  {"accountNumber": 2020112, "pin": 2525},

  {"accountNumber": 2567899, "pin": 1462}

  {"accountNumber": 6789925, "pin": 2614}

  {"accountNumber": 9925678, "pin": 6142}

This can be extracted into columns easily in SQL Server with the help of following code:

However, what if the data is not in the exact JSON format i.e. no curly braces, no colon (:) to indicate name-value pairs and no square bracket to hold arrays and values separated by comma (,), in addition to this if it is to be worked on prior to SQL Server 2016 versions?

Sample data:

(accountNumber=2020112)(accountPin=2525)(Phone=+12345678)(countryId=121) (DateOfBirth=19810726)(NumberOfCallsMade=381)

(accountNumber=202019)(accountPin=98291)(Phone=)(countryId=1881) (DateOfBirth=19860526)(NumberOfCallsMade=31)

If you look at the data, there are two rows and are not in same lengthy strings since the values lengths are different. No value provided for Phone attribute in the second row.

The reason behind highlighting these points is, recently I got to work with this data and SUBSTRING function in SQL Server alone is not much a help. I managed to create the report at the end by using PATINDEX along with SUBSTRING functions.

This is to let you know that, within the available resources we can sort out the things even if it looked complicated and unsolvable.

Let me add these two rows in a TEMP table.

 

Now let’s see how can we convert these strings into columns. 

SELECT

(SELECT LEFT(

     Val,PATINDEX('%[^0-9]%', Val+'a')-1) from(

 SELECT SUBSTRING(SUBSTRING(StringVal,16,7), PATINDEX('%[0-9]%', SUBSTRING(StringVal,16,7)), LEN(SUBSTRING(StringVal,16,7))) Val

)X)AccountNumber,

(SELECT LEFT(

      Val,PATINDEX('%[^0-9]%', Val+'a')-1) from(

 SELECT SUBSTRING(SUBSTRING(StringVal,35,5), PATINDEX('%[0-9]%', SUBSTRING(StringVal,35,5)), LEN(SUBSTRING(StringVal,35,5))) Val

)X)Pin,

(SELECT LEFT(

      Val,PATINDEX('%[^0-9]%', Val+'a')-1) from(

 SELECT SUBSTRING(SUBSTRING(StringVal,47,10), PATINDEX('%[0-9]%', SUBSTRING(StringVal,47,10)), LEN(SUBSTRING(StringVal,47,10))) Val

)X)Phone,

(SELECT LEFT(

        Val,PATINDEX('%[^0-9]%', Val+'a')-1) from(

 SELECT SUBSTRING(SUBSTRING(StringVal,60,15), PATINDEX('%[0-9]%', SUBSTRING(StringVal,60,15)), LEN(SUBSTRING(StringVal,60,15))) Val

)X)CountryID,

(SELECT LEFT(

          Val,PATINDEX('%[^0-9]%', Val+'a')-1) from(

 SELECT SUBSTRING(SUBSTRING(StringVal,78,20), PATINDEX('%[0-9]%', SUBSTRING(StringVal,78,20)), LEN(SUBSTRING(StringVal,78,20))) Val

)X)DateOfBirth,

(SELECT LEFT(

          Val,PATINDEX('%[^0-9]%', Val+'a')-1) from(

 SELECT SUBSTRING(SUBSTRING(StringVal,105,20), PATINDEX('%[0-9]%', SUBSTRING(StringVal,105,20)), LEN(SUBSTRING(StringVal,105,20))) Val

)X)NumberOfCallsMade

FROM #Temp

Let’s see what substrings are returning.

SELECT SUBSTRING(StringVal,16,7) FROM #Temp

SELECT SUBSTRING(StringVal,35,5) FROM #Temp

SELECT SUBSTRING(StringVal,47,10) FROM #Temp

SELECT SUBSTRING(StringVal,60,15) FROM #Temp

SELECT SUBSTRING(StringVal,78,20) FROM #Temp

SELECT SUBSTRING(StringVal,105,20) FROM #Temp

 

Substrings are returning invalid or unnecessary characters apart from the desired output. Hence we used PATINDEX to fetch only numbers.

This conversion can be achieved using some user defined function as well by reading row by row and character by character. As I said earlier there will be many ways to solve the issues.



Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...