Friday, August 28, 2020

XML in SQL Server

This is my second post on XML, coming about 10 years after the first. I was content with the earlier post, but over the years I have experienced and learned many more things that I want to share with current and future readers.

Let me begin with the basics.

XML (Extensible Markup Language) represents information in a hierarchical (tree) structure in a simple text format. An XML document consists of XML elements, and every element can contain sub-elements, text, or attributes. The hierarchy starts at the root element and branches down to the lowest-level elements.

SQL Server's FOR XML clause supports several modes that change the shape of the generated XML.

Instead of reinventing the wheel, I will quote the MSDN descriptions here and pair them with the examples I prepared.

Here is the data for the tryout.

VideoID VideoDesc vcVideoLocation
842735 SQL D:\Partition23\20200101\73899mpart
842736 Java   D:\Partition23\20200101\74900mpart
842738 XML D:\Partition23\20200101\74901mpart
842739 Hive D:\Partition23\20200101\74901mpart
842740 Sqoop D:\Partition23\20200101\74900mpart
842742 Impala D:\Partition23\20200101\73899mpart
842743 SQL D:\Partition23\20200101\74900mpart
842744 Hadoop D:\Partition23\20200104\74791mpart
842745 SQL D:\Partition23\20200101\74901mpart
842746 Hive D:\Partition23\20200101\73899mpart
842747 Sqoop D:\Partition23\20200104\74791mpart
842748 Impala D:\Partition23\20200101\74900mpart
842749 Spark D:\Partition23\20200101\74900mpart
842750 HBase D:\Partition23\20200101\73899mpart
842751 Scala D:\Partition23\20200104\74791mpart
842752 Hive D:\Partition23\20200104\74791mpart
842753 SQL D:\Partition23\20200101\74901mpart
842756 Impala D:\Partition23\20200101\74901mpart


SQL Code:

CREATE TABLE VideoData(VideoID BIGINT, VideoDesc VARCHAR(10), vcVideoLocation NVARCHAR(MAX))

INSERT INTO VideoData(VideoID, VideoDesc, vcVideoLocation) VALUES
(842735,'SQL','D:\Partition23\20200101\73899mpart'),
(842736,'Java','D:\Partition23\20200101\74900mpart'),
(842738,'XML','D:\Partition23\20200101\74901mpart'),
(842739,'Hive','D:\Partition23\20200101\74901mpart'),
(842740,'Sqoop','D:\Partition23\20200101\74900mpart'),
(842742,'Impala','D:\Partition23\20200101\73899mpart'),
(842743,'SQL','D:\Partition23\20200101\74900mpart'),
(842744,'Hadoop','D:\Partition23\20200104\74791mpart'),
(842745,'SQL','D:\Partition23\20200101\74901mpart'),
(842746,'Hive','D:\Partition23\20200101\73899mpart'),
(842747,'Sqoop','D:\Partition23\20200104\74791mpart'),
(842748,'Impala','D:\Partition23\20200101\74900mpart'),
(842749,'Spark','D:\Partition23\20200101\74900mpart'),
(842750,'HBase','D:\Partition23\20200101\73899mpart'),
(842751,'Scala','D:\Partition23\20200104\74791mpart'),
(842752,'Hive','D:\Partition23\20200104\74791mpart'),
(842753,'SQL','D:\Partition23\20200101\74901mpart'),
(842756,'Impala','D:\Partition23\20200101\74901mpart');


XML RAW:

Each row in the result set is taken as one element with your columns being the attributes. 


SELECT
VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML RAW;
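For the sample data above, the first couple of rows of the FOR XML RAW output are shaped as below (SQL Server returns the whole result as one XML fragment; it is split across lines here only for readability):

<row VideoID="842735" VideoDesc="SQL" vcVideoLocation="D:\Partition23\20200101\73899mpart"/>
<row VideoID="842736" VideoDesc="Java" vcVideoLocation="D:\Partition23\20200101\74900mpart"/>
...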

If you look at the output, each row from the table is represented as a single XML element named <row>. As stated at the beginning, elements can contain sub-elements, text, or attributes; here the table's columns are treated as attributes. What happens if the keyword ELEMENTS is added after FOR XML RAW? Let's check it out.

SELECT VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML RAW, ELEMENTS;
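With ELEMENTS added, the first row of the sample data comes back in element-centric form, roughly like this (indented here for readability):

<row>
  <VideoID>842735</VideoID>
  <VideoDesc>SQL</VideoDesc>
  <vcVideoLocation>D:\Partition23\20200101\73899mpart</vcVideoLocation>
</row>
...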

Now, each row element has column names as sub-elements. 

So FOR XML RAW and FOR XML RAW, ELEMENTS return slightly different shapes.


XML AUTO:

Returns query results in a simple, nested XML tree. Each table in the FROM clause for which at least one column is listed in the SELECT clause is represented as an XML element. 

Each row of the table is represented as an element named after the table.


SELECT
VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML AUTO; 
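For the same sample data, FOR XML AUTO names each row element after the table, for example:

<VideoData VideoID="842735" VideoDesc="SQL" vcVideoLocation="D:\Partition23\20200101\73899mpart"/>
<VideoData VideoID="842736" VideoDesc="Java" vcVideoLocation="D:\Partition23\20200101\74900mpart"/>
...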


Let’s add “ELEMENTS” to it and see how the output changes.

SELECT VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML AUTO, ELEMENTS;
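The first row of the sample data now comes back roughly like this (indented for readability):

<VideoData>
  <VideoID>842735</VideoID>
  <VideoDesc>SQL</VideoDesc>
  <vcVideoLocation>D:\Partition23\20200101\73899mpart</vcVideoLocation>
</VideoData>
...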

Each row is now an element named after the table, and each element has the column names as sub-elements.

 

XML PATH:

Provides a simpler way to mix elements and attributes, and to introduce additional nesting for representing complex properties. Table columns are passed as child elements.

SELECT VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML PATH;
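For the sample data, the first row of the FOR XML PATH output looks like this (indented for readability):

<row>
  <VideoID>842735</VideoID>
  <VideoDesc>SQL</VideoDesc>
  <vcVideoLocation>D:\Partition23\20200101\73899mpart</vcVideoLocation>
</row>
...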


If you compare the outcome of “FOR XML RAW, ELEMENTS” with “FOR XML PATH”, there is no difference at all; both return the same elements and sub-elements.

SELECT VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML PATH, ELEMENTS; 

Again, the output is the same as “FOR XML RAW, ELEMENTS” and plain “FOR XML PATH”.


Now look at the example below -

SELECT VideoID, VideoDesc, vcVideoLocation
FROM VideoData
FOR XML PATH('VideoInfo');
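With the element name overridden, the first row of the sample data comes back as (indented for readability):

<VideoInfo>
  <VideoID>842735</VideoID>
  <VideoDesc>SQL</VideoDesc>
  <vcVideoLocation>D:\Partition23\20200101\73899mpart</vcVideoLocation>
</VideoInfo>
...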

Users can replace the default element name “row” with any desired name. In the above example, since the data is related to videos, I used “VideoInfo” for readability. It serves no other purpose.

Synopsis: RAW returns each table row as an element named <row> with the column names as attributes. AUTO returns each row as an element named after the table, again with the columns as attributes. PATH returns each row as an element with the columns as child (sub-) elements.

Now let's go back to the data. The table contains video information: the topic each video covers and where the video file is located. There is more than one video on the same subject, and several videos share the same location.

Using the same modes, we can produce additional shapes by grouping the data either by the video's topic or by its location, somewhat like a bitmap-style clustering. The example below makes this easier to understand.


SELECT vcVideoLocation AS '@path',
       (SELECT VideoID AS [index], VideoDesc
        FROM VideoData t1
        WHERE t1.vcVideoLocation = t2.vcVideoLocation
        FOR XML PATH('VideoInfo'), TYPE)
FROM (SELECT DISTINCT vcVideoLocation FROM VideoData) t2
FOR XML PATH('vLocation'), ROOT('data')

“data” is the root element for the whole table, each location is a “vLocation” sub-element, and within each location the individual videos appear as user-defined “VideoInfo” sub-elements. Look at the results below.
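A trimmed sample of the output for the sample data (one location shown, indented for readability; the order of the groups may vary since no ORDER BY is used):

<data>
  <vLocation path="D:\Partition23\20200101\73899mpart">
    <VideoInfo>
      <index>842735</index>
      <VideoDesc>SQL</VideoDesc>
    </VideoInfo>
    <VideoInfo>
      <index>842742</index>
      <VideoDesc>Impala</VideoDesc>
    </VideoInfo>
    ...
  </vLocation>
  ...
</data>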




In the above example, the data is grouped by “location”. Similarly, we can shape it by grouping by “Subject” (i.e. Video Description).

SELECT VideoDesc AS '@Desc',
       (SELECT VideoID AS [index], vcVideoLocation
        FROM VideoData t1
        WHERE t1.VideoDesc = t2.VideoDesc
        FOR XML PATH('VideoInfo'), TYPE)
FROM (SELECT DISTINCT VideoDesc FROM VideoData) t2
FOR XML PATH('VideoType'), ROOT('data')

Output: 
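A trimmed sample for the sample data (one subject shown, indented for readability; group order may vary):

<data>
  <VideoType Desc="SQL">
    <VideoInfo>
      <index>842735</index>
      <vcVideoLocation>D:\Partition23\20200101\73899mpart</vcVideoLocation>
    </VideoInfo>
    <VideoInfo>
      <index>842743</index>
      <vcVideoLocation>D:\Partition23\20200101\74900mpart</vcVideoLocation>
    </VideoInfo>
    ...
  </VideoType>
  ...
</data>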




Hope you find this article helpful.


Thursday, August 27, 2020

JSON in SQL Server - Load and convert JSON data

Data is the backbone of any business, and it can come in any sort of format: rows in your relational databases, a tweet on Twitter, a post on Facebook, or a message on Instagram. Worldwide data is expected to hit 175 zettabytes by 2025, and over 80% of it will be unstructured, driven largely by social media and mobile applications.

Unstructured data usually refers to information that doesn't reside in a traditional row-column database; data stored in XML and JSON documents, CSV files, and Excel files falls into this category. RDBMSs such as Oracle, Microsoft SQL Server, and MySQL are adapting and introducing functionality to handle unstructured data.

My recent articles have focused on such newly introduced functions, which help analyze, import, or convert unstructured data into a readable, meaningful format.

In this article, we will see how a JSON-formatted file can be converted into the usual row-column table format in SQL Server 2016 using the OPENJSON function. JSON (JavaScript Object Notation) is an open-standard file format that consists of attribute-value pairs and array data types. Look at the example below.

Filename: “test.json”

[{"Name":"Zafar Iqbal","surname": "Farooq", "Age":30, "Address":["142", "Street 29c", "Rashidiya", "Dubai"]},

{"Name":"Tahir","surname": "Farooq", "Age":24, "Address":["12C", "Happy Street", "Al Quoz", "Dubai"]}]

 

DECLARE @Json VARCHAR(MAX)

SELECT @Json = BulkColumn
FROM OPENROWSET (BULK 'D:\MyDocuments\blog posts\JSON-SQLServer\test.json', SINGLE_CLOB) AS j

SELECT *
FROM OPENJSON (@Json)
WITH (Name     VARCHAR(20) '$.Name',
      LastName VARCHAR(20) '$.surname',
      Age      INT         '$.Age',
      Building VARCHAR(10) '$.Address[0]',
      Street   VARCHAR(20) '$.Address[1]',
      Area     VARCHAR(20) '$.Address[2]',
      City     VARCHAR(20) '$.Address[3]')

Explanation:

  • In the above code, we are importing the “test.json” contents into a variable named @Json.
     
  • The OPENROWSET function with the BULK option reads the data from the file. It is a table-valued function that returns a table with a single column, and that column is named “BulkColumn”.

  • SINGLE_CLOB reads a file as VARCHAR(MAX).

  • “AS” in the above code provides a correlation name (alias) for the bulk rowset in the FROM clause; the alias itself can be anything.

  • Once we have populated the JSON data into a variable, we can convert it into rows and columns by parsing it with the OPENJSON function. OPENJSON accepts JSON as a parameter and returns the key-value pairs of the first-level elements in the JSON.

  • The Address field in the JSON is an array consisting of building number, street, area, and city, which we split apart in the query. The output is shown below –
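For the two records in test.json, the query above returns the result set below:

Name         LastName  Age  Building  Street        Area       City
Zafar Iqbal  Farooq    30   142       Street 29c    Rashidiya  Dubai
Tahir        Farooq    24   12C       Happy Street  Al Quoz    Dubai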

Wednesday, August 26, 2020

Sqoop Complete Tutorial Part-4

This is the continuation part of "Sqoop Complete Tutorial". If you want to read -


11) Importing data from MySQL to HDFS based on a SQL query

With a free-form query, Sqoop requires the literal token $CONDITIONS in the WHERE clause (it substitutes its own split conditions there) and a --target-dir. A sketch, reusing the retail_db connection details from the earlier examples:

sqoop import
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--query 'SELECT * FROM categories WHERE $CONDITIONS'
--target-dir categories_query
-m 1

Append & LastModified

The following is as per the Sqoop documentation.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
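A rough sketch of such a saved job, reusing the emp example that follows (the job name is illustrative; -P prompts for the password at run time, and the metastore keeps track of the last value between runs):

sqoop job --create emp_incremental -- import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
-P
--check-column empno
--incremental append
--last-value 0

sqoop job --exec emp_incremental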

The two examples below demonstrate incremental imports.

12) Importing data from MySQL to HDFS - Incremental Import

Assume the following "emp" data exists in MySQL's "EMPDEPT" database.


Import it into HDFS using the command below:

sqoop import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
--password cloudera
-m 1

After importing it into HDFS, insert two new rows into the MySQL table.

These two new records also need to be appended to the data in the HDFS location. To achieve that, use the command below.

sqoop import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
--password cloudera
--check-column empno
--incremental append
--last-value 7521

13) Import and append data from MySQL to HDFS

Consider the same example specified above.

As per the data, the latest HireDate in the table is 1981-09-28. Now suppose we update one or two records, as shown below.

The records updated in the source table also need to be updated in the destination (HDFS). The command to do that -

sqoop import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
--password cloudera   
--check-column HireDate
--incremental lastmodified
--last-value 1981-09-28
--target-dir /user/cloudera/empdept


Please click here for the next part


Find SQL scheduled jobs to run during a specific time range

There are times when we need to know which jobs are going to run in the next half hour or so. This is quite easy to see in the Job Activity Monitor; however, if you want to get it through a query, or build a report from it, here is the code.


SELECT
    j.name AS JobName,
    dbo.agent_datetime(next_run_date, next_run_time) AS NextRunDateTime,
    STUFF(STUFF(RIGHT(REPLICATE('0', 6) + CAST(active_start_time AS VARCHAR(6)), 6), 3, 0, ':'), 6, 0, ':') AS ActiveStartTime,
    STUFF(STUFF(RIGHT(REPLICATE('0', 6) + CAST(active_end_time AS VARCHAR(6)), 6), 3, 0, ':'), 6, 0, ':') AS ActiveEndTime,
    CASE
        WHEN freq_type = 4  THEN 'Daily'
        WHEN freq_type = 8  THEN 'Weekly'
        WHEN freq_type = 16 THEN 'Monthly'
        WHEN freq_type = 32 THEN 'Monthly'
    END AS Frequency,
    'every ' + CAST(freq_interval AS VARCHAR(3)) + ' day(s)' AS [Days],
    CASE
        WHEN freq_subday_type = 2 THEN ' every ' + CAST(freq_subday_interval AS VARCHAR(7)) + ' seconds'
        WHEN freq_subday_type = 4 THEN ' every ' + CAST(freq_subday_interval AS VARCHAR(7)) + ' minutes'
        WHEN freq_subday_type = 8 THEN ' every ' + CAST(freq_subday_interval AS VARCHAR(7)) + ' hours'
        ELSE ' starting at ' + STUFF(STUFF(RIGHT(REPLICATE('0', 6) + CAST(active_start_time AS VARCHAR(6)), 6), 3, 0, ':'), 6, 0, ':')
    END AS [Time]
FROM msdb..sysjobschedules s
JOIN msdb..sysjobs j ON s.job_id = j.job_id
JOIN msdb..sysschedules sc ON sc.schedule_id = s.schedule_id
WHERE CONVERT(VARCHAR(10), dbo.agent_datetime(next_run_date, next_run_time), 101) = CONVERT(VARCHAR(10), GETDATE(), 101)
  AND next_run_date > 0 AND next_run_time > 0
  AND CONVERT(VARCHAR(5), dbo.agent_datetime(next_run_date, next_run_time), 114) >= '16:00'
  AND CONVERT(VARCHAR(5), dbo.agent_datetime(next_run_date, next_run_time), 114) <= '20:30'
  AND j.enabled = 1
ORDER BY dbo.agent_datetime(next_run_date, next_run_time)


This will return the data in the below format.

 

Name: Name of the job

Next run date-time: Next occurrence of the job that falls between the time frame mentioned in the query i.e., 16:00 to 20:30 hours.

Active Start Time: Job starting point in the day.

Active End Time: Job ending point in the day.

Frequency: Daily/Weekly/Monthly

Days: Scheduled days (every day, every 2 days, every x number of days, etc.)

Time: Interval between each occurrence.



Tuesday, August 25, 2020

Sqoop Complete Tutorial Part-3

This is the continuation part of "Sqoop Complete Tutorial". If you want to read -

Part-1, please click here


Part-2, please click here


5) Importing all the tables but excluding a few tables from MySQL to HDFS

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--exclude-tables "order_items" 

Note: import-all-tables has no option to import only a selected subset of tables into HDFS; the only filter available is "exclude-tables". If there is more than one table to exclude, list them in the same property separated by commas (,), as shown below.

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--exclude-tables "order_items,orders" 


6) Importing part of the data from a MySQL table to HDFS using a WHERE condition.

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--username root
--table emp
--where "ename = 'scott'"
-m 1


7) Importing a few columns from a table from MySQL database to HDFS. 

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMP  
--columns "EmpID,Sal”
-m 1


8) Importing and compressing the table's data and storing it in HDFS.

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--username root 
--table emp
--target-dir empdata/emp
--compress -m 1


9) Importing data from MySQL to HDFS with a specific delimiter.

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMP
--columns "EmpID,Sal”
--fields-terminated-by '@'
-m 1

Note: This will throw an error if you do not provide the username and password to connect to MySQL. Once the data is imported, read the contents of the file either with the Hue file browser or from the Hadoop terminal using the usual Linux-style "-cat" command. You'll find '@' as the delimiter between the columns.


10) Import data by splitting it based on a specific column.

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMPDEPT
--username root
-P
--split-by deptno

Note: This is one of the most important options when we deal with a huge volume of data. If such data is imported with too few mappers (splits), the job can take a long time to complete and performance suffers. Hadoop is built for parallelism, and to exploit it we use the --split-by option on a column so the work is divided into more splits, which means more parallel processes.

In the above example, I have employee and department data in my database, which I combined into a single table, EmpDept. Say there are 10 departments with 50 employees in each. When I execute the above command, Sqoop divides the work based on the range of "deptno" values and runs the resulting import tasks simultaneously. This is how import performance is improved through parallelism; see the sketch below.
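As a rough sketch (same table and column as above), the degree of parallelism can also be set explicitly by combining --split-by with -m/--num-mappers:

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMPDEPT
--username root
-P
--split-by deptno
-m 10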


Sqoop Complete Tutorial Part-2

This is the continuation part of "Sqoop Complete Tutorial". If you want to read Part-1, please click here. We are about to discuss the following in the current and in the upcoming parts.

Importing data from MySQL to HDFS

  • Importing a table from MySQL to the default location of the HDFS
  • Importing a table from MySQL to user-specified location in HDFS
  • Importing all the tables from MySQL to the default location of the HDFS
  • Importing all the tables from MySQL to user-specified location in HDFS.
  • Importing all the tables but excluding a few tables from MySQL to HDFS.
  • Importing part of the data from a table from MySQL to HDFS using WHERE condition.
  • Importing a few columns from a table from MySQL to HDFS
  • Importing and compressing the table's data to HDFS
  • Importing data by specifying the column delimiter.
  • Importing data by splitting the data based on a specific column.
  • Import the data from MySQL to HDFS based on a SQL Query.
  • Incrementally importing data from MySQL into HDFS.
  • Import and append the data.
  • Import data from MySQL and store in a sequential file format in HDFS.
  • Read the contents of the source database without importing it.
  • Get the list of the databases.
The format of the import process is the same; however, some additional keywords are used when the target is not HDFS.

Importing data from MySQL to Hive
  • Importing a table from MySQL to Hive's default database.
  • Importing a table from MySQL to Hive's user's database.
  • Importing all the tables from MySQL to Hive
  • Importing all the tables but excluding a few from MySQL to Hive
  • Import the table's structure from MySQL to Hive without data.
  • Import the table from MySQL to Hive with the specific delimiter.
Sqoop Automation
  • Sqoop - Creating a job
  • Sqoop - Displaying the list of the jobs
  • Sqoop - Displaying the properties of a job
  • Sqoop - Executing the job
  • Sqoop - Deleting the job

Importing data from MySQL to HBase
  • Importing a table from MySQL to HBase
  • Importing a table with some selected columns from MySQL to HBase

Ok, let us begin the exercises with the above-mentioned points one by one.

Importing data from MySQL to HDFS


1) Importing a table from MySQL to HDFS default location.

sqoop import 
        --connect jdbc:mysql://localhost/retail_db 
        --table categories 
        --username root 
        --password cloudera 
        -m 1


Explanation: "Sqoop Import" is the keyword to be used while import. Sqoop is a JDBC utility to move data between Hadoop Ecosystems and relational databases. Username and password must be provided to connect to the RDBMS. 

Assigning mappers is the user's choice; users can set the number of mappers based on the size of the data being imported. If the password should not be disclosed in the connection string, the user can simply pass the -P argument without providing any password; at run time, Sqoop will prompt for the password. Refer to the example below -

Importing a table from MySQL to HDFS default location - Hiding password.

sqoop import 
        --connect jdbc:mysql://localhost/retail_db 
        --table categories 
        --username root 
        -P
        -m 1

The above command will create a directory named “categories” under the default HDFS location "/user/cloudera", and the table data will be stored there as a file (part-m-00000).
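To verify the import, the directory and file can be inspected from the Hadoop terminal (paths as described above; the Hue file browser shows the same thing):

hdfs dfs -ls /user/cloudera/categories
hdfs dfs -cat /user/cloudera/categories/part-m-00000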

2) Importing a table from MySQL to HDFS user-specified target location.

sqoop import
--connect jdbc:mysql://localhost/retail_db
--table categories
--username root
--password cloudera
--target-dir 'retail_db'
-m 1

3) Importing all the tables from MySQL to HDFS default location

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera

Note: All the tables will be placed under the user's HDFS home directory, each table as its own sub-directory of files.

4) Importing all the tables from MySQL to HDFS user-specified target location.

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--warehouse-dir 'retaildb'

Note: All the tables will be placed in the specified directory. The keyword "target-dir" does not work with import-all-tables; use "warehouse-dir" instead. No mapper count is assigned here, since a single mapper would not be enough to import the "retail_db" database, as its volume is larger than the HDFS block size.

 

Please click here for the next part.

