Tuesday, August 25, 2020

Sqoop Complete Tutorial Part-1

Sqoop (SQL-to-Hadoop) is a Hadoop ecosystem component and an ETL tool that offers the capability to extract data from various structured data stores, such as relational databases, with the help of MapReduce. This command-line interpreter efficiently transfers huge volumes of data into the Hadoop ecosystem (HDFS, HBase, and Hive). Similarly, it exports data from Hadoop back to SQL databases.

Sqoop offers parallel processing and fault tolerance, along with automation functionality that helps in scheduling data transfer jobs.


Sqoop Import

Imports individual tables or all the tables from an RDBMS into HDFS, Hive, or HBase. Regardless of the target location, the table data is stored as text files by default.

In HDFS, each row of the table is treated as a record, each table is stored in its own sub-directory, and the table data is stored as text files. Users can opt to store the table data in text format, as binary data, in Avro format, in row-columnar file format, or in SequenceFile format. Users can also choose whether the data should be compressed.

In Hive, the target database must be created before importing the tables from the RDBMS. If not, the tables will be imported into Hive's 'default' database.

Sqoop Import will take care of creating the table if it does not exist in the Hive metastore.
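
A minimal sketch of an import command, assuming a MySQL source with placeholder connection details, table name, and target directory:

sqoop import \
  --connect jdbc:mysql://localhost/yourdb \
  --username yourUser \
  --password yourPassword \
  --table emp \
  --target-dir /user/hadoop/emp

Adding --hive-import to the command would load the table into Hive instead of leaving the files only in HDFS.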

Sqoop Export

Exports files from the Hadoop Distributed File System to an RDBMS. Sqoop parses the contents of the files and converts them into rows of a table. Each line is split into multiple columns based on the specified delimiter.
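
A minimal sketch of an export command, assuming placeholder connection details, an existing target table in the RDBMS, and comma-delimited input files:

sqoop export \
  --connect jdbc:mysql://localhost/yourdb \
  --username yourUser \
  --password yourPassword \
  --table emp \
  --export-dir /user/hadoop/emp \
  --input-fields-terminated-by ','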

Before we start working with Sqoop, please note the following:

Sqoop is a command-line interface, and its commands are issued from the Hadoop cluster's terminal. There is hardly any need for a GUI since the scope of Sqoop is limited to importing and exporting data. Please also note that Sqoop is case sensitive. One has to be very careful with table, column, directory, and sub-directory names, source and target locations, and Sqoop reserved keywords.





Monday, August 24, 2020

Creating Linked Server for PostgreSQL in MSSQL

Creating a linked server in Microsoft SQL Server to connect to PostgreSQL

1) Download the PostgreSQL Unicode (x64) driver to support ODBC.

2) Go to ODBC Data Source Administrator on your machine and create a System DSN.


3) Go to SQL Server Management Studio and execute the below command in a new query window.

EXEC master.dbo.sp_addlinkedserver
     @server = N'POSTGRES',
     @srvproduct = N'PostgreSQL35W',
     @datasrc = 'PostgreSQL35W',
     @provider = N'MSDASQL',
     @provstr = N'Driver=PostgreSQL 64-bit ODBC Drivers;uid=yourUserName;Server=LocalHost;database=YourDatabaseName;pwd=DBPassword'

4) Once the command executes successfully, go to Object Explorer.

5) Expand the Server Objects and then Linked Servers. 

Now you will be able to see the newly created linked server. If you expand further, you will see the database name in the catalogs.
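
To verify the linked server, you can run a pass-through query against it; a minimal sketch, assuming a hypothetical table public.employees in the PostgreSQL database:

SELECT * FROM OPENQUERY(POSTGRES, 'SELECT * FROM public.employees LIMIT 10');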


Please do let me know if you are facing any issues.


Automation - PostgreSQL Restore Data

There are three methods for backing up PostgreSQL databases:

  • SQL dump
  • File System Level backup
  • Continuous archiving

The SQL dump method is used to take a full database backup. The backup file consists of SQL commands that recreate the database when it is restored. The below command is used to take a backup through the CLI.

pg_dump dbname > dumpfile

The dumpfile can be restored in other testing or staging servers by using the following command.

psql dbname < dumpfile


Now, if we want both to be handled in a single step, such as taking a backup and restoring it in another environment:

pg_dump -h host1 dbname | psql -h host2 dbname

The ability of pg_dump and psql to write to or read from pipes makes it possible to dump a database directly from one server to another in a single session. In other words, pg_dump creates the dump of the "dbname" database on the "host1" server, and psql restores it into the "dbname" database on the "host2" server.

These are all fine for one-off operations; however, we usually prefer to automate both the database backup and the restore. But it is a bit difficult in PostgreSQL to achieve such a scheduled or automated backup-and-restore process. Free tools may not offer much in this case, and most companies might not prefer to install third-party tools without knowing how safe they are.

So, this leaves us to create Windows Task Scheduler tasks to automate them.

Automating the data restoration in another instance/server:

Go to the server on which you want to restore the database and create the Windows task with the help of the below steps (a command-line alternative using schtasks is sketched after the steps).

1) Go to Task Scheduler.

2) Right-click on "Task Scheduler Library"

3) Go To Actions

4) Click New

5) Action should be "Start a program"

6) Under Settings, in Program/script, enter the below executable path.

C:\Windows\System32\cmd.exe

7) In the "Add arguments (optional)" add the following command.

/c "psql -U yourUserName yourDatabaseName  < D:\dbname.sql"

8) Click OK

9) Go to "Triggers" and Select "New" to configure the schedule

10) Select "Daily" and select the recurrence details.

11) Tick "Stop the task if it runs longer than" and choose "1 hour" (this is optional).

12) Click OK, FINISH.
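
If you prefer the command line, an equivalent scheduled task can be created with schtasks; a sketch, assuming the same restore command, a hypothetical task name, and a 2 AM daily run (quoting may need adjusting for your shell):

schtasks /Create /SC DAILY /ST 02:00 /TN "PG_RestoreDaily" /TR "cmd /c psql -U yourUserName yourDatabaseName < D:\dbname.sql"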


Automating the backup of the database.

Follow all the above steps, replacing the command with:

/c "pg_dump -h localhost -p 5432 -U yourUserName yourDatabaseName > D:\dbname.sql"

Here, pg_dump connects to localhost on the default port 5432, dumps the specified database, and writes the backup file to the D: drive. Change these parameters based on your environment.


If you are not familiar or comfortable with such an implementation, you can simply install one of the free GUI tools available for database backups.

For example, SQLBackupAndFTP

The free edition supports backing up two databases.


Hope you find this article helpful, do let me know if you need my assistance.

Sunday, August 23, 2020

Aggregate Functions in Analytic Context

The functions SUM, AVG, COUNT, MIN, and MAX are well-known aggregate functions that we use every day. They compute over multiple rows, typically by grouping them, and return a single summary value.

However, when it comes to analysis, aggregate functions carry out a twofold responsibility: the same aggregate function that summarizes a group of rows can also compute over a range of rows or a partition while still returning every row.


Considering the "EMP" data, the following aggregations/analysis can be done based on Salary.

1) Identify the total salary costing to the company.

2) Identify the total salary budget for each department.

3) Identify the total salary budget for each role in each department.

This means, for example, how much salary is being paid to the salesmen in the sales department versus the other roles within that department.

Point (1) can be achieved with a simple aggregation. The other two, however, are not limited to a single result value; they operate on a range, partition, or window, where the input rows are grouped and ordered using flexible conditions expressed through an OVER() clause. The partition in the above scenario is the department for point (2) and the job for point (3).
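
For instance, point (2) with a plain aggregation might look like the sketch below, assuming the same EMP and DEPT tables used in the queries that follow; it collapses the rows to one per department, unlike the OVER() versions further down:

SELECT a.DeptNo, b.DName, SUM(a.Sal) AS DepartmentSalaryCTC
FROM [TestDB1].[dbo].[EMP] a
JOIN [TestDB1].[dbo].[DEPT] b ON a.DeptNo = b.DeptNo
GROUP BY a.DeptNo, b.DName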

Here is the data for an example.


Aggregate functions with OVER clause:

SELECT DISTINCT a.[JOB],
       a.[DEPTNO],
       b.[DNAME],
       b.[LOC],
       SUM(a.SAL) OVER(PARTITION BY a.DeptNo) AS DepartmentwiseSalaryCTC,
       SUM(a.SAL) OVER(PARTITION BY a.Job) AS JobwiseSalaryCTC
FROM [TestDB1].[dbo].[EMP] a
JOIN [TestDB1].[dbo].[DEPT] b ON a.DeptNo = b.DeptNo


If you look at the output, the employees' salary cost to the company is computed both per department and per job. In the same way, all the aggregate functions can be used with the OVER clause to analyze the data from many angles.

SELECT DISTINCT a.[JOB], a.[DEPTNO], b.[DNAME],
       SUM(a.SAL) OVER(PARTITION BY a.DeptNo) AS DepartmentSalaryCTC,
       SUM(a.SAL) OVER(PARTITION BY a.Job) AS JobSalaryCTC,
       AVG(a.SAL) OVER(PARTITION BY a.DeptNo) AS AVGSalaryPerDept,
       AVG(a.SAL) OVER(PARTITION BY a.Job) AS AVGSalaryPerRole,
       MAX(a.SAL) OVER(PARTITION BY a.DeptNo) AS MaxSalaryInEachDept,
       MAX(a.SAL) OVER(PARTITION BY a.Job) AS MaxSalaryInEachRole,
       MIN(a.SAL) OVER(PARTITION BY a.DeptNo) AS MinSalaryInEachDept,
       MIN(a.SAL) OVER(PARTITION BY a.Job) AS MinSalaryInEachRole
FROM [TestDB1].[dbo].[EMP] a
JOIN [TestDB1].[dbo].[DEPT] b ON a.DeptNo = b.DeptNo


Please note that the syntax and the way of analysis are the same in most RDBMSs and in Big Data technologies like Hive and Impala.


Thursday, August 20, 2020

SQL Server 2016, 2017 and 2019 New Functions

Several string and analytical functions were introduced in the newer versions of SQL Server; they are listed below.


STRING_SPLIT

This is a table-valued function introduced in SQL Server 2016 (13.x) that splits a string into rows of sub-strings, based on a specified separator character.

I discussed its functionality and usage in my previous blog; please click here for details.
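
A minimal usage sketch, which returns one row per sub-string:

SELECT value FROM STRING_SPLIT('One,Two,Three', ',')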


APPROX_COUNT_DISTINCT

This function, introduced in SQL Server 2019 (15.x), returns the approximate number of unique non-null values in a group.

I discussed its functionality and usage in my previous blogs; please click here and here for details.
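
A minimal usage sketch, assuming the EMP table used elsewhere in this blog:

SELECT APPROX_COUNT_DISTINCT(Empno) AS ApproxUniqueEmployees FROM Emp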


STRING_AGG

This function, introduced in SQL Server 2017 (14.x), concatenates the values of string expressions and places separator values between them. The separator is not added at the end of the string.

It is equivalent to MySQL’s GROUP_CONCAT and Oracle's LISTAGG function.

I discussed its functionality and usage in my previous blog; please click here for details.
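
A minimal usage sketch, assuming the EMP table used elsewhere in this blog; it returns one comma-separated list of employee names per department:

SELECT DeptNo, STRING_AGG(Ename, ', ') AS EmployeeNames
FROM Emp
GROUP BY DeptNo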


CONCAT_WS 

This function is introduced in SQL Server 2017 (14.x); it returns a string resulting from the concatenation, or joining, of two or more string values in an end-to-end manner. It separates those concatenated string values with the delimiter specified in the first function argument. (CONCAT_WS indicates concatenate with separator.)

It requires 3 to 254 arguments to be passed.

Example:

SELECT CONCAT_WS(',','One','Two','Three','Four')

Result: One,Two,Three,Four


TRIM

This function, introduced in SQL Server 2017 (14.x), removes the leading and trailing spaces from a given string.

SELECT TRIM('       exampleText     ');

Prior to this version, the following was the way to trim the spaces.

SELECT LTRIM(RTRIM('     exampleText    '));


DROP TABLE IF EXISTS  

Although there are workarounds and strategies to drop database objects only if they exist, this is the easy and direct approach. It was introduced in SQL Server 2016 (13.x).

DROP TABLE IF EXISTS TableName;  

In earlier versions, the following was the approach.

-- To drop the temporary table
IF OBJECT_ID(N'tempdb..#temp1', N'U') IS NOT NULL
    DROP TABLE #temp1;
GO

-- To drop the permanent table
IF OBJECT_ID('dbo.Emp', 'U') IS NOT NULL
    DROP TABLE dbo.Emp;
GO

Hope you find this article helpful.


Wednesday, August 19, 2020

Approximate Count of Unique Values in SQL Server

In my previous blog, I discussed the approximate count of unique values for which a predefined function is available in Impala (NDV), Oracle 12c (APPROX_COUNT_DISTINCT) and in MongoDB (estimatedDocumentCount()). 

However, I didn't mention SQL Server, since no such function is available in the versions I am using. The function has since been introduced in SQL Server 2019.

The function Approx_Count_Distinct does the same thing as mentioned in my previous blog. 

If you want to know more about this function, I would suggest you click here. The author (Rajendra Gupta) covers the functionality thoroughly, concentrating on each piece and clarifying every point.

 

SQL Server 2016 - Compatibility Issues - Invalid object name

If you are using SQL Server 2016 and you are still getting the error Invalid object name 'STRING_SPLIT', as shown below, while using functions newly introduced in that version -


then it means the database compatibility level is set to a level at which SQL Server cannot find these functions. Such errors occur when we migrate or restore databases to a newer version and try to use the newly introduced functions without changing the database compatibility level.

In this case, we are going to change the compatibility level from 120 to 130 so that the STRING_SPLIT function works. The change can be made through the UI steps below, or with the T-SQL statement sketched after them.
 

Follow the below steps:

1) Go to Object explorer
2) Right-click on the database.
3) Go to Options
4) Change the Compatibility Level from SQL Server 2014 (120) to SQL Server 2016 (130)
5) Click OK and execute the query again.
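
Alternatively, the same change can be made with T-SQL; a sketch, with the database name as a placeholder:

ALTER DATABASE YourDatabaseName SET COMPATIBILITY_LEVEL = 130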


Word Count in SQL Server - String_Split Usage

STRING_SPLIT is a table-valued function that splits the input text on a separator and returns the sub-strings as rows. This function was introduced in SQL Server 2016.

String_split will help in many ways.

1) You can get the word count from the given text.

2) You can pass the string values to a query.


Word Count in SQL

SELECT COUNT(value) 
FROM 
    STRING_SPLIT('This is just a text taken from the most popular book', ' ')


STRING_SPLIT in queries to provide multiple values in a WHERE clause.

DECLARE @EmpIDs NVARCHAR(100) = '7369,7499,7521'
SELECT Ename
FROM   Emp
WHERE  Empno IN (SELECT value FROM STRING_SPLIT(@EmpIDs, ','))



If you are looking for a word-count program using HiveQL, then click here.
If you are looking for a word-count program using Pig, then click here.

Identify the active node in SQL Server Cluster

Most DBAs are probably already aware of how to identify the active SQL Server node in a clustered environment; however, this will certainly help beginners.

The dynamic management views and functions return server-state and database-state information. A user who has the VIEW SERVER STATE permission on the server and VIEW DATABASE STATE on the database can query these views.

SQL Server node information is available in sys.dm_os_cluster_nodes.


SELECT [NodeName],
[Status],
[status_description],
[is_current_owner]
FROM [master].[sys].[dm_os_cluster_nodes]


As per the above screenshot, there are 4 nodes and SQLNode1 is the active node in the cluster.



Tuesday, August 18, 2020

Reason to convert datetime to bigint - What is epoch time

In my previous article, I discussed the conversion of a bigint value to date-time in both PostgreSQL and SQL Server. It raises several questions:

1) Why was it converted into bigint in the first place?

2) How is the bigint value related to the date-time?

3) How do we convert a date into a bigint value?

Both bigint and datetime take 8 bytes of storage. Moreover, datetime looks more readable and understandable. However, bigint is the better choice for a primary key, as SQL engines cannot differentiate between datetime values that fall within the same precision range. So, if we want a date column to be a primary key, we need to convert the value to bigint. Apart from that, integer values are generally faster to retrieve than string or datetime values, even if only by a few milliseconds.

The bigint values that we see in such date-time columns are actually figures converted based on Unix epoch time. Look at the example below.


SELECT EXTRACT(EPOCH FROM NOW()); 

This returns a double-precision floating-point value that represents the exact number of seconds (with fractional milliseconds) that have passed since January 1, 1970. This means there is a difference of 1,597,737,544 seconds between "1970-01-01 00:00:00" and "2020-08-18 09:59:04" (UTC).
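
In SQL Server, a sketch of the equivalent conversions, reusing the example value above:

-- Seconds elapsed since the Unix epoch (UTC)
SELECT DATEDIFF(SECOND, '1970-01-01', GETUTCDATE())

-- Convert an epoch value back to a date-time
SELECT DATEADD(SECOND, 1597737544, '1970-01-01')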

Hope this answers the above-mentioned questions. However, if you are curious to know what epoch time is, and why 1970-01-01 is the start date, then continue reading below.

Early Unix engineers picked that date arbitrarily, because they needed to set a uniform date for the beginning of time, and New Year's Day, 1970, seemed convenient.

So the Unix epoch is midnight on January 1, 1970. It's time zero for any device that uses Unix.

