Sunday, August 16, 2020

PostgreSQL BigInt Value - Conversion to UTC and Local Times

Most ticketing systems use either MySQL or PostgreSQL by default and often store the date and time of DML events in bigint format. When such tables' data is moved to SQL Server, the bigint value should be converted into a date and time for readability and analysis purposes.

This is nothing new; however, this article will help you convert the bigint value to a readable UTC (Coordinated Universal Time) and local time in both SQL Server and PostgreSQL.


The below script works in SQL Server and converts the big integer value to UTC and local times.

DECLARE @BigIntVal BIGINT;

SELECT @BigIntVal = 1574984632061;

-- Epoch milliseconds -> seconds, added to the Unix epoch (1970-01-01) gives UTC;
-- adding 4 hours then gives Dubai local time
SELECT DATEADD(hh, 4, DATEADD(s, @BigIntVal / 1000, CONVERT(DATETIME, '1970-01-01 00:00:00')));

The bigint value '1574984632061' is first converted into UTC time and then converted to Dubai local time (UTC+4) by adding 4 hours to it. You can add or subtract hours based on your location/timezone.
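If you are on SQL Server 2016 or later, you can avoid hard-coding the offset and let the engine apply the time-zone rules. Below is a minimal sketch of that approach; it assumes the Windows time-zone name 'Arabian Standard Time' (UTC+4, covering Dubai) is available on your server.

DECLARE @BigIntVal BIGINT = 1574984632061;

-- Build the UTC value first, then let AT TIME ZONE handle the UTC-to-Dubai conversion
SELECT DATEADD(s, @BigIntVal / 1000, CONVERT(DATETIME2, '1970-01-01 00:00:00'))
       AT TIME ZONE 'UTC' AT TIME ZONE 'Arabian Standard Time' AS DubaiLocalTime;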


DECLARE @BigIntVal BIGINT;

SELECT @BigIntVal = 1574984632061;

-- Epoch milliseconds to UTC only (no local-time offset applied)
SELECT DATEADD(s, @BigIntVal / 1000, CONVERT(DATETIME, '1970-01-01 00:00:00'));

The above will return the UTC date-time. Similarly, the following code in PostgreSQL will convert the big integer value to UTC date-time.

SELECT TO_CHAR(TO_TIMESTAMP(1574984632061 / 1000), 'YYYY-MM-DD HH24:MI:SS');



Add +4 hours to it to get the local (Dubai) date and time.

SELECT TO_CHAR(TO_TIMESTAMP(1574984632061 / 1000)
               + INTERVAL '4 hour', 'YYYY-MM-DD HH24:MI:SS') AS LocalTime;
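Going the other way can also be handy, for example to compare converted values against the original bigint column. A minimal sketch in PostgreSQL, using the UTC value produced above and treating the timestamp as UTC:

-- Convert a timestamp back to epoch milliseconds; round-trips to 1574984632000
-- (the original value with the milliseconds truncated)
SELECT (EXTRACT(EPOCH FROM TIMESTAMP '2019-11-28 23:43:52') * 1000)::BIGINT AS epoch_ms;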


Hope you find this article helpful.



Computing Running & Cumulative Average

This article helps you understand how to calculate a running average and a cumulative average on a given data set.

 

As discussed earlier, windowing functions work the same way in SQL Server, Oracle, and Hive; they perform computations across a set of table rows related to the current row, for example by department, area, category, or date/time.

 

Please observe the sample data set: it contains monthly summarized sales amounts by location, from which we will calculate both the running average and the cumulative average.
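For reference, here is a minimal sketch of the table the query below assumes; the column names (Location, YYYYMM, Amount) come from the query itself, while the exact data types are assumptions.

-- Assumed structure of the SalesData table used in the example below
CREATE TABLE SalesData (
    Location VARCHAR(50),    -- branch or city
    YYYYMM   INT,            -- sales month in YYYYMM format
    Amount   DECIMAL(18, 2)  -- summarized monthly sales amount
);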

Running Average / Moving Average / Rolling Average:

The reason to compute a rolling average is to smooth out the highs and lows of the data set and identify patterns or trends in the information.

Cumulative Average:

A cumulative (span-of-time) measure is the most common way time is used in a measure; traditionally, cumulative measures aggregate data across a span of time, from the first row up to the current one.


Both running and cumulative average can be accomplished in a single SELECT statement.

SELECT Location, YYYYMM, Amount,
       AVG(Amount) OVER (PARTITION BY Location ORDER BY YYYYMM
                         ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS MovingAverage,
       AVG(Amount) OVER (PARTITION BY Location ORDER BY YYYYMM
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CumulativeAverage
FROM SalesData;
GO

Here, the moving average covers the current month and the one before it within each location, while the cumulative average runs from the first month up to the current month for that location.


Do let me know if you need more clarification.


Saturday, August 15, 2020

Displaying the column names in Hive

In my previous post, I discussed setting a property at the Hive prompt to print the name of the database we are currently working in.

Similarly, there is another Hive configuration property that needs to be enabled in order to view the column names during data retrieval.

set hive.cli.print.header=true;
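As a quick illustration (a sketch; the Emp table from a later post is borrowed here), enabling the property makes the CLI print a header row with the column names above the results:

set hive.cli.print.header=true;
-- The result set below is now preceded by a row listing the column names
SELECT empID, Ename FROM Emp LIMIT 5;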


Showing the database name in Hive Prompt

Unlike in traditional RDBMS systems, it is difficult to know which database we are currently working in because of Hive's command-line interface; it always shows just the hive> prompt.

However, there is a trick!

Set a property to print the current database as part of the prompt so you always know which database you are in.

set hive.cli.print.current.db=true;
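Once the property is set, the prompt itself carries the database name, roughly like this (illustrative session; the database name is assumed):

set hive.cli.print.current.db=true;
-- The prompt changes from "hive>" to something like:
--   hive (default)> USE retaildb;
--   hive (retaildb)>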


Skipping First and Footer Row - Hive Internal & External Tables

Most data sets (CSV files, text files, and so forth) have a header row inside the data, and loading them into Hive tables as-is will result in NULL values. This article will show you how to deal with header rows while creating internal or external tables.

If you are creating an internal table -

CREATE TABLE IF NOT EXISTS Emp (
  empID INT,
  Ename STRING,
  Sal   DECIMAL,
  Dno   INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

 

If you are creating an external table –

CREATE EXTERNAL TABLE IF NOT EXISTS Emp (
  empID INT,
  Ename STRING,
  Sal   DECIMAL,
  Dno   INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/empdirectory'
TBLPROPERTIES ("skip.header.line.count"="1");


If you already created a table without specifying "skip.header.line.count", you can still alter the table properties using the following command.

ALTER TABLE Emp SET TBLPROPERTIES ("skip.header.line.count"="1");

 

Similarly, if you want to skip a footer line, you can add the following to the table properties.

"skip.footer.line.count"="1"

If you still see the header row when querying your table even after this, it is probably because vectorization is enabled. Disabling that feature will help solve the issue.

set hive.vectorized.execution.enabled=false; 

However, vectorization in Hive is a feature that speeds up query execution by processing a block of 1,024 rows at a time rather than reading one row at a time. When enabled, it greatly reduces CPU usage for typical query operations like scans, filters, aggregates, and joins. Whether to enable or disable this feature depends on the data volume and the type of data you are dealing with.


Friday, August 14, 2020

Hive Internal Table - With External Data

Have you ever wondered what happens if you miss the "EXTERNAL" keyword while creating an external table?

Let's check it out.

Here is my sample data. It has three columns, namely dno, dname, and location.

11, marketing, hyd
12, hr, delhi
13, finance, bang
14, retail, madras
20, db, hyd
21, mg, madras

This file has been placed in /user/cloudera/NewEmpDir as dept.txt

Let us create the external table on top of this data without using the "EXTERNAL" keyword.

 

CREATE TABLE newDept (
  Dno INT,
  Dname STRING,
  loc STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/NewEmpDir';

Surprisingly, the table will be created and the data will also be loaded without any errors.


The result behaves much like an internal (managed) table, except that the data is not stored under Hive's default warehouse directory. Note the following points (a quick way to verify follows the list):

  • Unlike other tables, this table is not visible when you browse the warehouse files through Hue, though it is visible in the Metastore Manager.
  • You can query and retrieve the data from the table.
  • If you drop this table, the source file and directory will also be deleted, because Hive treats it as a managed table.
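To confirm how Hive registered the table, you can inspect its metadata; a minimal check using the table from the example above:

-- Look for "Table Type: MANAGED_TABLE" in the output
DESCRIBE FORMATTED newDept;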


Regular Expressions - RLIKE in Hive

SQL developers and data analysts often use comparison operators in their queries, such as =, >, <, and !=.

Apart from these, LIKE, EXISTS, IN, NOT IN, NOT EXISTS, etc. are also used very frequently.

All of these help in fetching only the required data.

There is another relational operator, RLIKE, which might not be used as often as those above, but it is helpful for searching strings by pattern the way the LIKE operator does in SQL, with the full power of regular expressions. Both MySQL and Hive provide the RLIKE operator, which can be used for searching with advanced regular expressions.

REGEXP_LIKE in Oracle is equivalent to RLIKE.

Let us see what RLIKE does in Hive.

The below is my "test" table data.

Id  | somecol
----+---------
101 | 123xyz
102 | 1234
103 | 3456
104 | abcde
105 | Vxyz
106 | Bbc


SELECT * FROM test WHERE somecol RLIKE '[0-9]+';

This returns every row whose value contains at least one digit, so "123xyz" is also returned. If you want to eliminate values that contain text and return only numbers, use the following statements.

SELECT * FROM test WHERE somecol RLIKE '[0-9]+$';

This only anchors the end of the value, so anything that ends with digits qualifies. To be strict and return only values made up entirely of digits, anchor both ends:

SELECT * FROM test WHERE somecol RLIKE '^[0-9]+$';

If you want to retrieve the rows that contain no numbers at all, anchor a character class that excludes digits (a letter-only pattern such as '([a-z]|[A-Z])+' is not enough, because a value like '123xyz' contains letters too and would still be returned):

SELECT * FROM test WHERE somecol RLIKE '^[^0-9]+$';

If you want to retrieve the rows that have numbers with decimal points:

SELECT * FROM test WHERE somecol RLIKE '[0-9]+[.][0-9]+';
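As an aside, the "no digits" result can also be written as a negated match, which some find more readable; a small sketch against the same sample data:

-- Equivalent to the '^[^0-9]+$' pattern above: keep only rows that contain no digit at all
SELECT * FROM test WHERE NOT (somecol RLIKE '[0-9]');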


Please leave some comments if you need more clarification. 


Thursday, August 13, 2020

"Target-Dir" vs "Warehouse-Dir" in Sqoop

This article is about using the "--target-dir" and "--warehouse-dir" options during a Sqoop import.

Please refer to the below codes.

Code-1: Usage of "--target-dir"

sqoop import \
--connect jdbc:mysql://localhost/empinfo \
--username root \
--password cloudera \
--table emp \
--target-dir /user/hive/warehouse/empinfo

Code-2: Usage of "--warehouse-dir"

sqoop import \
--connect jdbc:mysql://localhost/empinfo \
--username root \
--password cloudera \
--table emp \
--warehouse-dir /user/hive/warehouse/empinfo

Both commands work in the same way: both 'target-dir' and 'warehouse-dir' in the above examples create the "empinfo" folder in the /user/hive/warehouse location.

The difference is that when using "target-dir", the emp data (part files) is stored directly inside "empinfo".

The path of the data will be- /user/hive/warehouse/empinfo/part-m-00000.

Warehouse-dir creates the folder named “emp” under “empinfo” and places the data in it.

The path of the data will be /user/hive/warehouse/empinfo/emp/part-m-00000.
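If you later want to query the files imported with --target-dir from Hive, one option is to lay an external table over the directory. This is only a sketch: the column list is assumed (borrowed from the Emp example earlier), so match it to the real emp schema before using it.

-- Hypothetical external table over the Sqoop output directory; columns are assumptions
CREATE EXTERNAL TABLE IF NOT EXISTS emp_imported (
  empID INT,
  Ename STRING,
  Sal   DECIMAL,
  Dno   INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/empinfo';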

 

Note the below points:

  • Target-dir works only when you import a single table, which means it won't work with "sqoop import-all-tables".
  • Warehouse-dir creates a parent directory in which all your tables are stored, in folders named after the table names.
  • If you are importing table by table, you need to provide a distinct target-dir location for each import, since the same target directory can't be reused.

Hope you like this article.


Sqoop Import All - Copying MySQL Database to Hive

When I started learning Sqoop, I ran into so many errors for which I could not get much help from forums or websites. The one below is one of them, and it killed a lot of my time.

sqoop import-all-tables \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--hive-import --hive-database retaildb \
--warehouse-dir=/user/hive/warehouse/retaildb.db \
-m 1

I wanted to import all the tables and their data into a Hive database at once, i.e., a database-to-database import.

The execution looked fine; I could see the tables being created and the data being imported. However, after the run, only a few tables had actually been created in Hive, and only some of the tables' data had been imported. To confuse me further, hardly any data showed up in Impala even after refreshing the metadata, and Hue could not open some of the files.

I spent a lot of time figuring out what I was doing wrong.

Finally I got the code right. 

I had assigned a single mapper, which caused the issue. The retail_db database is around 450 MB in size, for which a single mapper is inadequate.

sqoop import-all-tables \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--hive-import --hive-database retaildb \
--warehouse-dir=/user/hive/warehouse/retaildb.db

The above command creates the internal tables in Hive automatically, and the data is imported without any issue.
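Once the import finishes, a quick sanity check from Hive is worthwhile. A minimal sketch (it assumes the retail_db sample database contains a customers table):

-- List what landed in the Hive database and spot-check one table's row count
SHOW TABLES IN retaildb;
SELECT COUNT(*) FROM retaildb.customers;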

Import Table From Sqoop to Hive Without Data

The below command helps in creating a table in Hive based on the MySQL table’s definition. 

sqoop create-hive-table \
--connect=jdbc:mysql://localhost/retail_db \
--username=root \
--password=cloudera \
--table employees \
--hive-table emp

This will just create the table structure in Hive as "emp", based on the structure of MySQL's employees table; no data will be imported.



If you compare the two table definitions, you will see that the data types are converted automatically: VARCHAR becomes STRING and DECIMAL becomes DOUBLE in Hive.

Before implementing this, make sure the table "employees" exists in the MySQL retail_db database and the table "emp" does not exist in Hive. By default, the table will be created in the "default" database in Hive.
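To double-check the converted data types, you can ask Hive for the DDL it generated; a small sketch using the table name from the example:

-- Shows the full CREATE TABLE statement Hive generated, including the converted column types
SHOW CREATE TABLE emp;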

Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...