Friday, October 2, 2020

SPACE function in Cloudera's Impala

There are some of the functions that hardly come into use. Often this kind of function would have some other replacement that we use. SPACE is one of the functions introduced by Cloudera's Impala, which returns the concatenated string to the specified number of spaces.

Let's see what it does.

SELECT SPACE(10);
Result:
SPACE(10)
---------------
'       '


SELECT CONCAT('firstname', space(1), 'lastname');

Result:
contact('firstname', space(1), 'lastname')
------------------------------------------------
firstname lastname



TRANSLATE & REPLACE functions in Cloudera's Impala

This article introduces the new TRANSLATE and REPLACE string functions available from version 2.9.0 of Cloudera Impala. Both of these functions look identical, shifting letters from one to the other. There is a major difference between them, however. 

Let's see how close these functions are:

SELECT REPLACE ('Flat_720', '_', '#');
SELECT TRANSLATE ('Flat_720','_','#');

Result:
Both returns "Flat#720'

Let's see what documentation says about these functions.

The purpose of the 'REPLACE' function is that it returns the initial argument with all occurrences of the target string replaced by the replacement string. 

Let's dig into more. 

This is equivalent to the REPLACE function that is available in most of the RDBMS platforms (MySQL, Oracle & SQL Server). The matching is case-sensitive. If any argument is NULL, the return value is NULL.

Examples:
SELECT REPLACE('MySQL is a free software', 'MySQL', 'Oracle MySQL') As Txt;

Result:
Txt
-----------------
Oracle MySQL is a free software

SELECT REPLACE('obrocodobro','o','a') AS replaced_string;

Result:
replaced_string
--------------------
abracadabra

If no match found, the original string is returned unchanged.
SELECT REPLACE('SQL Bank', 'Code','Repository') As Output;

Result:
Output
---------
'SQL Bank'

TRANSLATE: Returns the input string with each character in the from argument replaced with the corresponding character in the to argument. The characters are matched in the order they appear in from and to.

Example:
SELECT TRANSLATE ('hello world','world','earth') as Output

Result:
Output
---------------------
hetta earth

It translates letter by letter. If you look at the above example, the word "world" is replaced by "earth". In addition, it replaced the fourth letter 'L' from the string "world" to the fourth letter "T" from the word "earth". The change is applied to the entire string rather than just the input string. Similarly, the second letter 'o' is replaced from "world" to the second letter 'a' from 'earth'. With this example, it is obvious that we cannot use 'TRANSLATE' instead of 'REPLACE.'

Where we use 'TRANSLATE' then? Let's see.

SELECT TRANSLATE('3*[2+1]/{8-4}', '[]{}', '()()') as Output;

Result:
Output
-------------------
3*(2+1)/(8-4)




Wednesday, September 30, 2020

Chaining Common Table Expressions in Impala

This article will help you to learn how to implement chaining common table expressions in Cloudera's Impala.

-~ Chaining CTEs

WITH
Temp1 AS (
         SELECT OrderID,
                         OrderStatus,
                         ProductID
          FROM temp2
          WHERE order_status = 'completed'
),
Temp2 AS (
          SELECT OrderID,
                          OrderStatus,
                          ProductID
          FROM Orders
          WHERE ProductID = 5
)
SELECT * FROM (SELECT COUNT(1) FROM Temp1) a;

Technically the query doesn't make any sense, it can be written in a straight-forward and in a simple way -

SELECT COUNT(1)
FROM Orders
WHERE ProductID = 5
AND order_status='completed'  

However, observe the structure or the pattern. In order to execute the first common table expression, it needs the values from the second expression. 

Get the unique values from multiple CTEs - Impala

Often, some requests for reports seem simple and straightforward. But it is challenging to find a better solution to be applied. Cloudera's Impala provides several features, and in order to accomplish the tasks in the best possible way, an analyst needs to be aware of them. 

Assume that, after filtering out specific values, there are two separate result sets that need to be combined into a single output with distinct values. How would you do that?

Let's see with an example.

Below are two sample datasets:


Now the requirement is -
i) Get the products that cost over 500 AED from "DXB_Products" table  
ii) Get the products from SHJ_Products that cost less than 500 AED.
iii) Now combine both the resultsets and filter-out unique products.

Common table expression is the best way to get the desired output. Here's the query -

WITH CTE1 AS (
           SELECT * FROM DXB_Products WHERE Price > 500
                            ),
           CTE2 AS (
           SELECT * FROM SHJ_Products WHERE Price > 500
                           )
SELECT * FROM CTE1 UNION SELECT * FROM CTE2;


Hope you find this article useful.

Starting Hive & Hive metastore from command line

The following commands will help in starting the services of Hive and Hive metastore in case they didn't start automatically without logging into Cloudera manager. 

sudo service hive-metastore start 


sudo service hive-server2 start


Do let me know if you are facing any issues.




Split equivalent in Impala

Split function splits the data based on the delimiter provided and it is mostly used function in Apache Hive. This function is not available in Impala. However, there is an alternative to it.

Let us first see the usage of the "split" function in Hive.

Below is the patient's blood pressure variations information.

TableName: PatientsData
Systolic-Diastolic
122/80, 122/83, 130/83, 135/86, 140/95, 147/92

SELECT split(data,'\/') as split_data from PatientsData;

Result:
split_data
122,80
122,83
130,83
130,83
135,86
140,95
147,92

SELECT split(data,'\/')[0] AS Systolic, 
               split(data,'\/')[1] AS Diastolic 
FROM PatientsData;

Result:
Systolic   Diastolic
122          80
122          83
130          83
130          83
135          86
140          95
147          92


Let's do the same exercise in Impala using "split_part" function.

SELECT split_part(data,'\/',1) AS Systolic, 
               split_part(data,'\/',2) AS Diastolic 
FROM PatientsData;

Result:
Systolic   Diastolic
122          80
122          83
130          83
130          83
135          86
140          95
147          92

As per the documentation from Apache, below is the description of the function.

SPLIT_PART(STRING source, STRING delimiter, BIGINT index)

Purpose: Returns the requested indexth part of the input source string split by the delimiter.

If the index is a positive number, returns the indexth part from the left within the source string. If the index is a negative number, returns the indexth part from the right within the source string. If the index is 0, returns an error.

The delimiter can consist of multiple characters, not just a single character.

All matching of the delimiter is done exactly, not using any regular expression patterns.

Return type: STRING



Joins in HiveQL

    As we discussed earlier, HiveQL handles structured data only, much like SQL. This doesn't mean that Hive just manages structured data, it also processes and transforms the unstructured data into a readable structured way. 

    Unstructured data is usually not dependent on static data or other data files. But in structured data, especially if the data is imported from relational systems, due to normalization, the tables may be connected with other tables to obtain meaningful information for a few columns. Hence functionalities of SQL will also be needed in big-data platforms such as Joins, Sub-queries, casting, and conversion functions.

    This article focuses on the joins that are available in Hive.

    Below is the type of joins available in Hive.

    • INNER JOIN

    • LEFT OUTER JOIN

    • RIGHT OUTER JOIN

    • FULL OUTER JOIN

    I am not specifying "self-join" explicitly because it is also an inner join. 

    INNER JOIN is the join type that combines two tables to return records that have matching values from both the tables.

    LEFT OUTER JOIN returns all records from the left table and the matched records from the right table.

    RIGHT OUTER JOIN is a join that returns all records from the right table, and the matched records from the left table.

    FULL OUTER JOIN displays all the records i.e. matched records and unmatched records from both the tables. 

    Not just the description but the implementation is also as same as in SQL.

    Please do let me know if you want to see how it is to be implemented in Hive.


    Can we create a table based on a view in HIve?

    Can we create a table based on a view in Hive?

    Yes, we can.!!

    Let's create "View" by combining Emp and Dept tables.

    -~ To create a view

    CREATE VIEW Emp_View
    AS
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;

    Now let's create the table based on the view.

    CREATE TABLE EmpDept
    AS
    SELECT * FROM Emp_View;



    Creating a table and a view with the select statement.

    In this article, you'll learn how to create a view and table based on a select statement.


    -~ To create a view

    CREATE VIEW Emp_View 
    AS
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;


    -~ To create a table

    CREATE TABLE EmpDept 
    AS 
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;


    Can we create a table based on a view? Click here.


    Hive - Extended Properties

    In this article, you will learn how to list out the properties of a database or a table in Hive.

    DATABASE LEVEL:

    DESCRIBE DATABASE db_name;
    DESCRIBE SCHEMA db_name;

    Database or schema both are the same thing. These words can be used interchangeably.

    DESCRIBE DATABASE EXTENDED db_name;
    Use the above command to list all the database properties attached to a particular database in Hive.


    TABLE LEVEL:

    DESC TableName
    DESCRIBE TableName
    Use the above command to get the schema of the table.

    DESC EXTENDED TableName
    Use the above command to get detailed information about the table that includes comments, last modified date, etc along with the table's definition.

    DESC FORMATTED TableName
    Use the above command to get the summary, details, and formatted information about the specified table.


    Big Data & SQL

    Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...