Wednesday, September 30, 2020

Get the unique values from multiple CTEs - Impala

Report requests often look simple and straightforward, yet finding a clean solution can be a challenge. Cloudera's Impala provides several features, and an analyst needs to be aware of them to accomplish such tasks well.

Assume that, after filtering out specific values, there are two separate result sets that need to be combined into a single output with distinct values. How would you do that?

Let's see with an example.

Below are two sample datasets:


Now the requirement is -
i) Get the products that cost over 500 AED from the "DXB_Products" table.
ii) Get the products that cost less than 500 AED from the "SHJ_Products" table.
iii) Combine both result sets and keep only the unique products.

A common table expression (CTE) is a convenient way to get the desired output. UNION (as opposed to UNION ALL) removes duplicate rows from the combined result. Here's the query -

WITH CTE1 AS (
    SELECT * FROM DXB_Products WHERE Price > 500
),
CTE2 AS (
    SELECT * FROM SHJ_Products WHERE Price < 500
)
SELECT * FROM CTE1
UNION
SELECT * FROM CTE2;
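
If you want to try the idea without an Impala cluster, the sketch below runs the same pattern against an in-memory SQLite database from Python. The sample rows are hypothetical (the original datasets are not reproduced here), and SQLite is only a stand-in for Impala; it is used because WITH and UNION behave the same way for this simple case.

```python
import sqlite3

# Hypothetical sample data; column names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DXB_Products (Product TEXT, Price INTEGER);
    CREATE TABLE SHJ_Products (Product TEXT, Price INTEGER);
    INSERT INTO DXB_Products VALUES ('Laptop', 2500), ('Mouse', 40), ('Phone', 1800);
    -- 'Mouse' appears twice on purpose, to show that UNION removes duplicates.
    INSERT INTO SHJ_Products VALUES ('Mouse', 40), ('Mouse', 40), ('Keyboard', 120);
""")

cte = """
    WITH CTE1 AS (SELECT * FROM DXB_Products WHERE Price > 500),
         CTE2 AS (SELECT * FROM SHJ_Products WHERE Price < 500)
    SELECT * FROM CTE1 {op} SELECT * FROM CTE2
"""

# UNION keeps distinct rows only; UNION ALL keeps every row.
distinct_rows = sorted(conn.execute(cte.format(op="UNION")).fetchall())
all_rows = conn.execute(cte.format(op="UNION ALL")).fetchall()

for row in distinct_rows:
    print(row)
print(len(distinct_rows), "distinct rows versus", len(all_rows), "rows with UNION ALL")
```

With this sample data, the duplicate 'Mouse' row is returned once by UNION and twice by UNION ALL.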


Hope you find this article useful.

Starting Hive & Hive metastore from command line

The following commands start the Hive metastore and HiveServer2 services from the command line, without logging into Cloudera Manager, in case they didn't start automatically.

sudo service hive-metastore start 


sudo service hive-server2 start


Do let me know if you are facing any issues.




Split equivalent in Impala

The split function splits the data based on the delimiter provided, and it is one of the most commonly used functions in Apache Hive. This function is not available in Impala. However, there is an alternative to it.

Let us first see the usage of the "split" function in Hive.

Below are a patient's blood pressure readings.

TableName: PatientsData
Systolic-Diastolic
122/80, 122/83, 130/83, 130/83, 135/86, 140/95, 147/92

SELECT split(data,'\/') as split_data from PatientsData;

Result:
split_data
122,80
122,83
130,83
130,83
135,86
140,95
147,92

SELECT split(data,'\/')[0] AS Systolic, 
               split(data,'\/')[1] AS Diastolic 
FROM PatientsData;

Result:
Systolic   Diastolic
122          80
122          83
130          83
130          83
135          86
140          95
147          92


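Let's do the same exercise in Impala using the "split_part" function. Because split_part matches the delimiter literally rather than as a regular expression, a plain '/' is enough. Note also that the index starts at 1, not 0.

SELECT split_part(data,'/',1) AS Systolic, 
               split_part(data,'/',2) AS Diastolic 
FROM PatientsData;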

Result:
Systolic   Diastolic
122          80
122          83
130          83
130          83
135          86
140          95
147          92

As per the documentation from Apache, below is the description of the function.

SPLIT_PART(STRING source, STRING delimiter, BIGINT index)

Purpose: Returns the requested indexth part of the input source string split by the delimiter.

If the index is a positive number, returns the indexth part from the left within the source string. If the index is a negative number, returns the indexth part from the right within the source string. If the index is 0, returns an error.

The delimiter can consist of multiple characters, not just a single character.

All matching of the delimiter is done exactly, not using any regular expression patterns.

Return type: STRING
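
To make the index rules concrete, here is a small Python emulation of the behaviour described above. It is only a sketch for experimenting, not Impala's implementation. In particular, returning an empty string for an out-of-range index is an assumption of this sketch, so check what your Impala version does in that case.

```python
def split_part(source: str, delimiter: str, index: int) -> str:
    """Emulate the documented SPLIT_PART rules (1-based, negative counts from the right)."""
    if index == 0:
        # The documentation says an index of 0 is an error.
        raise ValueError("index must not be 0")
    # str.split matches the delimiter literally, like SPLIT_PART (no regex).
    parts = source.split(delimiter)
    position = index - 1 if index > 0 else len(parts) + index
    if 0 <= position < len(parts):
        return parts[position]
    return ""  # assumption of this sketch for out-of-range indexes


print(split_part("122/80", "/", 1))     # first part from the left
print(split_part("122/80", "/", -1))    # first part from the right
print(split_part("a::b::c", "::", 2))   # multi-character delimiter
```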



Joins in HiveQL

    As we discussed earlier, HiveQL works with structured data, much like SQL. This doesn't mean that Hive manages only structured data; it can also process unstructured data and transform it into a readable, structured form.

    Unstructured data usually does not depend on other data files. Structured data is different: when it is imported from relational systems, normalization means that a table often has to be combined with other tables to obtain meaningful information for some columns. Hence SQL functionality such as joins, sub-queries, casting, and conversion functions is also needed on big-data platforms.

    This article focuses on the joins that are available in Hive.

    Below are the types of joins available in Hive.

    • INNER JOIN

    • LEFT OUTER JOIN

    • RIGHT OUTER JOIN

    • FULL OUTER JOIN

    I am not listing "self-join" separately because it is simply a table joined to itself using one of the join types above.

    INNER JOIN is the join type that combines two tables to return records that have matching values from both the tables.

    LEFT OUTER JOIN returns all records from the left table and the matched records from the right table.

    RIGHT OUTER JOIN is a join that returns all records from the right table, and the matched records from the left table.

    FULL OUTER JOIN displays all the records i.e. matched records and unmatched records from both the tables. 
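
The four definitions above can be sketched in a few lines of Python. This is an illustration of the semantics only, not how Hive executes joins. The `join` helper and the Emp/Dept sample rows are hypothetical, and the sketch does not model SQL NULL keys (which never match in a real join).

```python
def join(left, right, key, how="inner"):
    """Return (left_row, right_row) pairs; None stands in for a missing side."""
    result = []
    matched_right = set()
    for left_row in left:
        hits = [i for i, r in enumerate(right) if r[key] == left_row[key]]
        for i in hits:
            result.append((left_row, right[i]))
            matched_right.add(i)
        # LEFT and FULL joins keep left rows that found no match.
        if not hits and how in ("left", "full"):
            result.append((left_row, None))
    # RIGHT and FULL joins keep right rows that were never matched.
    if how in ("right", "full"):
        for i, right_row in enumerate(right):
            if i not in matched_right:
                result.append((None, right_row))
    return result


emp = [{"Ename": "Asha", "DeptNo": 10}, {"Ename": "Ravi", "DeptNo": 20},
       {"Ename": "Zed", "DeptNo": 40}]
dept = [{"DName": "Sales", "DeptNo": 10}, {"DName": "HR", "DeptNo": 20},
        {"DName": "IT", "DeptNo": 30}]

for how in ("inner", "left", "right", "full"):
    print(how, len(join(emp, dept, "DeptNo", how)))
```

With this sample data, "Zed" has no department and "IT" has no employee, so the inner join drops both, the left join keeps "Zed", the right join keeps "IT", and the full join keeps both.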

    Not only the definitions but also the syntax is the same as in SQL.

    Please do let me know if you want to see how it is to be implemented in Hive.


    Can we create a table based on a view in Hive?

    Yes, we can!

    Let's create a view by joining the Emp and Dept tables.

    -- To create a view

    CREATE VIEW Emp_View
    AS
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;

    Now let's create the table based on the view.

    CREATE TABLE EmpDept
    AS
    SELECT * FROM Emp_View;
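
The difference between the view and the table built from it is easy to see in a quick experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, with hypothetical Emp and Dept rows. The point it illustrates is that CREATE TABLE ... AS SELECT copies the rows at creation time, whereas the view is evaluated each time it is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample tables; column names follow the queries above.
    CREATE TABLE Emp (Ename TEXT, DeptNo INTEGER);
    CREATE TABLE Dept (DeptNo INTEGER, DName TEXT);
    INSERT INTO Emp VALUES ('Asha', 10), ('Ravi', 20);
    INSERT INTO Dept VALUES (10, 'Sales'), (20, 'HR');

    CREATE VIEW Emp_View AS
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;

    CREATE TABLE EmpDept AS
    SELECT * FROM Emp_View;
""")

# A row added after the table was created shows up in the view only.
conn.execute("INSERT INTO Emp VALUES ('Meera', 20)")
view_count = conn.execute("SELECT COUNT(*) FROM Emp_View").fetchone()[0]
table_count = conn.execute("SELECT COUNT(*) FROM EmpDept").fetchone()[0]
print("rows in view:", view_count, "| rows in table:", table_count)
```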



    Creating a table and a view with the select statement.

    In this article, you'll learn how to create a view and table based on a select statement.


    -- To create a view

    CREATE VIEW Emp_View 
    AS
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;


    -- To create a table

    CREATE TABLE EmpDept 
    AS 
    SELECT Ename, DName FROM Emp
    INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;


    To create a table based on an existing view, see the post "Can we create a table based on a view in Hive?".


    Hive - Extended Properties

    In this article, you will learn how to list out the properties of a database or a table in Hive.

    DATABASE LEVEL:

    DESCRIBE DATABASE db_name;
    DESCRIBE SCHEMA db_name;

    DATABASE and SCHEMA mean the same thing in Hive, so the two keywords can be used interchangeably.

    DESCRIBE DATABASE EXTENDED db_name;
    Use the above command to list all the database properties attached to a particular database in Hive.


    TABLE LEVEL:

    DESC TableName;
    DESCRIBE TableName;
    Use either of the above commands to get the schema of the table.

    DESC EXTENDED TableName;
    Use the above command to get detailed information about the table, including comments, last modified date, etc., along with the table's definition.

    DESC FORMATTED TableName;
    Use the above command to get the summary and details of the specified table in a formatted layout.


    Hive Table Properties

    The TBLPROPERTIES clause enables you to use your own metadata key/value pairs to tag the table definition.

    There are also several predefined table properties, such as last-modified-user and last-modified-time, which Hive automatically adds and manages. 

    To view the properties of a table, use the command below at the Hive prompt.

    SHOW TBLPROPERTIES tblname;

    This lists all the properties of the table. 

    If the table's input format is ORC (refer to the input file formats), you'll see which compression codec (Snappy or Zlib) has been chosen. You'll also see whether the transactional property is set to true or false, along with the predefined table properties that are managed by Hive.



    Word Count in HiveQL - Explode and Split Usage

    This article aims to explain the usage of the SPLIT function in HiveQL.


    Let's create a staging table to load the data temporarily.
    CREATE TABLE tempData (col1 STRING);

    Load the data to the table.
    LOAD DATA INPATH 'Desktop/DataFile' OVERWRITE INTO TABLE tempData;

    To split the data from the staging table created above into words and count them:
    SELECT word, count(1) AS count FROM
    (SELECT explode(split(col1, '\\s')) AS word FROM tempData) temp
    GROUP BY word
    ORDER BY word;

    The split function splits the data based on the delimiter provided (a regular expression in Hive) and returns an array. The backslash is doubled because Hive treats it as an escape character inside string literals. The explode function then turns each element of that array into a row of its own. Let's see what these explode and split functions are doing with another example.
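
For readers who want to see the same pipeline outside Hive, here is a rough Python equivalent of the word-count query. The `split` and `explode` helpers are hypothetical stand-ins written for this sketch, and the sample lines are made up. Split produces one list per input line, explode flattens those lists into one word per row, and the counter plays the role of GROUP BY.

```python
import re
from collections import Counter

# Made-up input lines standing in for the rows of tempData.
lines = ["big data and sql", "hive and impala", "sql on big data"]


def split(text, pattern):
    """Like Hive's split: break a string on a regular expression."""
    return re.split(pattern, text)


def explode(arrays):
    """Like Hive's explode: emit one row per array element."""
    for array in arrays:
        yield from array


words = explode(split(line, r"\s") for line in lines)
# GROUP BY word with count(1), then ORDER BY word.
counts = sorted(Counter(words).items())

for word, count in counts:
    print(word, count)
```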

    Below are a patient's blood pressure readings.

    TableName: PatientsData
    Systolic-Diastolic
    122/80, 122/83, 130/83, 130/83, 135/86, 140/95, 147/92

    SELECT split(data,'\/') as split_data from PatientsData;

    Result:
    split_data
    122,80
    122,83
    130,83
    130,83
    135,86
    140,95
    147,92

    SELECT split(data,'\/')[0] AS Systolic, split(data,'\/')[1] AS Diastolic from PatientsData;

    Result:
    Systolic   Diastolic
    122          80
    122          83
    130          83
    130          83
    135          86
    140          95
    147          92

    SELECT explode(split(data,'\/')) as exploded_data from PatientsData;

    Result:
    exploded_data
    122
    80
    122
    83
    130
    83
    130
    83
    135
    86
    140
    95
    147
    92

    Hope you understood the behavior of the function with the examples.



    Tuesday, September 29, 2020

    File formats and compression in Apache Hive

    The purpose of this article is to address the different file formats and compression codecs in Apache Hive that are available for different data sets. We will also explore how to use them properly and when to use them.

    HiveQL works with structured data, much like SQL. Hive keeps the table metadata in a metastore, which uses an embedded Derby database by default. The table data itself is stored as files in the underlying storage layer and is presented in a structured format when it is retrieved. Hive can handle several file formats, such as:

    • Text File Format

    • Sequence File Format

    • RCFile (Record Columnar File format)

    • Avro Files

    • ORC Files (Optimized Row Columnar file format)

    • Parquet

    • Custom INPUTFORMAT and OUTPUTFORMAT


    1) TEXT FILE FORMAT:
     
    The text file format is Hive's default storage format. It is used to load data from comma-separated values (CSV), tab-delimited, or space-delimited files, or from text files that are delimited by other special characters. You can use the text format to exchange data with other client applications, since it is supported by most of them. Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n). 

    The text file format storage option is defined by specifying "STORED AS TEXTFILE" at the end of the table creation.

    2) SEQUENCE FILE FORMAT: 

    Sequence files are flat files consisting of binary key-value pairs. When converting queries to MapReduce jobs, Hive uses the necessary key-value pairs for a given record. A key advantage of the sequence file format is that it can combine two or more files into one file.

    The sequence file format storage option is defined by specifying  "STORED AS SEQUENCEFILE" at the end of the table creation.

    3) RCFILE FORMAT:

    The RCFile (Record Columnar File) format is very similar to the sequence file format. It is a data placement structure designed for MapReduce-based data warehouse systems. It also stores the data as key-value pairs and offers a high row-level compression rate. It is used when operations need to be performed on multiple rows at a time. The RCFile format is supported by Hive version 0.6.0 and later.

    The RC file format storage option is defined by specifying "STORED AS RCFILE" at the end of the table creation.

    4) AVRO FILE FORMAT:

    Hive version 0.14.0 and later support Avro files. Avro is a row-based storage format for Hadoop that is widely used as a serialization platform. It is a remote procedure call and data serialization framework that uses JSON to define data types and protocols, and it serializes data in a compact and efficient binary format. This file format can be used with any of Hadoop's tools, such as Pig and Hive.

    Avro is one of the common file formats in applications based on Hadoop. The Avro file format storage option is defined by specifying "STORED AS AVRO" at the end of the table creation.

    5) ORC FILE FORMAT:

    The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data in a Hive table. This file format was designed to overcome the limitations of the other Hive file formats. ORC reduces I/O overhead by accessing only the columns that are required for the current query. It requires significantly fewer seek operations because all columns within a single group of row data are stored together on disk.

    The ORC file format storage option is defined by specifying "STORED AS ORC" at the end of the table creation.
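
Why reading only the required columns saves I/O can be shown with a toy model. The Python sketch below is a deliberate simplification with made-up data. It is not ORC's actual on-disk layout; it only contrasts a row-oriented arrangement with a column-oriented one for a query that needs a single column.

```python
# Made-up table with three columns.
rows = [
    {"id": 1, "name": "Laptop", "price": 2500},
    {"id": 2, "name": "Mouse", "price": 40},
    {"id": 3, "name": "Phone", "price": 1800},
]

# Row-oriented layout: all values of one record sit next to each other.
row_layout = [value for row in rows for value in row.values()]

# Column-oriented layout: all values of one column sit next to each other.
column_layout = {name: [row[name] for row in rows] for name in rows[0]}

# For something like SELECT SUM(price), a row layout has to pass over
# every stored value, while a columnar layout reads just one column.
values_read_row = len(row_layout)
values_read_columnar = len(column_layout["price"])
total = sum(column_layout["price"])

print("values scanned, row layout:", values_read_row)
print("values scanned, columnar layout:", values_read_columnar)
print("sum of price:", total)
```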

    6) PARQUET:

    Parquet is a column-oriented binary file format. It is open source, available to any project in the Hadoop ecosystem, and designed to store data in an efficient flat columnar layout, in contrast to row-based files such as CSV or TSV. A query reads only the columns it needs, which significantly reduces I/O and makes the format highly efficient for large-scale queries. Parquet tables are commonly compressed with Snappy, a fast data compression and decompression library; whether Snappy is the default codec depends on the engine and version that writes the files.

    The parquet file format storage option is defined by specifying "STORED AS PARQUET" at the end of the table creation.

    7) CUSTOM INPUTFORMAT & OUTPUTFORMAT:

    We can implement our own "inputformat" and "outputformat" in case the data comes in a different format. These "inputformat" and "outputformat" classes are similar to Hadoop MapReduce's input and output formats. 

    In upcoming posts, we will explore more with examples for each and every file format. 

    Have a nice day..!!

