Wednesday, September 30, 2020

Word Count in HiveQL - Explode and Split Usage

This article aims to explain the usage of the SPLIT function in HiveQL. If you are looking for a similar function in SQL Server, then please click here.


Let's create a staging table to load the data temporarily.
CREATE TABLE tempData (col1 STRING);

Load the data to the table.
LOAD DATA INPATH 'Desktop/DataFile' OVERWRITE INTO TABLE tempData;

To split the data from the above-created temp table 
SELECT word, count(1) AS count FROM
(SELECT explode(split(col1, '\s')) AS word FROM tempData) temp
GROUP BY word
ORDER BY word;

Split function splits the data based on the delimiter provided. The Explode function will further split the data into smaller chunks. Let's see what these explode and split functions are doing with another example.

Below is the patient's blood pressure variations information.

TableName: PatientsData
Systolic-Diastolic
122/80, 122/83, 130/83, 135/86, 140/95, 147/92

SELECT split(data,'\/') as split_data from PatientsData;

Result:
split_data
122,80
122,83
130,83
130,83
135,86
140,95
147,92

SELECT split(data,'\/')[0] AS Systolic, split(data,'\/')[1] AS Diastolic from PatientsData;

Result:
Systolic   Diastolic
122          80
122          83
130          83
130          83
135          86
140          95
147          92

SELECT explode(split(data,'\/')) as exploded_data from PatientsData;

Result:
exploded_data
122
80
122
83
130
83
130
83
135
86
140
95
147
92

Hope you understood the behavior of the function with the examples.

If you are looking for a word-count program using SQL Server, then click here.
If you are looking for a word-count program using Pig, then click here.


No comments:

Post a Comment

Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...