SQLBank: split function in hive and impala

Wednesday, September 30, 2020

Split equivalent in Impala

Split function splits the data based on the delimiter provided and it is mostly used function in Apache Hive. This function is not available in Impala. However, there is an alternative to it.

Let us first see the usage of the "split" function in Hive.

Below is the patient's blood pressure variations information.

TableName: PatientsData
Systolic-Diastolic
122/80, 122/83, 130/83, 135/86, 140/95, 147/92

SELECT split(data,'\/') as split_data from PatientsData;

Result:
split_data
122,80
122,83
130,83
130,83
135,86
140,95
147,92

SELECT split(data,'\/')[0] AS Systolic,
split(data,'\/')[1] AS Diastolic
FROM PatientsData;

Result:
Systolic Diastolic
122 80
122 83
130 83
130 83
135 86
140 95
147 92

Let's do the same exercise in Impala using "split_part" function.

SELECT split_part(data,'\/',1) AS Systolic,
split_part(data,'\/',2) AS Diastolic
FROM PatientsData;

Result:
Systolic Diastolic
122 80
122 83
130 83
130 83
135 86
140 95
147 92

As per the documentation from Apache, below is the description of the function.

SPLIT_PART(STRING source, STRING delimiter, BIGINT index)

Purpose: Returns the requested indexth part of the input source string split by the delimiter.

If the index is a positive number, returns the indexth part from the left within the source string. If the index is a negative number, returns the indexth part from the right within the source string. If the index is 0, returns an error.

The delimiter can consist of multiple characters, not just a single character.

All matching of the delimiter is done exactly, not using any regular expression patterns.

Return type: STRING

Wednesday, September 30, 2020

Split equivalent in Impala

Big Data & SQL