SQLBank: Apache Hive Analytical Functionns

Thursday, August 13, 2020

Hive Analytical Functions

The below is the compiled list of aggregate, analytical & advanced functions in Apache Hive. Some of them are widely used ones which we will be discussed in detail in the upcoming articles.

• Standard aggregations: Such as COUNT(), SUM(), MIN(), MAX(), or AVG().

• RANK: The RANK analytics function is used to assign a rank to the rows based on the column values in OVER clause. The row with equal values assigned the same rank with next rank value skipped.

• DENSE_RANK: The DENSE_RANK analytics function in hive used to assign a rank to each row. The rows with equal values receive the same rank and this rank assigned in the sequential order so that no rank values are skipped.

• ROW_NUMBER: It assigns a unique sequence number starting from 1 to each row according to the partition and order specification.

• CUME_DIST: It computes the number of rows whose value is smaller or equal to the value of the total number of rows divided by the current row. This function stands for cumulative distribution. It computes the relative position of a column value in a group. Here, we can calculate the cumulative distribution of salaries among all departments.

• PERCENT_RANK: It is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator as a total number of rows – 1 divided by current rank – 1. Therefore, it returns the percent rank of a value relative to a group of values.

• NTILE: It divides an ordered data set into a number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.

• LEAD: The LEAD function, lead(value_expr[,offset[,default]]), is used to return data from the next row. The number (value_expr) of rows to lead can optionally be specified.

If the number of rows (offset) to lead is not specified, the lead is one row by default. It returns [,default] or null when the default is not specified and the lead for the current row extends beyond
the end of the window.

• LAG: The LAG function, lag(value_expr[,offset[,default]]), is used to access data from a previous row. The number (value_expr) of rows to lag can optionally be specified. If the number of rows (offset) to lag is not specified, the lag is one row by default. It returns [,default] or null when the default is not specified and the lag for the current row extends beyond the end of the window.

• ROUND: rounds a number to a specified number of decimal places

• FLOOR: rounded up any positive or negative decimal value down to the next least integer value.

• CEIL: is used to get the smallest integer which is greater than, or equal to, the specified numeric expression.

• RLIKE: is a relational operator (unlike arithmetic operators such as =, >, <, !=, etc.) is similar to LIKE in SQL. Another relational operator which is equal to RLIKE is, A REGEXP B.

• Regexp_extract: it returns the string extracted using the pattern.

• APPX_MEDIAN: An aggregate function that returns a value that is approximately the median (midpoint) of values in the set of input values.

• InitCap: Used for Proper Case.

• Concat: Concatenation can be done using ‘||’ symbol to join strings/columns.

• STDDEV: is to calculate standard deviation.

• Variance: An aggregate function that returns the variance of a set of numbers. This is a mathematical property that signifies how far the values spread apart from the mean. The return value can be zero (if the input is a single value, or a set of identical values), or a positive number otherwise.

• Var_POP: Population variance

• Var_SAMP: Returns the unbiased sample variance

• STDDEV_POP: Population Standard Deviation

• STDDEV_SAMP: Sample Standard Deviation

• COVAR_POP: returns the population covariance of a pair of numeric columns in the group

• CORR: Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group.

• PERCENTILE: Returns the exact pth percentile of a column in the group

• PERCENTILE_APPROX: Returns an approximate pth percentile of a numeric column (including floating point types) in the group

• HISTOGRAM_NUMERIC: Returns a histogram of a numeric column in the group using b non-uniformly spaced bins.

• COLLECT_SET: Returns a set of objects with duplicate elements eliminated

• COLLECT_LIST: Returns a list of objects with duplicates.

• CONCAT_WS: Used for concatenation with specified delimiter.

Thursday, August 13, 2020

Hive Analytical Functions

Big Data & SQL