The below is the compiled list of aggregate, analytical & advanced functions in Apache Hive. Some of them are widely used ones which we will be discussed in detail in the upcoming articles.
• Standard aggregations: Such as COUNT(), SUM(), MIN(),
MAX(), or AVG().
• RANK: The
RANK analytics function is used to assign a rank to the rows based on the
column values in OVER clause. The row with equal values assigned the same rank
with next rank value skipped.
• DENSE_RANK: The
DENSE_RANK analytics function in hive used to assign a rank to each row. The
rows with equal values receive the same rank and this rank assigned in the
sequential order so that no rank values are skipped.
• ROW_NUMBER: It assigns a unique sequence
number starting from 1 to each row according to the partition and order
specification.
• CUME_DIST: It computes the number of rows whose
value is smaller or equal to the value of the total number of rows divided
by the current row. This function
stands for cumulative distribution. It computes the relative position of a
column value in a group. Here, we can calculate the cumulative distribution of
salaries among all departments.
• PERCENT_RANK: It is similar to CUME_DIST, but it
uses rank values rather than row counts in its numerator as a total
number of rows – 1 divided by current rank – 1. Therefore, it returns
the percent rank of a value relative to a group of values.
• NTILE: It divides an
ordered data set into a number of buckets and assigns
an appropriate bucket number to each row. It can be used to divide rows
into equal sets and assign a number to each row.
• LEAD: The LEAD function,
lead(value_expr[,offset[,default]]), is used to return data from the next
row. The number (value_expr) of rows to lead can optionally be specified.
If the number of rows
(offset) to lead is not specified, the lead is one row by default. It
returns [,default] or null when the default is not specified and the
lead for the current row extends beyond
the end of
the window.
• LAG: The LAG function,
lag(value_expr[,offset[,default]]), is used to access data from a previous
row. The number (value_expr) of rows to lag can optionally be specified.
If the number of rows (offset) to lag is not specified, the lag is one row
by default. It returns [,default] or null when the default is not
specified and the lag for the current row extends beyond the end of the window.
• ROUND: rounds a number to a specified
number of decimal places
• FLOOR: rounded up any positive or negative
decimal value down to the
next least integer value.
• CEIL: is used to get the smallest integer
which is greater than, or equal to, the specified numeric expression.
• RLIKE: is a relational operator (unlike arithmetic operators such as =,
>, <, !=, etc.) is similar to LIKE in SQL. Another relational operator
which is equal to RLIKE is, A REGEXP B.
• Regexp_extract:
it returns the string
extracted using the pattern.
• APPX_MEDIAN: An aggregate function that returns a
value that is approximately the median (midpoint) of values in the set of input
values.
• InitCap: Used for Proper Case.
• Concat: Concatenation can be done using ‘||’ symbol to join
strings/columns.
• STDDEV: is to calculate standard deviation.
• Variance: An aggregate function that returns the variance of a set of
numbers. This is a mathematical property that signifies how far the values
spread apart from the mean. The return value can be zero (if the input is a
single value, or a set of identical values), or a positive number otherwise.
• Var_POP: Population variance
• Var_SAMP: Returns the unbiased sample variance
• STDDEV_POP: Population Standard Deviation
• STDDEV_SAMP: Sample Standard Deviation
• COVAR_POP: returns the population covariance
of a pair of numeric columns in the group
• CORR:
Returns the Pearson coefficient of correlation of a pair of a numeric columns
in the group.
• PERCENTILE:
Returns the exact pth percentile of a column in the group
• PERCENTILE_APPROX: Returns an approximate pth percentile of a
numeric column (including floating point types) in the group
• HISTOGRAM_NUMERIC: Returns a histogram of a numeric column in
the group using b non-uniformly spaced bins.
• COLLECT_SET:
Returns a set of objects with duplicate elements eliminated
• COLLECT_LIST: Returns a list of objects with duplicates.
• CONCAT_WS: Used for concatenation with specified delimiter.