APPROX_MEDIAN is an approximate inverse distribution function that accept a nonstop/continuous dispersion model. It takes a numeric or datetime value and returns an estimated middle value or an approximate interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.
In short, median is the middle value of a set of ordered data.
Median = {(n + 1) ÷ 2}th value
n is the number of values in a set of data.
This function is available few RDBMSs like Oracle SQL*Plus and Cloudera Impala, also in Hive, we can achieve it using PERCENTILE function. If the function is not available in the RDBMS in which you work, we still can get the approximate median value in simple steps which we will discuss later.
Click here to get the “Emp” dataset from my previous post if the table and data not exists in your database.
Let's see how we implement it in Cloudera Impala first.
SELECT appx_median(sal) FROM emp;
SELECT DeptID, appx_median(sal) FROM emp GROUP BY DeptID;
The same can be accomplished in Hive with a different function.
SELECT DeptID, PERCENTILE(CAST(sal AS INT),0.5) FROM emp GROUP BY DeptID;
Let's try in Oracle SQL*Plus 12c
SELECT department_id "Department",
APPROX_MEDIAN(salary DETERMINISTIC) "Median Salary"
FROM employees
GROUP BY department_id
In SQL Server:
The below will work only if the compact mode is 110 or higher
select
percentile_cont(0.25) within group(order by sal) over () as percentile_cont_25,
percentile_cont(0.50) within group(order by sal) over () as percentile_cont_50,
percentile_cont(0.75) within group(order by sal) over () as percentile_cont_75,
percentile_cont(0.95) within group(order by sal) over () as percentile_cont_95
from emp;
In case of MySQL, there are many ways to calculate the median value. The workarounds can be found here.
Hope you find this article useful in calculating approximate median in Big Data technologies like Cloudera Impala, Apache Hive and various traditional RDBMSs.