Please refer to my previous post in which
schema and data for EMP and DEPT tables available.
In this article we are about
to discuss about SQL Server Analytical and Window functions. As stated in Microsoft
docs, analytic
functions calculate an aggregate value based on a group of rows. Unlike
aggregate functions, however, analytic functions can return multiple rows for
each group. Use analytic functions to compute moving averages, running totals,
percentages or top-N results within a group.
These functions are common in most of the RDBMS applications and
are widely used by the data and business analysts.
NTILE
It divides/distributes an ordered data
set (or partition) into a specified number of groups which we call it buckets
and assigns an appropriate (bucket) number to each row. The bucket number will
represent each row to which bucket it belongs to.
In other words, it is used to divide rows
into equal sets and assign a number to each row.
SELECT Ename, sal, NTILE(2) OVER (ORDER BY sal DESC) Bucket FROM Emp;
SELECT Ename, sal, NTILE(5) OVER (ORDER BY sal DESC) Bucket FROM Emp;
The following query retrieves the records
from the first bucket.
SELECT * FROM (
SELECT Ename,
sal, NTILE(4) OVER (ORDER BY sal DESC) Bucket
FROM Emp
) EmpAlias
WHERE Bucket=1;
ROW NUMBER:
This function represents each row with a
unique and sequential value based on the column used in OVER clause. Here, we
are having 10 rows in our Emp table and will use ROW_NUMBER on these records.
This can also be used to assign a serial
/row number to the rows within the provided dataset.
SELECT DeptNo, sal, ROW_NUMBER() OVER (ORDER BY sal) AS row_num FROM emp;
SELECT
DeptNo,
sal,
ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC) AS row_num
FROM emp;
RANK:
This function is used to assign a rank to
the rows based on the column values in OVER clause.
The row with equal values assigned the
same rank with next rank value skipped.
SELECT DeptNo, sal, RANK() OVER(ORDER BY sal DESC) AS rnk FROM emp;
SELECT DeptNo,
sal,
RANK() OVER(PARTITION BY DeptNo ORDER BY sal DESC) AS rnk
FROM emp;
DENSE_RANK: The DENSE_RANK analytics function used to assign a rank to each row.
The rows with equal values receive the same rank and this rank assigned in the
sequential order so that no
rank values are skipped.
SELECT
DeptNo,
sal,
DENSE_RANK() OVER(PARTITION BY DeptNo ORDER BY sal DESC) AS dns_rnk
FROM emp;
DeptNo,
sal,
DENSE_RANK() OVER(PARTITION BY DeptNo ORDER BY sal DESC) AS dns_rnk
FROM emp;
Let us use all the above functions in one
query to see the difference in the results.
SELECT DeptNo AS dept, sal AS sal,
ROW_NUMBER() OVER (PARTITION BY DeptNo ORDER BY sal DESC) AS RowNumber,
RANK() OVER (PARTITION BY DeptNo ORDER BY sal DESC) AS iRank,
DENSE_RANK() OVER(PARTITION BY DeptNo ORDER BY sal DESC) AS DenseRank
FROM emp;
CUME_DIST:
This
function stands for cumulative distribution. It computes the relative position
of a column value in a group. Here, we can calculate the cumulative
distribution of salaries among all departments. For a row, the cumulative
distribution of salary is calculated as:
SELECT DeptNo, sal, CUME_DIST() OVER (ORDER BY sal) AS cum_dist FROM emp;
SELECT DeptNo, sal, CUME_DIST() OVER (ORDER BY sal) AS cum_dist FROM emp
WHERE DEPTNO in(20,30);
CUME_DIST(salary) = Number of rows with
the value lower than or equals to salary / total number of rows in the dataset.
In the above example, due to ORDER BY
clause, 1st row from salary will be counted as 1 and it will be divided by the
total number of rows. That is 1/14 = 0.1
For the second row, it is 2/14 = 0.14;
For the 4th row and the next immediate
row too has the same value, it will be calculated as 5/14 = 0.35 and assign it
for the both rows.
Look at the outcome to understand.
PERCENT_RANK:
It is very similar to the CUME_DIST
function. It ranks the row as a percentage. In other words, it calculates the relative
rank of a row within a group of rows.
The range of values returned by
PERCENT_RANK is between 0 to 1 and first row in the dataset is always zero. This
means the return value is of the double type.
Let’s rank the salary by department wise
as percentage:
Percent_Rank = (rank decreased by 1)/(remaining
rows in the group)
SELECT DeptNo, sal,
RANK() OVER (PARTITION BY deptNo ORDER BY sal DESC) AS iRank,
CUME_DIST() OVER (PARTITION BY deptno ORDER BY sal) AS cum_dist,
PERCENT_RANK() OVER (PARTITION BY deptNo ORDER BY sal) AS perc_rnk
FROM EMP;
Go
If you observe its behavior when it
calculates the relative rank for the rows with same values, it assigns same percentage
rank value (0.75) to both of them. The behavior is similar to rank function.
Range Between & Rows Between:
These functions are called window functions
which fetches records right before and after the current record to perform the aggregation.
It is similar to lead and lag functions however a window function defines a
frame or window of rows with a given length around the current row, and
performs a calculation across the set of data in the window. Mostly these
functions will be used to get the cumulative sum/average, running or moving sum
or averages.
Examples:
SELECT DISTINCT DeptNo, --sal,
SUM(sal) OVER(ORDER BY DeptNo RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 'RangeUnbound'
FROM emp
GO
SELECT DISTINCT DeptNo, --sal,
SUM(sal) OVER(ORDER BY DeptNo ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 'RowsUnbound'
FROM emp
GO
SELECT DISTINCT EMPNO, sal,
SUM(sal) OVER(ORDER BY EmpNo ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 'RowsUnbound'
FROM emp
GO
SELECT DISTINCT EMPNO, sal,
SUM(sal) OVER(ORDER BY EmpNo RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 'RangeUnbound'
FROM emp
GO
SELECT DISTINCT Job, sal, DeptNo,
SUM(sal) OVER(ORDER BY DeptNo RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 'RangeUnbound'
FROM emp
GO
lag AND Lead FUNCTIONS:
The
LAG function gets the information from a past column, while LEAD brings
information from an ensuing line. The two functions are fundamentally the same
as one another and you can simply supplant one by the other by changing the
sort request.
SELECT ename, deptno, sal,
LEAD(sal, 1) OVER(PARTITION BY deptno ORDER BY sal) AS lead1,
LEAD(sal, 2) OVER(PARTITION BY deptno ORDER BY sal) AS lead2,
LAG(sal,1) OVER(PARTITION BY deptno ORDER BY sal) AS lag1,
LAG(sal,2) OVER(PARTITION BY deptno ORDER BY sal) AS lag2
FROM emp ORDER BY deptno,sal;
Hope this article helped you in
understanding Analytical and Window functions.
Thank you so much Shafi! This has been very helpful & so easy to understand thanks to your detailed explanation.
ReplyDelete