In my previous posts on missing functions i covered SQL:2008 DENSE_RANK and CUME_DIST & NTILE.
This time i'll talk about additional functions based on sorted sets of values, SQL:2008's PERCENTILE_DISC and PERCENTILE_CONT plus a MEDIAN calculation.
The CUME_DIST function computes the cumulative distribution of a set of values. The inverse operation to find the value for a given percentile is done using PERCENTILE_DISC.
According to SQL:2008 PERCENTILE_DISC(x) is the first value with a CUME_DIST greater than or equal to x.
This directly translates to the row with a
ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY order_col)
= CEILING(COUNT(*) OVER (PARTITION BY part_col) * x
There's no CEILING function prior to TD13.10, but this returns the same value:
CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
Caution:
This calculation will not work for x = 0, but you better use good old MIN instead of PERCENTILE_DISC(0).
-- SQL:2008 PERCENTILE_DISC aggregate function
SELECT
part_col
,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) --aggregate function
FROM tab
GROUP BY
part_col;
-- Teradata equivalent
SELECT
part_col
,data_col
FROM tab
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= CAST(x * COUNT(*) OVER (PARTITION BY part_col) + 0.999999999999999 AS INT);
PERCENTILE_DISC is also defined as OLAP function adding the OVER clause. This requires moving the QUALIFY-condition into a CASE and another nested OLAP function:
-- SQL:2008 PERCENTILE_DISC OLAP function
SELECT
part_col
,data_col
,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) OVER (PARTITION BY part_col)
FROM tab
-- Teradata equivalent
SELECT
part_col
,data_col
,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_DISC
FROM
(
SELECT
part_col
,data_col
,CASE
WHEN ROW_NUMBER () OVER (PARTITION BY part_col ORDER BY data_col)
= CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
THEN data_col
END AS pc
FROM tab
) AS dt
Whereas PERCENTILE_DISC returns one of the existing values, PERCENTILE_CONT is based on a linear interpolation between two consecutive values. First you calculate a hypothetical row number rn = (x * (n - 1)) + 1, where x is the percentile and n is the number of rows per group. If rn has no fractional part then you already found the result row, else you have to consider the following row, too:
-- SQL:2008 PERCENTILE_CONT aggregate function
SELECT
part_col
,PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY order_col)
FROM tab
GROUP BY
part_col;
-- Teradata equivalent
SELECT
part_col
,data_col
+ ((MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col)
* (((COUNT(*) OVER (PARTITION BY part_col) - 1) * x) MOD 1)) AS percentile_cont
FROM tab
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= CAST((COUNT(*) OVER (PARTITION BY part_col) - 1) * x AS INT) + 1;
-- SQL:2008 PERCENTILE_CONT OLAP function
SELECT
part_col,
data_col,
PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY data_col)
OVER (PARTITION BY part_col)
FROM tab
-- Teradata equivalent
SELECT
part_col
,data_col
,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_CONT
FROM
(
SELECT
part_col
,data_col
,COUNT(*) OVER (PARTITION BY part_col) - 1 AS N
,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff
,CASE
WHEN rowno = CAST(N * x AS INT)
THEN data_col +(((N * x) MOD 1) * diff)
END AS pc
FROM tab
) AS dt
The CASE technique can also be used to calculate multiple percentiles in a single query, e.g. the Quartiles as aggregate functions:
SELECT part_col
,MIN(pc25) OVER (PARTITION BY part_col) AS quartile_1
,MIN(pc50) OVER (PARTITION BY part_col) AS quartile_2
,MIN(pc75) OVER (PARTITION BY part_col) AS quartile_3
FROM
(
SELECT
part_col
,COUNT(*) OVER (PARTITION BY part_col) - 1 AS N
,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff
,CASE
WHEN rowno = CAST(N * 0.25 AS INT)
THEN data_col +(((N * 0.25) MOD 1) * diff)
END AS pc25
,CASE
WHEN rowno = CAST(N * 0.50 AS INT)
THEN data_col +(((N * 0.50) MOD 1) * diff)
END AS pc50
,CASE
WHEN rowno = CAST(N * 0.75 AS INT)
THEN data_col +(((N * 0.75) MOD 1) * diff)
END AS pc75
FROM tab
QUALIFY rowno = CAST(N * 0.25 AS INT)
OR rowno = CAST(N * 0.50 AS INT)
OR rowno = CAST(N * 0.75 AS INT)
) AS dt
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY part_col) = 1
Omitting both QUALIFY clauses will return the quartiles as OLAP functions.
The MEDIAN is just the special case of the 50th percentile.
In SQL:2008 there's no reason to define an extra funcion for that, but above formula can be simplified for a median calculation:
If the number of rows N is odd, then the median value is found in row (N/2)+1 else it's the average of this row's and the previous row's value:
-- MEDIAN aggregate function
SELECT
part_col
,MEDIAN(data_col)
FROM tab
GROUP BY part_col;
-- Teradata equivalent
SELECT
part_col
,CASE
WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col
ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING)
END
FROM tab
QUALIFY
ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= COUNT(*) OVER (PARTITION BY part_col) / 2 + 1;
-- MEDIAN OLAP function
SELECT
part_col
,MEDIAN(data_col) OVER (PARTITION BY part_col)
FROM tab
GROUP BY part_col;
-- Teradata equivalent
SELECT
part_col
,MIN(med) OVER (PARTITION BY part_col) AS "MEDIAN"
FROM
(
SELECT
part_col
,CASE
WHEN ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= COUNT(*) OVER (PARTITION BY part_col) / 2 + 1
THEN
CASE
WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col
ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING)
END
END AS med
FROM tab
) AS dt
This MEDIAN calculation returns the same result as the correspondent function in Excel and PERCENTILE_CONT reproduces Excel's PERCENTILE function.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.