Missing Functions: PERCENTILE_DISC, PERCENTILE_CONT & MEDIAN

Junior Contributor

In my previous posts on missing functions I covered SQL:2008's DENSE_RANK and CUME_DIST & NTILE.

This time I'll talk about additional functions based on sorted sets of values: SQL:2008's PERCENTILE_DISC and PERCENTILE_CONT, plus a MEDIAN calculation.

PERCENTILE_DISC

The CUME_DIST function computes the cumulative distribution of a set of values; the inverse operation, finding the value for a given percentile, is done with PERCENTILE_DISC.

According to SQL:2008, PERCENTILE_DISC(x) is the first value with a CUME_DIST greater than or equal to x.

This directly translates to the row where

ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY order_col)
= CEILING(COUNT(*) OVER (PARTITION BY part_col) * x)

There's no CEILING function prior to TD13.10, but this returns the same value:

CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
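
A quick sanity check with made-up numbers: for a group of 10 rows and x = 0.25, CEILING(10 * 0.25) = CEILING(2.5) = 3, and the workaround gives CAST(2.5 + 0.999999999999999 AS INT) = CAST(3.4999... AS INT) = 3, because the cast to INT truncates the fractional part. For an exact hit like x = 0.3 (10 * 0.3 = 3) both expressions still return 3, since the added constant stays just below 1.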

Caution:

This calculation will not work for x = 0, but you should simply use good old MIN instead of PERCENTILE_DISC(0) anyway.

-- SQL:2008 PERCENTILE_DISC aggregate function
SELECT
part_col
,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) --aggregate function
FROM tab
GROUP BY
part_col;

-- Teradata equivalent
SELECT
part_col
,data_col
FROM tab
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= CAST(x * COUNT(*) OVER (PARTITION BY part_col) + 0.999999999999999 AS INT);
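
As a concrete usage example (table and column names are made up), the 90th-percentile salary per department would look like this:

-- hypothetical example: 90th percentile (x = 0.90) of salary per department
SELECT
  dept_no
  ,salary
FROM emp_salary
QUALIFY ROW_NUMBER() OVER (PARTITION BY dept_no ORDER BY salary)
      = CAST(0.90 * COUNT(*) OVER (PARTITION BY dept_no) + 0.999999999999999 AS INT);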

PERCENTILE_DISC is also defined as an OLAP function by adding an OVER clause. Emulating it requires moving the QUALIFY condition into a CASE expression and adding another nested OLAP function:

-- SQL:2008 PERCENTILE_DISC OLAP function
SELECT
part_col
,data_col
,PERCENTILE_DISC(x) WITHIN GROUP (ORDER BY data_col) OVER (PARTITION BY part_col)
FROM tab

-- Teradata equivalent
SELECT
  part_col
  ,data_col
  ,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_DISC
FROM
 (
  SELECT
    part_col
    ,data_col
    ,CASE
      WHEN ROW_NUMBER () OVER (PARTITION BY part_col ORDER BY data_col)
         = CAST(COUNT(*) OVER (PARTITION BY part_col) * x + 0.999999999999999 AS INT)
      THEN data_col
    END AS pc
  FROM tab
 ) AS dt

PERCENTILE_CONT

Whereas PERCENTILE_DISC returns one of the existing values, PERCENTILE_CONT is based on a linear interpolation between two consecutive values. First you calculate a hypothetical row number rn = (x * (n - 1)) + 1, where x is the percentile and n is the number of rows per group. If rn has no fractional part, you have already found the result row; otherwise the following row has to be considered, too, and the result is interpolated between the two values using the fractional part of rn as the weight.
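
A small worked example with made-up values: for a group containing 10, 20, 30, 40, 50 and x = 0.30, rn = 0.30 * (5 - 1) + 1 = 2.2, so the result is interpolated between the 2nd and 3rd values: 20 + 0.2 * (30 - 20) = 22. For x = 0.5 there is no fractional part (rn = 3) and the 3rd value, 30, is returned unchanged. This leads to the following queries: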

-- SQL:2008 PERCENTILE_CONT aggregate function
SELECT
part_col
,PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY data_col)
FROM tab
GROUP BY
part_col;

-- Teradata equivalent
SELECT
part_col
,data_col
+ ((MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col)
* (((COUNT(*) OVER (PARTITION BY part_col) - 1) * x) MOD 1)) AS percentile_cont
FROM tab
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= CAST((COUNT(*) OVER (PARTITION BY part_col) - 1) * x AS INT) + 1;
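
Plugging in a concrete percentile, e.g. x = 0.95 against a hypothetical response-time table (table and column names are made up), the aggregate form becomes:

-- hypothetical example: 95th percentile (x = 0.95) of response_ms per server
SELECT
  server_id
  ,response_ms
   + ((MIN(response_ms) OVER (PARTITION BY server_id ORDER BY response_ms ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - response_ms)
   * (((COUNT(*) OVER (PARTITION BY server_id) - 1) * 0.95) MOD 1)) AS percentile_cont_95
FROM response_log
QUALIFY ROW_NUMBER() OVER (PARTITION BY server_id ORDER BY response_ms)
      = CAST((COUNT(*) OVER (PARTITION BY server_id) - 1) * 0.95 AS INT) + 1;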

-- SQL:2008 PERCENTILE_CONT OLAP function
SELECT
part_col
,data_col
,PERCENTILE_CONT(x) WITHIN GROUP (ORDER BY data_col)
  OVER (PARTITION BY part_col)
FROM tab

-- Teradata equivalent
SELECT
  part_col
  ,data_col
  ,MIN(pc) OVER (PARTITION BY part_col) AS PERCENTILE_CONT
FROM
 (
  SELECT
    part_col
    ,data_col
    ,COUNT(*)      OVER (PARTITION BY part_col) - 1 AS N
    ,ROW_NUMBER()  OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
    ,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff
    ,CASE
       WHEN rowno = CAST(N * x AS INT)
       THEN data_col +(((N * x) MOD 1) * diff)
     END AS pc
  FROM tab
 ) AS dt

The CASE technique can also be used to calculate multiple percentiles in a single query, e.g. the quartiles as aggregate functions:

SELECT part_col
  ,MIN(pc25) OVER (PARTITION BY part_col) AS quartile_1
  ,MIN(pc50) OVER (PARTITION BY part_col) AS quartile_2
  ,MIN(pc75) OVER (PARTITION BY part_col) AS quartile_3
FROM
 (
  SELECT
    part_col
    ,COUNT(*)      OVER (PARTITION BY part_col) - 1 AS N
    ,ROW_NUMBER()  OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
    ,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff

    ,CASE
       WHEN rowno = CAST(N * 0.25 AS INT)
       THEN data_col +(((N * 0.25) MOD 1) * diff)
     END AS pc25
 
    ,CASE
       WHEN rowno = CAST(N * 0.50 AS INT)
       THEN data_col +(((N * 0.50) MOD 1) * diff)
     END AS pc50
 
    ,CASE
       WHEN rowno = CAST(N * 0.75 AS INT)
       THEN data_col +(((N * 0.75) MOD 1) * diff)
     END AS pc75
  FROM tab
  QUALIFY   rowno = CAST(N * 0.25 AS INT)
       OR   rowno = CAST(N * 0.50 AS INT)
       OR   rowno = CAST(N * 0.75 AS INT)
 ) AS dt
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY part_col) = 1

Omitting both QUALIFY clauses will return the quartiles as OLAP functions.
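
To double-check the quartile math with made-up values: for the nine values 1 to 9, N = 8 and rowno runs from 0 to 8, so pc25 fires at rowno = CAST(8 * 0.25 AS INT) = 2 with a zero fractional part, giving quartile_1 = 3; likewise quartile_2 = 5 and quartile_3 = 7. For the ten values 1 to 10, N = 9, pc25 fires at rowno = 2 with fractional part 0.25 and diff = 1, giving quartile_1 = 3 + 0.25 * 1 = 3.25.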

MEDIAN

The median is just the special case of the 50th percentile, i.e. PERCENTILE_CONT(0.5).

In SQL:2008 there's no reason to define an extra function for that, but the formula above can be simplified for a median calculation:

If the number of rows N is odd, the median value is found in row (N/2) + 1; otherwise it's the average of this row's and the previous row's value.
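
For example, with made-up data: for N = 5 values 10, 20, 30, 40, 50, the qualifying row is 5/2 + 1 = 3 (integer division), so the median is 30; for N = 6 values 10, 20, 30, 40, 50, 60, row 6/2 + 1 = 4 qualifies and the result is the average of rows 3 and 4, i.e. (30 + 40) / 2 = 35. In SQL this translates to: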

-- MEDIAN aggregate function
SELECT
part_col
,MEDIAN(data_col)
FROM tab
GROUP BY part_col;

-- Teradata equivalent
SELECT
part_col
,CASE
WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col
ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING)
END
FROM tab
QUALIFY
ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= COUNT(*) OVER (PARTITION BY part_col) / 2 + 1;

-- MEDIAN OLAP function
SELECT
part_col
,MEDIAN(data_col) OVER (PARTITION BY part_col)
FROM tab;

-- Teradata equivalent
SELECT
part_col
,MIN(med) OVER (PARTITION BY part_col) AS "MEDIAN"
FROM
(
SELECT
part_col
,CASE
WHEN ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= COUNT(*) OVER (PARTITION BY part_col) / 2 + 1
THEN
CASE
WHEN COUNT(*) OVER (PARTITION BY part_col) MOD 2 = 1 THEN data_col
ELSE AVG(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS 1 PRECEDING)
END
END AS med
FROM tab
) AS dt

This MEDIAN calculation returns the same result as the corresponding function in Excel, and PERCENTILE_CONT reproduces Excel's PERCENTILE function.

Comments
Enthusiast
A year or so ago I was working on the same topic and came up with the following implementations. The mathematical definition used here may be a bit different from the SQL standard, but perhaps it is worth commenting on.

/*** "financial median": Group COUNT + ROW_NUMBER ***/
SELECT
dt1.part_col
,AVG(data_col)
FROM
(SELECT
part_col
,data_col
,ROW_NUMBER() OVER (PARTITION BY part_col
ORDER BY data_col) AS row_num
,COUNT(*) OVER (PARTITION BY part_col
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS row_count
FROM tab
) AS dt1
WHERE
row_num = (row_count + 1) / 2
OR
row_num = (row_count / 2) + 1
GROUP BY part_col
ORDER BY part_col
;

/*** "statistical median": Group COUNT + ROW_NUMBER ***/
SELECT
dt1.part_col
,data_col
FROM
(SELECT
part_col
,data_col
,ROW_NUMBER() OVER (PARTITION BY part_col
ORDER BY data_col) AS row_num
,COUNT(*) OVER (PARTITION BY part_col
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS row_count
FROM tab
) AS dt1
WHERE
row_num = (row_count + 1) / 2 --left (lesser) value
-- row_num = (row_count / 2) + 1 --right (greater) value
ORDER BY part_col
;
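
With made-up numbers: for the four values 10, 20, 30, 40 the "financial median" averages rows 2 and 3 and returns 25, while the "statistical median" with the left (lesser) option returns 20; for an odd row count both variants pick the same middle row.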

Junior Contributor
Hi Egor,
I know those queries, I wrote them a few years ago :-)

The 1st is exactly the same as the SQL:2008 calculation, I only changed the AVG calculation from aggregate to OLAP.

Dieter
Enthusiast
Hi. Is it possible to write an OLTP UDF in Teradata? Functions are good, but it is not convenient to use them...
Junior Contributor
Did you mean OLAP UDF?
Yes, you can write your own OLAP functions (in C) in TD13.10, there's a DENSE_RANK example in the "SQL External Routine Programming" manual.

But it's probably quite hard to do so unless you are an experienced C-programmer and find a good algorithm to implement...

Dieter
Enthusiast
Hi dnoeth, can you please help with this question?

I am working on the TD13 trial version.

CREATE SET TABLE tduser.jn1_emp ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
emp_no INTEGER,
emp_loc varchar(12))
Unique PRIMARY INDEX ( emp_no );

Insert into tduser.jn1_emp(1,'hyd');
Insert into tduser.jn1_emp(2,'bang');
Insert into tduser.jn1_emp(3,'visak');

Collect stats on tduser.jn1_emp index(emp_no);

CREATE SET TABLE tduser.jn2_emp ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
pme_no INTEGER,
emp_name varchar(12))
Unique PRIMARY INDEX ( pme_no );

Insert into tduser.jn2_emp(1,'raj');
Insert into tduser.jn2_emp(2,'ravi');
Insert into tduser.jn2_emp(4,'kishore');

Collect stats on tduser.jn2_emp index(pme_no);

If I try to execute the following, it gives "low confidence" in the explain plan. Can anybody suggest how to get it to "high confidence"?

Explain sel * from tduser.jn1_emp, tduser.jn2_emp
Where emp_no = pme_no;
Junior Contributor
I don't think you will be able to get high confidence for this join.

Adding a foreign key might help, but why do you actually want it?
The estimated number would probably not change.

Btw, you should post questions like this as a new topic in an appropriate forum.

Dieter
med
Fan
Hello, I need some help please.

I have a dataset similar to the following:

Patient_ID  Unit  Start_Time     End_time
1234        3FE   3/2/09 12:00   3/5/09 13:00
1234        3FE   5/5/09 13:00   5/5/09 14:00
1234        3FE   5/5/09 14:00   5/7/09 18:00
1234        ICU   5/7/09 18:00   5/16/09 5:00
1234        3FE   5/16/09 5:00   5/23/09 7:00
1234        3FE   5/23/09 7:00   5/25/09 3:00
1234        3FE   5/25/09 3:00   5/28/09 8:00

The result set should look like the following:

Patient_ID  Unit  Start_Time     End_time
1234        3FE   3/2/09 13:00   5/7/09 18:00
1234        ICU   5/7/09 18:00   5/16/09 5:00
1234        3FE   5/16/09 5:00   5/28/09 8:00

Any suggestions would be appreciated.

Junior Contributor
Answered at
http://forums.teradata.com/forum/database/qualify-rank-over-partition-question#comment-18797

Dieter
Teradata Employee

Hi, I'm an SA with Teradata Japan.

Please help me.

I am looking for a median UDF function. (The environment where I want to implement it is Teradata 14.0.)

If you have such a function, please send it to me.

I know that a MEDIAN function is implemented as of Teradata 14.10.
