Missing functions: CUME_DIST & NTILE

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Junior Contributor

In my previous post i covered some ranking functions including SQL:2008 RANK, ROW_NUMBER & DENSE_RANK.

But Standard SQL defines three additional functions based on ranking:

PERCENT_RANK and CUME_DIST as relative rankings and a percentile calculation named NTILE.

PERCENT_RANK

The well-known PERCENT_RANK calculates a relative rank based on the number of rows with a value less than the current value.

It returns a number in the range zero to one, assigning zero to the top ranked row:

PERCENT_RANK() OVER (PARTITION BY part_col ORDER BY data_col)

The calculation is based on this formula:

CASE 
WHEN COUNT(*) OVER (PARTITION BY part_col) = 1 THEN 0
ELSE
(CAST(RANK() OVER (PARTITION BY part_col ORDER BY data_col) AS FLOAT) - 1)
/(COUNT(*) OVER (PARTITION BY part_col) -1)
END AS "PERCENT_RANK"

PERCENT_RANK is implemented in Teradata, so let's move on to the missing functions.

CUME_DIST

CUME_DIST returns a relative rank, too, but uses a slightly different calculation based on the number of rows with a value less than or equal to the current value.

CUME_DIST() OVER (PARTITION BY part_col ORDER BY data_col ASC)

SQL:2008 defines an equivalent query for CUME_DIST:

CAST(COUNT(*)
OVER (PARTITION BY part_col
ORDER BY data_col
RANGE UNBOUNDED PRECEDING ASC) AS FLOAT)
/ COUNT(*) OVER (PARTITION BY part_col)

Btw, did anyone of you ever read the Standard SQL documents?

Or tried to comprehend what you read there?

An equivalent query like the previous one is a rare case, those papers usually have the wording of a law code and you must be a lawyer (or similar) to read more than a few paragraphs ;-)

The official documents from the ISO-organization are quite expensive, but searching the web reveals the latest drafts which are very close to those, e.g. http://www.wiscorp.com/SQLStandards.html features the SQL:20nn Working Draft Documents of the forthcoming SQL:2011.

Back to the CUME_DIST equivalent.

You'll recognize the definition of the first count from my previous post (#2 Modified competition ranking), it can be directly transformed into:

(COUNT(*) OVER (PARTITION BY part_col)
- CAST(RANK()
OVER (PARTITION BY part_col ORDER BY data_col DESC) AS FLOAT) + 1
) / COUNT(*) OVER (PARTITION BY part_col) AS "CUME_DIST"

Please note once more the reversed sort order.

NTILE

While CUME_DIST was quite simple NTILE is a real tough one. The name implies a similarity to Teradata's QUANTILE function, but the output might be totally different. If there are duplicate values then QUANTILE will put them in the same bucket whereas NTILE might split them into different groups (and the bucket numbers start at zero for QUANTILE and one for NTILE, but this is easy to fix). I would strongly suggest to reconsider if NTILE is actually needed.

Teradata's QUANTILE is a deprecated feature based on old syntax, thus the manuals show how to rewrite

QUANTILE(B,data_col asc)
...
GROUP BY part_col

to

B * (RANK() OVER (PARTITION BY part_col ORDER BY data_col) - 1) 
/ COUNT(*) OVER(PARTITION BY part_col) AS "QUANTILE"

Now replace the RANK in this formula by a ROW_NUMBER and the number of rows per bucket can differ by at most one, so it's almost the same as NTILE.

To get exactly the same output as NTILE you need to apply some additional logic:

If the number of rows N isn't divisible without reminder by the number of buckets B then the first N MOD B buckets get an extra row.

As i'm not a mathematician the shortest formula i found after some trial and error is the following one:

--NTILE(B) OVER (PARTITION BY part_col ORDER BY data_col) 

,COUNT(*) OVER(PARTITION BY part_col) AS N
,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) AS rowno
,CASE
WHEN rowno <= ((N/B)+1) * (N MOD B)
THEN (rowno-1) / ((N/B)+1)
ELSE (rowno-1 - (N MOD B)) / (N/B)
END + 1 AS "NTILE"

If anyone can simplify this please raise your hand.


In my next post i will discuss a MEDIAN calculation plus the missing SQL:2008 "inverse distribution functions" PERCENTILE_CONT and PERCENTILE_DISC.

6 Comments
N/A
Hi,
Is there a direct way to find the n'th percentile in a set of numbers just the like the way it is calculated in excel??
- Emkay
Junior Contributor
Hi Emkay,
Excel's PERCENTILE is the same as another missing function: PERCENTILE_CONT.

I just finished my post on it:
http://developer.teradata.com/blog/dnoeth/2011/06/missing-functions-percentile-disc-percentile-cont-median

Dieter
N/A
Thanks a lot Dieter
-Emkay
Enthusiast
Denoth,

Can you please explain how to prerform Tanslate & replace functions in Teradata?

Thnaks ,
Sai
Junior Contributor
Hi Sal,
you better don't try to rewrite translate & replace using plain Standard SQL.
Just use the existing oreplace/otranslate from the Oracle UDF package or wait for TD14.

Dieter
Teradata Employee

CUME_DIST is available in 14.10