Blog

The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.

turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

05-25-2011
08:53 AM

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

05-25-2011
08:53 AM

In my previous post i covered some ranking functions including SQL:2008 RANK, ROW_NUMBER & DENSE_RANK.

But Standard SQL defines three additional functions based on ranking:

PERCENT_RANK and CUME_DIST as relative rankings and a percentile calculation named NTILE.

The well-known PERCENT_RANK calculates a relative rank based on the number of rows with a value **less than** the current value.

It returns a number in the range zero to one, assigning zero to the top ranked row:

PERCENT_RANK() OVER (PARTITION BY part_col ORDER BY data_col)

The calculation is based on this formula:

CASE

WHEN COUNT(*) OVER (PARTITION BY part_col) = 1 THEN 0

ELSE

(CAST(RANK() OVER (PARTITION BY part_col ORDER BY data_col) AS FLOAT) - 1)

/(COUNT(*) OVER (PARTITION BY part_col) -1)

END AS "PERCENT_RANK"

PERCENT_RANK is implemented in Teradata, so let's move on to the missing functions.

CUME_DIST returns a relative rank, too, but uses a slightly different calculation based on the number of rows with a value **less than or equal** to the current value.

CUME_DIST() OVER (PARTITION BY part_col ORDER BY data_col ASC)

SQL:2008 defines an equivalent query for CUME_DIST:

CAST(COUNT(*)

OVER (PARTITION BY part_col

ORDER BY data_col

RANGE UNBOUNDED PRECEDING ASC) AS FLOAT)

/ COUNT(*) OVER (PARTITION BY part_col)

Btw, did anyone of you ever read the Standard SQL documents?

Or tried to comprehend what you read there?

An equivalent query like the previous one is a rare case, those papers usually have the wording of a law code and you must be a lawyer (or similar) to read more than a few paragraphs ;-)

The official documents from the ISO-organization are quite expensive, but searching the web reveals the latest drafts which are very close to those, e.g. http://www.wiscorp.com/SQLStandards.html features the SQL:20nn Working Draft Documents of the forthcoming SQL:2011.

Back to the CUME_DIST equivalent.

You'll recognize the definition of the first count from my previous post (#2 Modified competition ranking), it can be directly transformed into:

(COUNT(*) OVER (PARTITION BY part_col)

- CAST(RANK()

OVER (PARTITION BY part_col ORDER BY data_col DESC) AS FLOAT) + 1

) / COUNT(*) OVER (PARTITION BY part_col) AS "CUME_DIST"

Please note once more the reversed sort order.

While CUME_DIST was quite simple NTILE is a real tough one. The name implies a similarity to Teradata's QUANTILE function, but the output might be totally different. If there are duplicate values then QUANTILE will put them in the same bucket whereas NTILE might split them into different groups (and the bucket numbers start at zero for QUANTILE and one for NTILE, but this is easy to fix). I would strongly suggest to reconsider if NTILE is actually needed.

Teradata's QUANTILE is a deprecated feature based on old syntax, thus the manuals show how to rewrite

QUANTILE(B,data_col asc)

...

GROUP BY part_col

to

B * (RANK() OVER (PARTITION BY part_col ORDER BY data_col) - 1)

/ COUNT(*) OVER(PARTITION BY part_col) AS "QUANTILE"

Now replace the RANK in this formula by a ROW_NUMBER and the number of rows per bucket can differ by at most one, so it's *almost* the same as NTILE.

To get *exactly* the same output as NTILE you need to apply some additional logic:

If the number of rows N isn't divisible without reminder by the number of buckets B then the first N MOD B buckets get an extra row.

As i'm not a mathematician the shortest formula i found after some trial and error is the following one:

--NTILE(B) OVER (PARTITION BY part_col ORDER BY data_col)

,COUNT(*) OVER(PARTITION BY part_col) AS N

,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) AS rowno

,CASE

WHEN rowno <= ((N/B)+1) * (N MOD B)

THEN (rowno-1) / ((N/B)+1)

ELSE (rowno-1 - (N MOD B)) / (N/B)

END + 1 AS "NTILE"

If anyone can simplify this please raise your hand.

In my next post i will discuss a MEDIAN calculation plus the missing SQL:2008 "inverse distribution functions" PERCENTILE_CONT and PERCENTILE_DISC.

6 Comments

You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.