Blog

The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.

turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

05-16-2011
06:11 AM

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

05-16-2011
06:11 AM

Some of the SQL:2008 analytical functions are not implemented in Teradata, but most of them can be re-written using existing features.

In this post I will cover the ranking algorithms including DENSE_RANK.

The Wikipedia-article on Ranking lists five algorithms.

Of course all of them return the same output for unique data, but non-unique values will produce different results with duplicate numbers and gaps in the ranking.

Standard SQL proposes syntax for three out of those five variations, while Teradata only implements two of them. In this post I will focus on the missing syntax assuming you're familiar with the existing functions RANK and ROW_NUMBER.

A sequential number.

No duplicate values & no gaps, but non-deterministic, i.e. same row might get different ranking number when query runs a second time .

Implemented in SQL:2008 and Teradata as ROW_NUMBER.

One plus the number of rows with a value *less than the current value*.

Duplicate values & gaps.

Implemented in SQL:2008 and Teradata as RANK.

The number of rows with a value *less than or equal to the current value*.

Duplicate values & gaps.

There's no special syntax for this one in SQL:2008, but it's straightforward:

COUNT(*) OVER (PARTITION BY part_col ORDER BY data_col RANGE UNBOUNDED PRECEDING)

Unfortunately the RANGE syntax in the OLAP window definition is not implemented in Teradata. But *"number of rows with a value less than or equal to the current value"* can also be expressed using a RANK/COUNT combination:

RANK() OVER (PARTITION BY part_col ORDER BY data_col) - 1

+ COUNT(*) OVER (PARTITION BY part_col, data_col)

The previous calculation returns the expected result, but explain shows two "STAT FUNCTION" steps.

The most efficient formula with just a single step is based on a *reversed* order in the RANK:

COUNT(*) OVER (PARTITION BY part_col)

- RANK() OVER (PARTITION BY part_col ORDER BY data_col DESC) + 1

One plus the number of rows with a value* less than the current value* *plus half the number of items equal* to it.

Duplicate values & gaps.

This is also easy to translate into SQL, but the optimizer has to add another STATS step again, which will double runtime:

RANK() OVER (PARTITION BY part_col ORDER BY data_col)

+ (0.5 * (COUNT(*) OVER (PARTITION BY part_col, data_col) - 1))

The number of distinct values less than or equal to the current value.

Duplicate values, but no gaps.

It's the SQL:2008 DENSE_RANK function, which can be expressed in different ways in Teradata, but all of them are much slower than the built-in RANK:

SELECT

t1.part_col

,t1.data_col

,dt.rnk AS "DENSE_RANK"

FROM tab AS t1

JOIN

(

SELECT

part_col

,data_col

,RANK() OVER (PARTITION BY part_col ORDER BY data_col) AS rnk

FROM tab

GROUP BY 1,2

) AS dt

ON t1.part_col = dt.part_col

AND t1.data_col = dt.data_col;

This query has to access the same table or view twice, which is a big disadvantage if it there are joins or complex conditions included.

Following solutions avoid double access, but are based on nested OLAP steps:

SELECT

part_col

,data_col

--,DENSE_RANK() OVER (PARTITION BY part_col ORDER BY data_col) -- not implemented

,SUM(CASE WHEN data_col = prev_data_col THEN 0 ELSE 1 END)

OVER (PARTITION BY part_col

ORDER BY data_col, prev_data_col

ROWS UNBOUNDED PRECEDING) AS "DENSE_RANK"

FROM

(

SELECT

part_col

,data_col

,MIN(data_col)

OVER (PARTITION BY part_col

ORDER BY data_col

ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_data_col

FROM tab

) AS dt;

SELECT

part_col

,data_col

--,DENSE_RANK() OVER (PARTITION BY part_col ORDER BY data_col) -- not implemented

,SUM(CASE WHEN rowno = 1 THEN 1 ELSE 0 END)

OVER (PARTITION BY part_col

ORDER BY data_col, rowno

ROWS UNBOUNDED PRECEDING) AS "DENSE_RANK"

FROM

(

SELECT

part_col

,data_col

,ROW_NUMBER()

OVER (PARTITION BY part_col, data_col

ORDER BY data_col) AS rowno

FROM tab

) AS dt;

If another analytical function must be added to the query you should check if it shares the same PARTITION and ORDER definition and choose the matching version.

Btw, in TD13.10 there's a new "Window Aggregate UDF" feature which allows to create your own OLAP function and the example source code in the manuals is a DENSE_RANK ;-)

The next post in this small series on missing analytical functions will discuss additional SQL:2008 functions based on relative rankings:

CUME_DIST and NTILE.

9 Comments

You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.