Derived Tables with Ordered Analytical Functions to Translate Flat Logic

Blog
The best minds from Teradata, our partners, and customers blog about whatever takes their fancy.
Teradata Employee

Ordered analytical functions enable parallel processing by freeing processes from flat logic. They also produce simpler Set SQL statements, which as a general rule can be expected to perform better than more complex Set SQL statements.

These functions simplify programming associated with sequential processing and avoid the use of cursor loops and offline processing. Also known as windowing or OLAP functions, ordered analytical functions enable the programmer to create Set SQL processes that depend on data being sequenced or grouped in some way.

An ordered analytical function sorts, groups or aggregates subsets of columns within a select statement while keeping other selected columns intact. The following example includes ROW_NUMBER, a sorting function:

Select Employee_Name, ROW_NUMBER() Over(Order By Employee_Name) from Employees;

This produces a list of names together with a number that represents their alphabetical ranking. The ROW_NUMBER() function is often used to generate sequentially assigned surrogate keys.

The PARTITION BY and ROWS BETWEEN phrase of the OVER() clause can be used to aggregate rows over various groupings or "windows."  Each row in this result set contains an account's daily balance together with its monthly minimum and maximum balance:

Select Account_Number,

Balance_Date,

Daily_Balance,

 MIN(Daily_Balance)

Over(Partition by Account_Number, Balance_Month) as Monthly_Min,

 MAX(Daily_Balance)

Over(Partition by Account_Number, Balance_Month) as Monthly_Max,

From Account_Balances;

The following statement produces a list of measurements over time together with their increments or decrements. In this case the "group" is just one row - the row preceding the current row in time sequence:

Select Time_Of_Day,

Height,

Height - MIN(Height)

Over(Order by Time_of_Day Rows Between 1 preceding and 1 preceding)

as Delta

from Rocket_Position;

The statement could just as well use MAX() or SUM() instead of MIN().  It really only needs to retrieve "THE(Height) Over( <the prior row in time sequence> )," but there is no THE() function. That's okay,  because MIN(), MAX() and SUM() each retrieve the same result over one row.

If it weren't for ordered analytical functions, the processes in the Row Number and Height Delta examples would require sequence-oriented cursor loops. The Account Balance process would require a product join of a table to two aggregations of itself.

Three clues indicate that ordered analytical functions will be required in the Set SQL translation of a cursor-based process. The first is odd groupings or counters that cannot be clearly expressed with a simple Set SQL GROUP BY clause. Another is complex If-Then-Else statements that appear difficult or impossible to translate to Set SQL CASE operations. The third is the use of ORDER BY in the cursor, since ORDER BY usually suggests that rows need to be processed in a certain sequence. (Note that it is not uncommon for programmers to use  ORDER BY when a sorting function is not necessary. Be sure to read the logic to determine that the sequence of rows matters.)

The following flat logic example contains all these complexities. This is written in Teradata Stored Procedure Language (SPL):

Declare v_ID Integer;

Declare CountLimit1 Integer Default 0;

Declare CountLimit2 Integer Default 0;

Declare CountLimit3 Integer Default 0;

DECLARE CountCursor CURSOR FOR

    Select Sum(DT.Limit2)

    From (Select (Case When (Customer.HM = 'KX' ) Then 1 Else 0 End ) Limit2

          From Customer) DT ;

DECLARE SplitCursor CURSOR FOR

    Select Customer.ID,

        (Case When (Customer.DC = 1) Then 1 Else 0 End ) CRITERIA1,

        (Case When (Customer.HM = 'KX' ) Then 1 Else 0 End ) CRITERIA2

    From Customer

    Order By Customer.CustomerSince Asc;

OPEN CountCursor;

FETCH CountCursor INTO CountLimit2;

CLOSE CountCursor;

SET PercentLimit2 = CAST(CountLimit2 * 0.3 AS INTEGER); -- Rounding Off

SET CountLimit2 = 0;

OPEN SplitCursor;

FETCH SplitCursor INTO v_ID, v_crit1, v_crit2;

WHILE (SQLCODE=0) DO

    IF v_crit1 > 0 AND CountLimit1 < 120

    THEN

        SET CountLimit1=CountLimit1 + 1;

        INSERT INTO ResultTable VALUES (v_ID,'Code_0001');

    ELSEIF v_crit2 > 0 AND CountLimit2 < PercentLimit2

    THEN

        SET CountLimit2=CountLimit2 + 1;

        INSERT INTO ResultTable VALUES (v_ID,'Code_0002');

    ELSE

        SET CountLimit3=CountLimit3 + 1;

        INSERT INTO ResultTable VALUES (v_ID,'Code_0003');

    END IF;

    FETCH SplitCursor INTO v_ID, v_crit1, v_crit2;

END WHILE;

CLOSE SplitCursor;

END P1;

This logic is easier to grasp by rewriting it in pseudo code. Pseudo code is not a programming language but a sort of short hand that you can make up as you go along. The point of this exercise is that reviewing the logic during the rewrite process highlights the tricky parts. Here is one pseudo code version:

CountCursor =

Select Sum(DT.Limit2) From (Select (Case When (Customer.HM = 'KX' ) Then 1 Else 0 End ) Limit2 From Customer) DT ;

# FETCH CountCursor INTO CountLimit2;

# SET PercentLimit2 = CAST(CountLimit2 * 0.3 AS INTEGER);

PercentLimit2 =

    ( Sum ( Select Case When Customer.HM = 'KX' Then 1 Else 0 End From Customer )

    ) * 0.3 (integer); # CountCursor

CountLimit1 = CountLimit2 = CountLimit3 = 0;

SplitCursor =

    Select Customer.ID,

    (Case When Customer.DC = 1 Then 1 Else 0 End ) CRITERIA1,

    (Case When Customer.HM = 'KX' Then 1 Else 0 End ) CRITERIA2

From Customer

Order By Customer.CustomerSince Asc ;

For each row in SplitCursor (squenced by CustomerSince) do

    IF Customer.CRITERIA1 > 0 AND CountLimit1 < 120

        increment CountLimit1

        INSERT INTO ResultTable VALUES (Customer.ID,'Code_0001');

    ELSEIF Customer.CRITERIA2 > 0 AND CountLimit2 < PercentLimit2

        increment CountLimit2

        INSERT INTO ResultTable VALUES (Customer.ID,'Code_0002');

    ELSE

        increment CountLimit3

        INSERT INTO ResultTable VALUES (Customer.ID,'Code_0003');

    END IF;

EndDo;

Write the pseudo code in a way that is comfortable for you and makes it easy to visually cross-reference the different parts of the code. In this case, I commented-out the FETCH and SET commands on CountCursor when I realized that CountCursor is only used to compute PercentLimit2 and that I could write that as a single expression. Once we have that expression, we can basically ignore CountCursor.

The complex logic in the loop on SplitCursor becomes obvious as we write the pseudo code because it highlights the reason for the ORDER BY clause. Unlike CountCursor, there is no straightforward way to simplify its use.

For each row in the cursor we do one of three inserts and increment one of three counters.  All three inserts go into the same target table. There will be no more than 120 rows of 'Code_0001' inserted, and there will be no more than "PercentLimit2" rows inserted with 'Code_0002'. The rest will be 'Code_0003'.  Note that CountLimit3 is incremented but not used anywhere else in the procedure, so we can ignore that variable.

Now let's make a start at writing the INSERT Set SQL statement. This still contains some pseudo code.

INSERT INTO ResultTable

Select Customer.ID

    ,CASE

        when Customer.Criteria1 > 0 and CountLimit1 < 120

        then 'Code_0001'

        when Customer.Criteria2 > 0 and CountLimit2 < PercentLimit2

        then 'Code_0002'

        else 'Code_0003'

     END

FROM [SplitCursor-With-Counters-Ordered-By-Customer-Since]

CROSS JOIN [CountCursor-to-get-PercentLimit2]

;

Just as in my earlier blog entries, we represent the cursors from the procedural code as derived tables. The derived tables are named after the cursors they represent; this will make it easier to correlate the Set SQL logic to the procedural logic.

This INSERT statement will do one of three things, as in the procedural code. One of the values it needs is PercentLimit2, and it can get this from a separate SQL statement that will compute this constant in the derived table we are calling CountCursor.

The CountCursor derived table will be built from this SQL statement that was developed in the pseudo code:

Select cast(

    Sum (

        Case When Customer.HM = 'KX' Then 1 Else 0 End

    ) * 0.3 as INTEGER ) PercentLimit2

  From Customer

"CROSS JOIN" is used because the CountCursor derived table contains one constant row value that is used in all the other row expressions. We could have used a comma-separated list of derived tables in the FROM clause, but the use of the CROSS JOIN keywords makes it clear to the reader that we know this is a product join and we are doing it for good reason.

Since PercentLimit2 is a constant that is used in the main select statement, and since we are writing a stored procedure, we could just as easily have selected that computed value into a host variable and used that host variable in the main SQL statement. Either method works.

The challenging part of this logic involves the way the flat logic uses SplitCursor. Here we need to create a derived table, SplitCursor, which calculates counters based on the sequence Customer-Since.

Consider the first conditional insert in the flat logic cursor loop, which requires the first 120 rows that meet Criteria1. This SELECT statement generates CountLimit1, which will be used to identify those rows:

Select ID

  ,( Case When ( DC = 1 ) Then 1 Else 0 End ) CRITERIA1

  ,sum ( CRITERIA1 ) over ( order by CustomerSince Asc

    rows between unbounded preceding and current row ) - 1 CountLimit1

  ,HM

  ,CustomerSince

From Customer

The sum of CRITERIA1 represents CountLimit1 in the cursor loop; it is decremented by 1 so that it begins with 0, as in the procedure. Each of the rows retrieved here will contain a distinct CountLimit1 value, starting with 0.

Now for the hard part: we also need to find the first PercentLimit2 rows that meet CRITERIA2, but we cannot include the CRITERIA1 rows. When the flat logic is looping through the first 120 rows that meet CRITERIA1, some of them might also meet CRITERIA2, but the logic ignores those. CRITERIA1 takes precedence until 120 rows have been selected.

Therefore, the Set SQL must be constructed such that the selections for CRITERIA1 and CRITERIA2 are separate. The CRITERIA1 rows must be selected first, and then the CRITERIA2 rows can be selected from the remaining rows. This means that CRITERIA2 rows must be selected from a derived table that identifies the CRITERIA1 rows.

Select ID as v_ID

  ,CRITERIA1 as v_crit1

  ,CountLimit1

     /* This CASE skips over the CRITERIA1 rows: */

  ,( Case When ( HM = 'KX' AND ( CountLimit1 >= 120 OR CRITERIA1 <> 1 ) )

     Then 1 Else 0 End ) as v_crit2

  ,sum ( v_crit2 ) over ( order by CustomerSince Asc

    rows between unbounded preceding and current row ) - 1 as CountLimit2

FROM (

    /* Find all the CRITERIA1 rows first */

    Select ID

      ,( Case When ( DC = 1 ) Then 1 Else 0 End ) CRITERIA1

      ,sum ( CRITERIA1 ) over ( order by CustomerSince Asc

        rows between unbounded preceding and current row ) - 1 CountLimit1

      ,HM  /* Needed to compute CRITERIA2 */

      ,CustomerSince

    From Customer

) Crit1_Count

For integration into the main body of the SQL, name the columns of the derived table after the variable names that are used in the procedural code. This makes it easier to see the correlation between the set logic and the flat logic.

Finally, turning this into an INSERT statement results in a derived table that contains another derived table (CRITERIA1).

INSERT INTO ResultTable

SELECT

   SPLITCURSOR.v_ID

  ,case

    when SPLITCURSOR.v_crit1 > 0

      and SPLITCURSOR.CountLimit1 < 120

    then 'Code_0001'

    when SPLITCURSOR.v_crit2 > 0

      and SPLITCURSOR.CountLimit2 < COUNTCURSOR.PercentLimit2

    then 'Code_0002'

    else 'Code_0003'

   end

FROM

 (

    Select ID as v_ID

      ,CRITERIA1 as v_crit1

      ,CountLimit1

      ,( Case When ( HM = 'KX' AND ( CountLimit1 >= 120 OR CRITERIA1 <> 1 ) )

         Then 1 Else 0 End ) as v_crit2

      ,sum ( v_crit2 ) over ( order by CustomerSince Asc

        rows between unbounded preceding and current row ) - 1 as CountLimit2

    FROM (

        Select ID

          ,( Case When ( DC = 1 ) Then 1 Else 0 End ) CRITERIA1

          ,sum ( CRITERIA1 ) over ( order by CustomerSince Asc

            rows between unbounded preceding and current row ) - 1 CountLimit1

          ,HM

          ,CustomerSince

        From Customer

    ) Crit1_Count

 ) SPLITCURSOR

CROSS JOIN (  /* CROSS JOIN makes the intent clear to the reader */

  Select cast( Sum ( Case

            When ( Customer.HM = 'KX' ) Then 1

            Else 0

            End

        ) * 0.3 as INTEGER) PercentLimit2

      From Customer

 ) COUNTCURSOR ( PercentLimit2 )

;

If Customer is a large table (on the order of 10,000 rows or more), this Set SQL will perform many times faster than the flat logic in the stored procedure that uses a cursor loop, because Set SQL keeps the data in the database, enabling parallel processing.

Is there a simpler way to do this?  Does this logic truly require nested derived tables? It is not difficult to make up some test data, run both the original procedure and the Set SQL and compare the results.  For testing, you can change the limit on CountLimit1 to a smaller number like 5. Comments are welcome!

Ordered analytical functions are documented in the Teradata SQL Functions manual. There are several excellent Teradata-Partners presentations on the topic, and another blog on this site, Über SQL, documents several useful examples.