Ordered Analytical Functions: Translating SQL Functions to Set SQL

Teradata Employee

Many experienced SQL and procedure developers use SQL functions to perform common tasks and simplify their code narratives. The function concept is a good programming technique, but adopting it literally in a Set SQL statement may force the statement to execute as a serial process; that is, it can make parallel processing impossible.  This is one reason Teradata SQL does not allow functions to access tables.

There are many types of user-defined functions (UDFs), and they can be written in C/C++, Java, or SQL.  The functions that concern us here are scalar functions that can be written in the procedural languages of other database systems, such as Oracle PL/SQL and Microsoft SQL Server T-SQL.  Teradata SPL (Stored Procedure Language) offers no facility for writing a scalar function, so there is no direct method of translating this sort of code.

However, most functions that contain only logic and arithmetic can be translated to Teradata SQL functions, and these perform very well.  Deploying them is different, though, because they cannot simply be defined within a stored procedure: they are created with the Create Function SQL DDL command, and permissions are granted just as with any other function.

Functions that issue SQL commands cannot be translated to Teradata functions, but the data access and logic in such functions can almost always be translated to a form of Set SQL that is incorporated in the main query as a derived table or a view.

To illustrate these methods, let's look at some examples of logic implemented with functions in other database systems and the translation to Teradata set-level processing.

First, consider an example of an Update statement in a procedure using a function that could be used in parallel processing. The Update statement calls a function that contains only logic and does not access any tables.

PROCEDURE Update_Table1 IS

 FUNCTION Get_Desc_ID(Descr VARCHAR2) RETURN INTEGER IS
  DescID INTEGER := 0;
 BEGIN
  IF Descr = 'High' THEN DescID := 4;
  ELSIF Descr = 'Medium' THEN DescID := 3;
  ELSIF Descr = 'Low' THEN DescID := 2;
  ELSIF Descr = 'Trivial' THEN DescID := 1;
  END IF;
  RETURN DescID;
 END Get_Desc_ID;

BEGIN
 UPDATE Table1
 SET DescID = Get_Desc_ID(Table1.Descr);
END Update_Table1;

In Teradata SQL we can create an equivalent SQL function:

REPLACE FUNCTION Get_Desc_ID(Descr VARCHAR(16)) RETURNS INTEGER
LANGUAGE SQL
CONTAINS SQL
RETURNS NULL ON NULL INPUT
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN CASE Descr WHEN 'High' THEN 4
 WHEN 'Medium' THEN 3
 WHEN 'Low' THEN 2
 WHEN 'Trivial' THEN 1
 ELSE 0
END;

Note that there are several phrases required in Teradata SQL Function Create statements, but most of these will be the same for every SQL function.
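Because the function contains only logic, its behavior is easy to verify outside the database. Here is the same mapping sketched in Python, purely as a cross-check (the function name and default are taken from the example above):

```python
# The same pure CASE logic as the SQL function, with 0 as the default.
def get_desc_id(descr):
    return {"High": 4, "Medium": 3, "Low": 2, "Trivial": 1}.get(descr, 0)

print([get_desc_id(d) for d in ("High", "Medium", "Low", "Trivial", "Other")])
# [4, 3, 2, 1, 0]
```

Because the mapping is pure logic with no data access, a parallel database can evaluate it independently on every row.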

Now consider a different implementation of this function that does a table lookup rather than burying the values within the function code:

PROCEDURE Update_Table2 IS

 FUNCTION Get_Desc_ID(Descr VARCHAR2) RETURN INTEGER IS
  DescID INTEGER := 0;
 BEGIN
  SELECT D.DescID INTO DescID
  FROM DESC_TABLE D
  WHERE D.Descr = Get_Desc_ID.Descr;
  RETURN DescID;
 END Get_Desc_ID;

BEGIN
 UPDATE Table2
 SET DescID = Get_Desc_ID(Table2.Descr);
END Update_Table2;

This function issues an SQL command. If the database system executes this literally as written, then, for each row in Table2, it will have to scan the DESC_TABLE looking for a match. This may not be much overhead for a sequential database that can buffer all of DESC_TABLE so that it stays in memory, but a parallel processing database would need to buffer a copy of DESC_TABLE for each unit of parallelism.  When the process becomes more complex, as we shall see, buffering becomes impractical.

As noted, Teradata functions cannot access tables, so the table access in this function is rewritten as a join in the main query, like this:

UPDATE Table2
FROM DESC_TABLE
SET DescID = DESC_TABLE.DescID
WHERE Table2.Descr = DESC_TABLE.Descr;
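To see why the join form is set-friendly, here is a small Python sketch; the in-memory tables are hypothetical stand-ins for DESC_TABLE and Table2. Instead of probing the lookup table once per row inside a function call, the join builds one shared lookup and applies it across the whole set:

```python
# Hypothetical stand-ins for DESC_TABLE and Table2 from the example.
desc_table = [("High", 4), ("Medium", 3), ("Low", 2), ("Trivial", 1)]
table2 = [{"Descr": "High"}, {"Descr": "Low"}, {"Descr": "Medium"}]

# The join rewrite: build the lookup once, then update every row from it.
lookup = dict(desc_table)
for row in table2:
    row["DescID"] = lookup[row["Descr"]]

print([(r["Descr"], r["DescID"]) for r in table2])
# [('High', 4), ('Low', 2), ('Medium', 3)]
```

A parallel database does the analogous thing with a join plan: one pass over the data, no per-row function invocation against the lookup table.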

To summarize to this point:

1. Teradata functions can be written in SQL language, but they cannot access tables. This limits their power to what can be expressed in pure logic (CASE operations), arithmetic and data type transformations.

2. If a function from another database system is being converted to Teradata, and it accesses data, then it must be rewritten as something that can be joined to the SQL statements that use it.

Now consider a complex function that contains both logic and DML statements: one that computes the average daily balance of a bank account from all the transactions in a given month.  Think of a table of transactions that looks like this:

Account Date Time Amount Balance
1 2013-10-02 15:32:33 1000.00 2500.00
1 2013-10-02 16:44:44  500.00 3000.00
1 2013-10-02 16:44:45  700.00 3700.00
1 2013-10-03 12:00:00 -700.00 3000.00
2 2013-10-04 09:12:34 1000.00 1000.00
2 2013-10-04 12:00:00 -500.00  500.00
3 2013-10-14 12:00:00 -500.00  500.00

The average daily balance is the sum of all the balances in an account for each day in a month, divided by the number of days in a month.  The transaction table does not contain a record for each day of the month, but we know that the balance is constant between two sequential transactions.  So if the balance in Account 1 is 3000 on October 3, it is also 3000 on October 4, 5, and so on until the next transaction.  We can quickly add the balances for each day between transactions by multiplying the balance after the first transaction by the number of days between transactions.

This gives the balance from the day of the first transaction through the end of the month, but we still need to compute the balance for the days preceding the first transaction.  We know that it must be the balance shown in the first transaction, minus the transaction amount, and we can multiply this by the number of days preceding that transaction for the month.

Given the transaction data for Account 1, we can visualize the balance for each day like so:

2013-10-01 1500.00
2013-10-02 3700.00
2013-10-03 3000.00
2013-10-04 3000.00
2013-10-05 3000.00
...
2013-10-30 3000.00
2013-10-31 3000.00

The average daily balance for Account 1 is ( 1500 + 3700 + 3000 + ... + 3000 ) / 31, or, stated more simply, ( 1500 + 3700 + 3000*29 ) / 31.

The average daily balance for Account 2 is ( 0*3 + 500*28 ) / 31.  The average daily balance for Account 3 is ( 1000*13 + 500*18 ) / 31.
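These hand computations are easy to cross-check outside the database; a quick Python verification of the three averages (October has 31 days):

```python
# Average daily balance for each sample account, October 2013 (31 days).
days = 31
acct1 = (1500 + 3700 + 3000 * 29) / days
acct2 = (0 * 3 + 500 * 28) / days
acct3 = (1000 * 13 + 500 * 18) / days
print(round(acct1, 2), round(acct2, 2), round(acct3, 2))
# 2974.19 451.61 709.68
```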

Here is the pseudo code for a function that would compute this. It requires that transactions be sorted by Account Date and Time.

Read the first transaction date, amount and balance for Account X in Month M.
Multiply ( balance - amount) by the number of days in the month preceding the transaction.
Store the result in Accum_Balance.
Store the date and balance from the first transaction.
DO:
 Read next transaction; Exit loop if no-more-transactions
 If the date is different from the prior transaction:
  multiply the days between this transaction and the prior by the prior balance.
  Add the result to Accum_Balance.
 Save the date and balance.
END-DO
Multiply the balance from the last transaction by the number of days remaining in the month.
Add the result to Accum_Balance.
Return Accum_Balance divided by the number of days in Month M.
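Before translating the pseudo code, it is worth confirming that it reproduces the hand-computed figures. Here is a direct sequential rendering in Python; the transaction tuples are the sample rows from the table above, expressed as (day-of-month, amount, balance):

```python
# A direct, sequential rendering of the pseudo code above.
# txns: (day-of-month, amount, balance) tuples, sorted by date and time,
# all within a single month.
def avg_daily_balance(txns, month_days):
    first_day, first_amt, first_bal = txns[0]
    # Balance before the first transaction, counted for the preceding days.
    accum = (first_bal - first_amt) * (first_day - 1)
    prev_day, prev_bal = first_day, first_bal
    for day, amt, bal in txns[1:]:
        if day != prev_day:
            # Prior balance held from prev_day up to the day before this one.
            accum += prev_bal * (day - prev_day)
        prev_day, prev_bal = day, bal
    # The last balance carries through the end of the month.
    accum += prev_bal * (month_days + 1 - prev_day)
    return accum / month_days

# Account 1's October transactions from the sample table.
acct1 = [(2, 1000, 2500), (2, 500, 3000), (2, 700, 3700), (3, -700, 3000)]
print(round(avg_daily_balance(acct1, 31), 2))  # 2974.19
```

Running it on Account 1 gives 92200/31, matching the arithmetic above, so this is the result any Set SQL version must reproduce.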

There are three challenges with this function: it contains complex logic, it reads a table, and the data has to be processed sequentially.

The fact that it reads a table tells us that this function should be implemented as a derived table within any Set SQL statement that needs to use it.  If it is used in several places, it might be stated as a view.

The fact that this process needs to see data in a particular sequence tells us that, if we can do it in Set SQL, it will have to use an ordered analytical function.

Take the steps in the function one at a time and consider how they might be done in Set SQL.  The first step is to compute the balance before the first transaction.  This requires that we read the first transaction.  The SQL will look like this:

select acctid, txndate, bal - amt
 ,extract(day from txndate) - 1 as NrDays
from TxnTable
qualify [transaction date and time]
 = min( [transaction date and time] ) over(partition by AcctID)
 and extract(month from txndate) = :vMonth

The predicate has to use an ordered analytical function to identify the first transaction date-time, so the keyword QUALIFY is used instead of WHERE.

To handle the loop on transactions by Account and date-time, we can use another ordered analytical function:

select acctid, txndate, bal
 ,coalesce(
  extract ( day from
   max(txndate) over(partition by AcctID order by txndate
    rows between 1 following and 1 following)
  ),
  :vMonthDays + 1) NextDay
 ,NextDay - extract(day from txndate) as NrDays
from
 ( select acctid, txndate, bal from txntable
  qualify txntime = max(txntime) over(partition by AcctID, txndate)
   and extract(month from txndate) = :vMonth
 ) M

The expression:

  extract ( day from
   max(txndate) over(partition by AcctID order by txndate
    rows between 1 following and 1 following) )

returns the day of the month of the next transaction following the current row, so each balance is counted for NextDay minus the current day number of days. If this is the last transaction of the month, there is no next transaction; coalesce() then supplies one more than the last day of the month, which makes the count run through the end of the month.

The derived table: 

   select acctid, txndate, bal from txntable
  qualify txntime = max(txntime) over(partition by AcctID, txndate)
   and extract(month from txndate) = :vMonth

contains the last transaction of each day; the other transactions that occurred that day are not needed to calculate the ending balance for the day.
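To see what these two window steps produce, here is a small Python emulation on Account 1's sample rows (tuples of day, time, balance). It keeps the last transaction per day, then pairs each daily balance with the number of days it remains in effect, counting the final balance through the end of the month:

```python
MONTH_DAYS = 31  # October

# Account 1's transactions as (day-of-month, time, balance) tuples.
rows = [(2, "15:32:33", 2500), (2, "16:44:44", 3000),
        (2, "16:44:45", 3700), (3, "12:00:00", 3000)]

# Step 1: like QUALIFY txntime = max(txntime) OVER (PARTITION BY acct, day),
# keep only the last transaction of each day.
last_per_day = {}
for day, time, bal in rows:
    if day not in last_per_day or time > last_per_day[day][0]:
        last_per_day[day] = (time, bal)

# Step 2: each daily balance holds until the day before the next transaction;
# the last balance holds through the end of the month.
days = sorted(last_per_day)
result = []
for i, day in enumerate(days):
    next_day = days[i + 1] if i + 1 < len(days) else MONTH_DAYS + 1
    result.append((day, last_per_day[day][1], next_day - day))

print(result)  # [(2, 3700, 1), (3, 3000, 29)]
```

Weighted by these day counts, plus the beginning-of-month piece (1500 for one day), the balances sum to 92200, in line with the hand computation for Account 1.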

Now we have the balances and the number of days for each balance. The UNION set operator combines the calculation for the beginning of the month with the calculations on the other days.  All that remains is to add these numbers together and divide by the number of days in the month; this is easy if we place the UNION of the two computations in a derived table. 

 select acctid, sum(bal*NrDays)/:vMonthDays
 from (
  select acctid, txndate, bal
   ,coalesce(
    extract ( day from
    max(txndate) over(partition by AcctID order by txndate
      rows between 1 following and 1 following)
    ),
    :vMonthDays + 1) NextDay
   ,NextDay - extract(day from txndate) as NrDays
  from
   ( select acctid, txndate, bal from txntable
   qualify txntime = max(txntime) over(partition by AcctID, txndate)
        and extract(month from txndate) = :vMonth
   ) M
 union
  /* Compute balance from the beginning of the month to 1st transaction */
  select acctid, txndate, bal - amt
   ,0 NextDay
   ,extract(day from txndate) - 1 as NrDays
  from txntable
  qualify cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) )
   = min( cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) ) )
   over(partition by AcctID)
      and extract(month from txndate) = :vMonth
 ) T (acctid, txndate, bal, NextDay, NrDays)
 group by 1

If this Select statement were a view, it would be easy to generate a report using a Select statement or update the Accounts table with an Update-join.  However, a view cannot contain host variables. This Select uses the two host variables vMonth and vMonthDays, which tell us the month and how many days it has.  One way to solve this is to assume this will be run only on the prior month's data: then one could write a couple of simple SQL functions like Get_Desc_ID, above, that return the prior month and the number of days.  Another option is to put these values in a lookup table and join to that table in the outer select statement.
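For the prior-month approach, the two values are simple to derive. Here is that calculation sketched in Python (the function name prior_month is mine; the SQL-function equivalents suggested above would return the same two values):

```python
import calendar
import datetime

def prior_month(today):
    """Return (month number, days in that month) for the month before today."""
    last_of_prior = today.replace(day=1) - datetime.timedelta(days=1)
    n_days = calendar.monthrange(last_of_prior.year, last_of_prior.month)[1]
    return last_of_prior.month, n_days

print(prior_month(datetime.date(2013, 11, 15)))  # (10, 31)
```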

If this does not have to be a view, then it can be a derived table in a stored procedure that takes the month (and perhaps the number of days) as parameters.  An Update statement in such a procedure would look like this: 

UPDATE Cust_Table
FROM
(
 select acctid, sum(bal*NrDays) / :vMonthDays
 from (
  select acctid, txndate, bal
   ,coalesce(
    extract ( day from
    max(txndate) over(partition by AcctID order by txndate
      rows between 1 following and 1 following)
    ),
    :vMonthDays + 1) NextDay
   ,NextDay - extract(day from txndate) as NrDays
  from
   ( select acctid, txndate, bal from txntable
   qualify txntime = max(txntime) over(partition by AcctID, txndate)
        and extract(month from txndate) = :vMonth
   ) M
 union
  /* Compute balance from the beginning of the month to 1st transaction */
  select acctid, txndate, bal - amt
   ,0 NextDay
   ,extract(day from txndate) - 1 as NrDays
  from txntable
  qualify cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) )
   = min( cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) ) )
    over(partition by AcctID)
      and extract(month from txndate) = :vMonth
 ) T (acctid, txndate, bal, NextDay, NrDays)
 group by 1
) AvgDailyBal (AcctID, Bal)

SET ADB = AvgDailyBal.Bal
WHERE Cust_Table.AcctID = AvgDailyBal.AcctID;

We frequently encounter stored procedures that contain complex PL/SQL or T-SQL functions, but functions are certainly not required for complex operations.  Rewriting them as views or derived tables, using ordered analytical functions when needed, simplifies code maintenance and enables parallel set processing.

3 Comments
Teradata Employee

Thanks George for this excellent information. I wish I had read your post a few months back - it would have saved me a lot of headaches on one of our migration projects!!

regards

rajesh srivastava

MCOE , Teradata

This looks hard, but I'm excited to try it. 

I've got a cursor that is calling a query which contains four joined select statements and each contains a scalar sub-query that uses an analytical function. The query is very expensive (currently takes 106 seconds per execution, and my cursor calls it 3,000 times). Current processing time is 4 days (unacceptable performance). 

This looks like it could be my answer. Thanks for the post.

Teradata Employee

Yes, this does sound hard, but it sounds like something you can solve with some patience.  I would hope you could find a way to turn those scalar subqueries into selects that can be joined with the other selects.  The result might be something that looks real complex to everyone else, but you could write comments to explain how it works or what it is supposed to accomplish.  In fact, sometimes just documenting a process gives one an idea of how to implement it in Set SQL.