In database multiple variable linear regression using the CM_Solve Table Operator


In a prior article [1] we described how to use the Teradata 14.10 CalcMatrix table operator together with R to perform a multiple variable linear regression analysis. This article extends that concept with a complete in-database solution by introducing a new in-database table operator named “CM_Solve”. This approach has value when you want to solve a large number of independent systems of equations, or when you simply do not want to use an R client to solve the system of equations based on the SSCP matrix.

CM_Solve Table Operator

CM_Solve is a user-defined table operator that must be installed on the system prior to use. It takes as input the output of the CalcMatrix table operator and returns the solution coefficients for the linear model that best fits the input system of equations. The following sections describe the CREATE DDL and the invocation DML.

CM_Solve DDL

The create DDL is as follows:

REPLACE FUNCTION CM_Solve()
 RETURNS TABLE VARYING USING FUNCTION cm_solve_contract
 LANGUAGE CPP NO SQL
 PARAMETER STYLE SQLTABLE
 EXTERNAL NAME 'CS!cm_solve!./cm_solve.cpp'

The source code for CM_Solve is in the attached cm_solve.cpp file. The performance of CM_Solve can be improved by compiling with the gcc -O3 option, although, as the performance testing section shows, CM_Solve is usually only a small component of the overall processing time. Nonetheless, in cases where the compiler options are required, you need to alter the CREATE DDL to use the object-code creation method. Follow these general steps.

1: Compile the source code into an object file on a Linux system with the same OS software version (SLES10 or SLES11) as the target Teradata system, using the following gcc command:

    gcc -fpic -I /usr/tdbms/etc -g -O3 -c ./cm_solve.cpp

 2: Modify the create DDL to specify that the table operator will be created from object code.

REPLACE FUNCTION cm_solve()
 RETURNS TABLE VARYING USING FUNCTION cm_solve_contract
 LANGUAGE CPP NO SQL
 PARAMETER STYLE SQLTABLE
 EXTERNAL NAME 'CO!cm_solve!/tmp/cm_solve.o'

 CM_Solve DML

CM_Solve always operates in a global mode, so you must specify a partition attribute even if there is only a single system of equations to be solved. CM_Solve takes the output of the CalcMatrix LOCAL phase, combines the per-AMP matrix subtotals into a single matrix, and then uses either the Cholesky or SVD algorithm to solve the system of equations. The input to CM_Solve consists of one or more matrix identifier columns, one or more independent variable columns, and a single dependent variable column. The matrix identifier column(s) must be listed in the MATRIX_IDS USING clause, and the last input column always designates the dependent variable.

CM_Solve USING clause definitions:

    1. MATRIX_IDS: A list of one or more input column names that have been used to partition the input data.
    2. CALCTYPE: The calculation type to be performed; currently only ‘solve’ is supported.

The LOCAL phase of CalcMatrix, USING PHASE('LOCAL'), processes the input data that exists on each AMP and outputs rows of summarized data for that AMP. In the case of multiple systems of equations, the input to the LOCAL phase must consist of one or more matrix identifier columns, one or more independent variable columns, and a single dependent variable column. The matrix identifier columns must be placed in the LOCAL ORDER BY list so that the data arrives ordered by system-of-equations identifiers.

Example invocation with a single system of equations:

SELECT * FROM CM_Solve(                                -- invoke the CM_Solve table operator
   ON ( SELECT 1 as partno, d1.* FROM CalcMatrix(   -- invoke the CalcMatrix table operator
          ON( SELECT * FROM bigcolumns)                -- Select all of the columns and rows from the input table
          USING PHASE('LOCAL')                         -- designate the CalcMatrix phase
          ) as D1
     )
   PARTITION BY partno                                 -- partition by partno, a constant
   USING MATRIX_IDS('partno')                          -- define the matrix identifier columns
                CALCTYPE('solve')                      -- define the calctype
) D2 

Example invocation with multiple systems of equations:

 SELECT * FROM CM_Solve(                      -- invoke the CM_Solve table operator
   ON ( SELECT * FROM CalcMatrix(        -- invoke the CalcMatrix table operator
          ON( SELECT * FROM bigdata)       -- Select all of the columns and rows from the input table
          LOCAL ORDER BY partno              -- order the ON clause data by “partno” column
          USING PHASE('LOCAL')                -- designate the CalcMatrix phase
          ) as D1
     )
   PARTITION BY partno                        -- partition the ON clause data by “partno” column
   USING MATRIX_IDS('partno')              -- define the matrix identifier columns(s)
                CALCTYPE('solve')             -- define the calctype
) D2 

See appendix A for details on the CM_Solve implementation.

 Correctness Testing

To validate the correctness of CalcMatrix and CM_Solve we compared the results to the R software package. Specifically, we used the R lm function result values for the model coefficients and the coefficient of determination. The coefficient of determination, called R Squared, indicates how well the input data fits the model: it provides a measure of how well observed outcomes are described by the model. R Squared ranges in value from 0 to 1, with lower values meaning the linear model does not fit the input data well.

We used the R interpreter to control execution of 500 tests, with 1 to 500 independent variables and a random input observation count between 1,000 and 4,000 values. We used the Teradata RODBC package and the Teradata ODBC driver to submit the queries to the Teradata database. When comparing for correctness we used a per-value error tolerance of 1e-8. For example, the following R code fragment checks for an error in the R Squared value with a tolerance of 1e-8:

DELTA <- 1e-8
delta <-  abs ( (RRsquared - TRsquared) / RRsquared)
 if (delta > DELTA) {
    print (gettextf("R Squared Error, delta is %1.18f, R=%1.18f, Teradata=%1.18f",delta,RRsquared,TRsquared))
     stop()
 } 

 An example output for 8 independent variables follows:

[1] "Loop 8, input data set size is 3386 -------"
[1] "R Model - formula (y ~ x1+x2+x3+x4+x5+x6+x7+x8)"
[1] "R Squared is 0.001982875583"
[1] "R Model coefficients"
    [1] "4951.983628870" "0.006819239"    "0.019203730"    "-0.028306168" 
    [5] "0.014890097"    "-0.021459214"   "0.010873410"    "0.002893754"  
    [9] "-0.005675700" 
[1] "Teradata Model"
[1] "R Squared is 0.001982875583"
[1] "Teradata Model coefficients"
    [1] "4951.983628870" "0.006819239"    "0.019203730"    "-0.028306168" 
    [5] "0.014890097"    "-0.021459214"   "0.010873410"    "0.002893754"  
    [9] "-0.005675700" 

Background on calculating R Squared on Teradata: assume the data has dependent variable values yi, each of which has an associated modeled value mi. The values yi are called the observed values and the modeled values mi are the predicted values. Also let "y bar" denote the mean of the y values. The general definition of R Squared, the coefficient of determination, is

    R Squared = 1 - SSres / SStot

where SSres = SUM((yi - mi)**2) is the residual sum of squares and SStot = SUM((yi - ybar)**2) is the total sum of squares.

The SQL representation to calculate R Squared, with the 8 variable test case coefficients substituted in the modeled value expression, would be as follows:

SELECT
    1 -
    -- SSres
    ( SUM((y - (4951.98362886991 + x1 * 0.006819238823883900 + x2 * 0.019203730297631599 + x3 * -0.028306168031164002 +
                x4 * 0.014890097413664800 + x5 * -0.021459213638044299 + x6 * 0.010873410360648900 +
                x7 * 0.002893753713620240 + x8 * -0.005675700085017590))**2)
    /
    -- SStot
      SUM((y - (SELECT AVG(y) FROM indata8))**2) )
FROM indata8

As a final test we did a visual comparison between a single independent variable R model and a Teradata model:

[1] "Loop 1, input data set size is 2071 -------"
[1] "R Model - formula (y ~ x1)"
[1] "R Squared is 0.582239089150"
[1] "R Model coefficients"
[1] "241.021259378" "0.760719121"
[1] "Teradata Model"
[1] "R Squared is 0.582239089150"
[1] "Teradata Model coefficients"
[1] "241.021259378" "0.760719121"
[1] "Compare Models ... "

The Teradata model regression line (orange) is overlaid with the R model regression line (blue, dashed).

 Performance Testing

We chose to test two use cases: A) a large number of independent systems of equations with a small number of variables, and B) a single system of equations with a large number of variables. In both test cases CM_Solve was executed in protected mode.

In test case A we solve 100,000 systems of equations, where each system consists of 1,000 rows with 24 independent variables. The input table has 100,000,000 rows, and the test was performed on a two-node 6750 running Teradata Database release 14.10.

Input data table DDL:

CREATE TABLE bigdata (partno integer
     ,x1 float, x2 float, x3 float, x4 float, x5 float, x6 float, x7 float, x8 float, x9 float, x10 float
     ,x11 float, x12 float, x13 float, x14 float, x15 float, x16 float, x17 float, x18 float, x19 float, x20 float
     ,x21 float, x22 float, x23 byteint, x24 decimal(18,2)
     ,y float
     ) PRIMARY INDEX (partno);

Query:

As in the earlier example, the input to the LOCAL phase consists of the matrix identifier columns (in this case the single partno column), the independent variable columns (x1 through x24), and a single dependent variable column (y). The partno column is placed in the LOCAL ORDER BY list so that the data arrives ordered by system-of-equations identifier, and in the MATRIX_IDS USING clause of CM_Solve; the last input column designates the dependent variable.

SELECT * FROM CM_Solve(                  -- invoke the CM_Solve table operator
   ON ( SELECT * FROM CalcMatrix(        -- invoke the CalcMatrix table operator
          ON( SELECT * FROM bigdata)     -- Select all of the columns and rows from the input table
          LOCAL ORDER BY partno          -- order the ON clause data by “partno” column
          USING PHASE('LOCAL')           -- designate the CalcMatrix phase
          ) as D1
     )
   PARTITION BY partno                   -- partition the ON clause data by “partno” column
   USING MATRIX_IDS('partno')            -- define the matrix identifier columns(s)
                 CALCTYPE('solve')       -- define the calctype
) D2 

Explain Text for the query

  1) First, we lock a distinct MWW."pseudo table" for read on a RowHash
     to prevent global deadlock for MWW.bigdata.
  2) Next, we lock MWW.bigdata for read.
  3) We do an all-AMPs RETRIEVE step from MWW.bigdata by way of an
     all-rows scan with no residual conditions into Spool 1 (used to
     materialize view, derived table, table function or table operator
     TblOpInputSpool) (all_amps), which is built locally on the AMPs.
     Then we do a SORT to order Spool 1 by the sort key in spool field1.
     The size of Spool 1 is estimated with high confidence to be 501
     rows (109,218 bytes).  The estimated time for this step is 0.09
     seconds.
  4) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by way of
     an all-rows scan executing table operator TD_SYSFNLIB.calcmatrix
     with a condition of ("(1=1)") into Spool 2 (used to materialize
     view, derived table, table function or table operator D1)
     (all_amps), which is built locally on the AMPs.  The size of Spool
     2 is estimated with high confidence to be 501 rows (209,418 bytes).
     The estimated time for this step is 0.09 seconds.
  5) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
     an all-rows scan into Spool 3 (used to materialize view, derived
     table, table function or table operator TblOpInputSpool)
     (all_amps), which is redistributed by hash code to all AMPs.  Then
     we do a SORT to order Spool 3 by the sort key in spool field1.
     The size of Spool 3 is estimated with high confidence to be 501
     rows (213,426 bytes).  The estimated time for this step is 0.10
     seconds.
  6) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
     an all-rows scan executing table operator MWW.cm_solve with a
     condition of ("(1=1)") into Spool 4 (used to materialize view,
     derived table, table function or table operator D2) (all_amps),
     which is built locally on the AMPs.  The size of Spool 4 is
     estimated with high confidence to be 501 rows (112,725 bytes).
     The estimated time for this step is 0.09 seconds.
  7) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
     an all-rows scan into Spool 5 (group_amps), which is built locally
     on the AMPs.  The size of Spool 5 is estimated with high
     confidence to be 501 rows (291,582 bytes).  The estimated time for
     this step is 0.09 seconds.
  8) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 5 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.47 seconds.

 

DBQL Performance Data

From the explain you can see that CalcMatrix executes in step 4 and CM_Solve in step 6. From the DBQL data it is clear that CalcMatrix dominates both the elapsed time and the CPU resource consumption. The original 100,000,000 input rows are reduced to 2,500,000 rows (100,000 systems of 25x25 matrices) by CalcMatrix, and then to the final 100,000 rows of solution coefficients by CM_Solve.

 

Row  QueryID                  Elapsed          sl1  sl2  StepName  CPUTime   IOcount     RowCount
1    307,182,118,593,413,097  00:00:00.000000  1    0    MLK       0.00      0.00        1.00
2    307,182,118,593,413,097  00:00:00.000000  2    0    MLK       0.02      0.00        72.00
3    307,182,118,593,413,097  00:00:04.010000  3    0    RET       225.90    527,522.00  100,000,000.00
4    307,182,118,593,413,097  00:00:30.320000  4    0    RET       1,747.67  53,747.00   2,500,000.00
5    307,182,118,593,413,097  00:00:00.260000  5    0    RET       8.24      7,213.00    2,500,000.00
6    307,182,118,593,413,097  00:00:00.190000  6    0    RET       11.17     3,468.00    100,000.00
7    307,182,118,593,413,097  00:00:00.020000  7    0    RET       0.94      1,877.00    100,000.00
8    307,182,118,593,413,097  00:00:00.000000  8    0    Edt       0.12      0.00        72.00

In test case B we solve a single system of equations with 1,000,000 rows, 999 independent variables, and a single dependent variable. The test was performed on the same two-node 6750 running Teradata Database release 14.10. It should be noted that the computational complexity is proportional to (number of variables)² * (number of input rows).

 Query:

In this case the constant 1 is used as the partno for CM_Solve:

SELECT * FROM CM_Solve(                              -- invoke the CM_Solve table operator
   ON ( SELECT 1 as partno, d1.* FROM CalcMatrix(    -- invoke the CalcMatrix table operator
          ON( SELECT * FROM bigcolumns)              -- Select all of the columns and rows from the input table
          USING PHASE('LOCAL')
          ) as D1
     )
   PARTITION BY partno
   USING MATRIX_IDS('partno')
          CALCTYPE('solve')          -- define the calctype
) D2 

Explain Text for the query:

1) First, we lock a distinct MWW."pseudo table" for read on a RowHash
     to prevent global deadlock for MWW.bigcolumns.
  2) Next, we lock MWW.bigcolumns for read.
  3) We do an all-AMPs RETRIEVE step from MWW.bigcolumns by way of an
     all-rows scan with no residual conditions executing table operator
     TD_SYSFNLIB.calcmatrix with a condition of ("(1=1)") into Spool 2
     (used to materialize view, derived table, table function or table
     operator D1) (all_amps), which is built locally on the AMPs.  The
     size of Spool 2 is estimated with high confidence to be 49,990
     rows (810,537,860 bytes).  The estimated time for this step is
     0.79 seconds.
  4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
     an all-rows scan into Spool 3 (used to materialize view, derived
     table, table function or table operator TblOpInputSpool)
     (all_amps), which is redistributed by hash code to all AMPs.  Then
     we do a SORT to order Spool 3 by the sort key in spool field1.
     The size of Spool 3 is estimated with high confidence to be 49,990
     rows (810,987,770 bytes).  The estimated time for this step is
     3.29 seconds.
  5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
     an all-rows scan executing table operator MWW.cm_solve with a
     condition of ("(1=1)") into Spool 4 (used to materialize view,
     derived table, table function or table operator D2) (all_amps),
     which is built locally on the AMPs.  The size of Spool 4 is
     estimated with high confidence to be 49,990 rows (800,939,780
     bytes).  The estimated time for this step is 0.79 seconds.
  6) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
     an all-rows scan into Spool 5 (group_amps), which is built locally
     on the AMPs.  The size of Spool 5 is estimated with high
     confidence to be 49,990 rows (2,200,809,750 bytes).  The estimated
     time for this step is 0.79 seconds.
  7) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 5 are sent back to the user as the result of
     statement 1.  The total estimated time is 5.67 seconds.

DBQL Performance Data:

From the explain you can see that CalcMatrix executes in step 3 and CM_Solve in step 5. From the DBQL data it is clear that CalcMatrix dominates both the elapsed time and the CPU resource consumption. The original 1,000,000 input rows are reduced to 72,000 rows (one 1000x1000 matrix on each of the 72 AMPs) by CalcMatrix, and then to the final single row of solution coefficients by CM_Solve. Note that step 5 executes on a single AMP.

 

Row  QueryID                  Elapsed          sl1  sl2  StepName  CPUTime   IOcount    RowCount
1    307,192,118,593,412,145  00:00:00.000000  1    0    MLK       0.00      0.00       1.00
2    307,192,118,593,412,145  00:00:00.000000  2    0    MLK       0.02      0.00       72.00
3    307,192,118,593,412,145  00:01:24.650000  3    0    RET       5,324.49  33,545.00  72,000.00
4    307,192,118,593,412,145  00:00:01.690000  4    0    RET       4.10      7,562.00   72,000.00
5    307,192,118,593,412,145  00:00:03.550000  5    0    RET       4.23      1,572.00   1.00
6    307,192,118,593,412,145  00:00:00.000000  6    0    RET       0.04      236.00     1.00
7    307,192,118,593,412,145  00:00:00.000000  7    0    Edt       0.02      0.00       72.00

Appendix A: CM_Solve algorithm details

 See attached file cm_solve_algo.zip.

References:

[1] http://developer.teradata.com/extensibility/articles/in-database-linear-regression-using-the-calcmat...

[2] CalcMatrix: Chapter 26, SQL Functions, Operators, Expressions, and Predicates Release 14.10 B035-1145-112A May 2013