In a prior article [1] we described how to use the Teradata 14.10 CalcMatrix operator and R to perform a multiple variable linear regression analysis. This article extends that concept with a comprehensive in database solution by introducing a new in database table operator named “CM_Solve”. This approach has value in cases when you want to solve a large number of independent systems of equations or you simply do NOT want to use the R client for solving the system of equations based on the SSCP matrix.
CM_Solve is a user defined table operator that needs to be installed on the system prior to use. It takes as input the output from the CalcMatrix table operator and returns the solution coefficients for the linear model that best fits the input system of equations. The following describes the create DDL and the invocation DML.
CM_Solve DDL
The create DDL is as follows:
REPLACE FUNCTION CM_Solve()
RETURNS TABLE VARYING USING FUNCTION cm_solve_contract
LANGUAGE CPP NO SQL
PARAMETER STYLE SQLTABLE
EXTERNAL NAME 'CS!cm_solve!./cm_solve.cpp'
The source code for CM_Solve is in the attached cm_solve.cpp file. The performance of CM_Solve can be enhanced by using the –O3 gcc compiler option. As you can review in the performance testing section CM_Solve is usually only a small component of the overall processing time. Nonetheless in cases when required, to use the compiler options you need to alter the create DDL to use the object code create method. Follow these general steps.
1: compile the source code into an object file on a Linux system with similar OS software version (SLES10, SLES11) as the target Teradata system OS version. Gcc compile command
gcc -fpic -I /usr/tdbms/etc -g -O3 -c ./cm_solve.cpp
2: Modify the create DDL to specify that the table operator will be created from object code.
REPLACE FUNCTION cm_solve()
RETURNS TABLE VARYING USING FUNCTION cm_solve_contract
LANGUAGE CPP NO SQL
PARAMETER STYLE SQLTABLE
EXTERNAL NAME 'CO!cm_solve!/tmp/cm_solve.o'
CM_Solve DML
CM_Solve always operates in a global mode so you must specify a partition attribute, even if there is only a single system of equations to be solved. CM_Solve processes the output of the CalcMatrix LOCAL phase and combines the per AMP matrix subtotals into a single matrix and then uses either the Cholesky or SVD algorithms to solve the system of equations. The input to CM_Solve consists of one or more matrix identifier columns and 1 or more independent variable columns and a single dependent variable column. The matrix identifier column(s) must be placed in the USING MATRIX_IDS using clause and the last input column always designates the dependent variable.
CM_Solve USING clause definitions:
The LOCAL phase of CalcMatrix, USING PHASE('LOCAL'), processes the input data that exists on each AMP and outputs rows of summarized data for that AMP. In the case of multiple system of equations the input to the LOCAL phase must consist of one or more matrix identifier columns and 1 or more independent variable columns and a single dependent variable column. The matrix identifier columns must be placed in the LOCAL ORDER BY list so that the data arrives ordered by system of equations identifiers.
Example invocation with a single system of equations:
SELECT * FROM CM_Solve( -- invoke the CM_Solve table operator
ON ( SELECT 1 as partno, d1.* FROM CalcMatrix( -- invoke the CalcMatrix table operator
ON( SELECT * FROM bigcolumns) -- Select all of the columns and rows from the input table
USING PHASE('LOCAL') -- designate the CalcMatrix phase
) as D1
)
PARTITION BY partno -- partition by partno, a constant
USING MATRIX_IDS('partno') -- define the matrix identifier columns
CALCTYPE('solve') -- define the calctype
) D2
Example invocation with multiple systems of equations:
SELECT * FROM CM_Solve( -- invoke the CM_Solve table operator
ON ( SELECT * FROM CalcMatrix( -- invoke the CalcMatrix table operator
ON( SELECT * FROM bigdata) -- Select all of the columns and rows from the input table
LOCAL ORDER BY partno -- order the ON clause data by “partno” column
USING PHASE('LOCAL') -- designate the CalcMatrix phase
) as D1
)
PARTITION BY partno -- partition the ON clause data by “partno” column
USING MATRIX_IDS('partno') -- define the matrix identifier columns(s)
CALCTYPE('solve') -- define the calctype
) D2
See appendix A for details on the CM_Solve implementation.
To validate the correctness of CalcMatrix and CM_Solve we compared the results to the R software packages. Specifically we used the R LM function result values for the model coefficients and the coefficient of determination. The coefficient of determination, called R Squared, indicates how well the input data fits the model. It provides a measure of how well observed outcomes are described by the model. R squared ranges in value from 0 to 1 with lower values meaning the linear model does not have a good fit for the input data.
We used the R interpreter to control execution of 500 tests with 1 to 500 independent variables with a random input observation count between 1000 and 4000 values. We used the Teradata RODBC package and the Teradata ODBC driver to submit the queries to the Teradata database. When comparing for correctness we used a per value error tolerance of 1e-8. For example the following R code fragment would check for an error in the R Squared value with a tolerance of 1e-8
DELTA <- 1e-8
delta <- abs ( (RRsquared - TRsquared) / RRsquared)
if (delta > DELTA) {
print (gettextf("R Squared Error, delta is %1.18f, R=%1.18f, Teradata=%1.18f",delta,RRsquared,TRsquared))
stop()
}
An example output for 8 independent variables follows:
[1] "Loop 8, input data set size is 3386 -------"
[1] "R Model - formula (y ~ x1+x2+x3+x4+x5+x6+x7+x8)"
[1] "R Squared is 0.001982875583"
[1] "R Model coefficients"
[1] "4951.983628870" "0.006819239" "0.019203730" "-0.028306168"
[5] "0.014890097" "-0.021459214" "0.010873410" "0.002893754"
[9] "-0.005675700"
[1] "Teradata Model"
[1] "R Squared is 0.001982875583"
[1] "Teradata Model coefficients"
[1] "4951.983628870" "0.006819239" "0.019203730" "-0.028306168"
[5] "0.014890097" "-0.021459214" "0.010873410" "0.002893754"
[9] "-0.005675700"
Background on calculating R squared on Teradata. Assume the data has dependent variable values yi, each of which has an associated modeled value mi. The values yi are called the observed values and the modeled values mi are the predicted values. Also assume "y bar" is the mean of the y values. The general definition of R Squared, the coefficient of determination is
The SQL representation to calculate R Squared, with the 8 variable test case coefficients substituted in the modeled value expression, would be as follows:
SELECT
1 –
// SSres
( SUM((y - (4951.98362886991 + x1 * 0.006819238823883900 + x2 * 0.019203730297631599 + x3 * -0.028306168031164002 +
x4 * 0.014890097413664800 + x5 * -0.021459213638044299 + x6 * 0.010873410360648900 +
x7 * 0.002893753713620240 + x8 * -0.005675700085017590))**2) /
//SStot
SUM ((y- (SELECT AVG(y) FROM indata8))**2) )
FROM indata8
As a final test we did a visualization comparison between a single independent variable R model and a Teradata model
[1] "Loop 1, input data set size is 2071 -------"
[1] "R Model - formula (y ~ x1)"
[1] "R Squared is 0.582239089150"
[1] "R Model coefficients"
[1] "241.021259378" "0.760719121"
[1] "Teradata Model"
[1] "R Squared is 0.582239089150"
[1] "Teradata Model coefficients"
[1] "241.021259378" "0.760719121"
[1] "Compare Models ... "
Teradata model regression line in Orange and the R model dashed line in blue is overlaid
We choose to test two use cases A) a large number of independent systems of equations with a small number of variables and B) a single system of equations with a large number of variables. In both test cases CM_Solve was executed in protected mode.
In test case A we solve 100,000 systems of equations where each system consists of 1000 rows with 24 independent variables. The input table has 100,000,000 rows and the test was performed on a two node 6750 running Teradata database release 14.10.
Input data table DDL:
CREATE TABLE bigdata (partno integer
,x1 float, x2 float, x3 float, x4 float, x5 float, x6 float, x7 float, x8 float, x9 float, x10 float
,x11 float, x12 float, x13 float, x14 float, x15 float, x16 float, x17 float, x18 float, x19 float, x20 float
,x21 float, x22 float, x23 byteint, x24 decimal(18,2)
,y float
) PRIMARY INDEX (partno);
Query:
In the case of multiple system of equations the input to the LOCAL phase must consist of one or more matrix identifier columns (in this case the single partno column) and 1 or more independent variable columns (in this case x1 – x24) and a single dependent variable column (in this case Y). The matrix identifier columns must be placed in the LOCAL ORDER BY list so that the data arrives ordered by system of equations identifiers.
In the case of multiple system of equations the input to CM_Solve consist of a varying matrix identifier columns (in this case the partno column) and 1 or more independent variable columns and a single dependent variable column . The matrix identifier columns must be placed in the USING MATRIX_IDS using clause and the last input column is ways designated the dependent variable.
SELECT * FROM CM_Solve( -- invoke the CM_Solve table operator
ON ( SELECT * FROM CalcMatrix( -- invoke the CalcMatrix table operator
ON( SELECT * FROM bigdata) -- Select all of the columns and rows from the input table
LOCAL ORDER BY partno -- order the ON clause data by “partno” column
USING PHASE('LOCAL') -- designate the CalcMatrix phase
) as D1
)
PARTITION BY partno -- partition the ON clause data by “partno” column
USING MATRIX_IDS('partno') -- define the matrix identifier columns(s)
CALCTYPE('solve') -- define the calctype
) D2
Explain Text for the query
1)First, we lock a distinct MWW."pseudo table" for read on a RowHash
to prevent global deadlock for MWW.bigdata.
2) Next, we lock MWW.bigdata for read.
3) We do an all-AMPs RETRIEVE step from MWW.bigdata by way of an
all-rows scan with no residual conditions into Spool 1 (used to
materialize view, derived table, table function or table operator
TblOpInputSpool) (all_amps), which is built locally on the AMPs.
Then we do a SORT to order Spool 1 by the sort key in spool field1.
The size of Spool 1 is estimated with high confidence to be 501
rows (109,218 bytes). The estimated time for this step is 0.09
seconds.
4) We do an all-AMPs RETRIEVE step from Spool 1 (Last Use) by way of
an all-rows scan executing table operator TD_SYSFNLIB.calcmatrix
with a condition of ("(1=1)") into Spool 2 (used to materialize
view, derived table, table function or table operator D1)
(all_amps), which is built locally on the AMPs. The size of Spool
2 is estimated with high confidence to be 501 rows (209,418 bytes).
The estimated time for this step is 0.09 seconds.
5) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 3 (used to materialize view, derived
table, table function or table operator TblOpInputSpool)
(all_amps), which is redistributed by hash code to all AMPs. Then
we do a SORT to order Spool 3 by the sort key in spool field1.
The size of Spool 3 is estimated with high confidence to be 501
rows (213,426 bytes). The estimated time for this step is 0.10
seconds.
6) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan executing table operator MWW.cm_solve with a
condition of ("(1=1)") into Spool 4 (used to materialize view,
derived table, table function or table operator D2) (all_amps),
which is built locally on the AMPs. The size of Spool 4 is
estimated with high confidence to be 501 rows (112,725 bytes).
The estimated time for this step is 0.09 seconds.
7) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
an all-rows scan into Spool 5 (group_amps), which is built locally
on the AMPs. The size of Spool 5 is estimated with high
confidence to be 501 rows (291,582 bytes). The estimated time for
this step is 0.09 seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 5 are sent back to the user as the result of
statement 1. The total estimated time is 0.47 seconds.
DBQL Performance Data
From the explain you can see that CalcMatrix executes in step 4 and CM_Solve executes in step 6. From the DBQL it is obvious the CalcMatrix dominates the elapsed time and the CPU resource consumption. Also the original input 100,000,000 rows are reduced to 2,500,000 (100,000 systems of 25x25 matrices) rows by CalcMatrix and to the final 100,000 row solution coefficients by CM_Solve.
QueryID |
Elapsed |
sl1 |
sl2 |
StepName |
CPUTime |
IOcount |
RowCount |
|
---|---|---|---|---|---|---|---|---|
1 |
307,182,118,593,413,097 |
00:00:00.000000 |
1 |
0 |
MLK |
0.00 |
0.00 |
1.00 |
2 |
307,182,118,593,413,097 |
00:00:00.000000 |
2 |
0 |
MLK |
0.02 |
0.00 |
72.00 |
3 |
307,182,118,593,413,097 |
00:00:04.010000 |
3 |
0 |
RET |
225.90 |
527,522.00 |
100,000,000.00 |
4 |
307,182,118,593,413,097 |
00:00:30.320000 |
4 |
0 |
RET |
1,747.67 |
53,747.00 |
2,500,000.00 |
5 |
307,182,118,593,413,097 |
00:00:00.260000 |
5 |
0 |
RET |
8.24 |
7,213.00 |
2,500,000.00 |
6 |
307,182,118,593,413,097 |
00:00:00.190000 |
6 |
0 |
RET |
11.17 |
3,468.00 |
100,000.00 |
7 |
307,182,118,593,413,097 |
00:00:00.020000 |
7 |
0 |
RET |
0.94 |
1,877.00 |
100,000.00 |
8 |
307,182,118,593,413,097 |
00:00:00.000000 |
8 |
0 |
Edt |
0.12 |
0.00 |
72.00 |
In test case B we solve a single systems of equations with 1,000,000 rows with 999 independent variables and a single dependent variable. The test was performed on the same two node 6750 running Teradata database release 14.10. It should be noted the computation complexity is proportional to (number of variables)2 * number row input rows
Query:
In this case the constant 1 is used as partno for CM_Solve
SELECT * FROM CM_Solve( -- invoke the CM_Solve table operator
ON ( SELECT 1 as partno, d1.* FROM CalcMatrix( -- invoke the CalcMatrix table operator
ON( SELECT * FROM bigcolumns) -- Select all of the columns and rows from the input table
USING PHASE('LOCAL')
) as D1
)
PARTITION BY partno
USING MATRIX_IDS('partno')
CALCTYPE('solve') -- define the calctype
) D2
Explain Text for the query:
1) First, we lock a distinct MWW."pseudo table" for read on a RowHash
to prevent global deadlock for MWW.bigcolumns.
2) Next, we lock MWW.bigcolumns for read.
3) We do an all-AMPs RETRIEVE step from MWW.bigcolumns by way of an
all-rows scan with no residual conditions executing table operator
TD_SYSFNLIB.calcmatrix with a condition of ("(1=1)") into Spool 2
(used to materialize view, derived table, table function or table
operator D1) (all_amps), which is built locally on the AMPs. The
size of Spool 2 is estimated with high confidence to be 49,990
rows (810,537,860 bytes). The estimated time for this step is
0.79 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 3 (used to materialize view, derived
table, table function or table operator TblOpInputSpool)
(all_amps), which is redistributed by hash code to all AMPs. Then
we do a SORT to order Spool 3 by the sort key in spool field1.
The size of Spool 3 is estimated with high confidence to be 49,990
rows (810,987,770 bytes). The estimated time for this step is
3.29 seconds.
5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan executing table operator MWW.cm_solve with a
condition of ("(1=1)") into Spool 4 (used to materialize view,
derived table, table function or table operator D2) (all_amps),
which is built locally on the AMPs. The size of Spool 4 is
estimated with high confidence to be 49,990 rows (800,939,780
bytes). The estimated time for this step is 0.79 seconds.
6) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
an all-rows scan into Spool 5 (group_amps), which is built locally
on the AMPs. The size of Spool 5 is estimated with high
confidence to be 49,990 rows (2,200,809,750 bytes). The estimated
time for this step is 0.79 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 5 are sent back to the user as the result of
statement 1. The total estimated time is 5.67 seconds.
DBQL Performance Data:
From the explain you can see that CalcMatrix executes in step 3 and CM_Solve executes in step 5. From the DBQL it is obvious the CalcMatrix dominates the elapsed time and the CPU resource consumption. Also the original input 1,000,000 rows are reduced to 72,000 (1 systems of 1000x1000 matrix on each of the 72 AMPs) rows by CalcMatrix and to the final 1 row solution coefficients by CM_Solve. Note step 5 will be executed on a single AMP.
QueryID |
Elapsed |
sl1 |
sl2 |
StepName |
CPUTime |
IOcount |
RowCount |
|
---|---|---|---|---|---|---|---|---|
1 |
307,192,118,593,412,145 |
00:00:00.000000 |
1 |
0 |
MLK |
0.00 |
0.00 |
1.00 |
2 |
307,192,118,593,412,145 |
00:00:00.000000 |
2 |
0 |
MLK |
0.02 |
0.00 |
72.00 |
3 |
307,192,118,593,412,145 |
00:01:24.650000 |
3 |
0 |
RET |
5,324.49 |
33,545.00 |
72,000.00 |
4 |
307,192,118,593,412,145 |
00:00:01.690000 |
4 |
0 |
RET |
4.10 |
7,562.00 |
72,000.00 |
5 |
307,192,118,593,412,145 |
00:00:03.550000 |
5 |
0 |
RET |
4.23 |
1,572.00 |
1.00 |
6 |
307,192,118,593,412,145 |
00:00:00.000000 |
6 |
0 |
RET |
0.04 |
236.00 |
1.00 |
7 |
307,192,118,593,412,145 |
00:00:00.000000 |
7 |
0 |
Edt |
0.02 |
0.00 |
72.00 |
See attached file cm_solve_algo.zip.
[2] CalcMatrix: Chapter 26, SQL Functions, Operators, Expressions, and Predicates Release 14.10 B035-1145-112A May 2013