## In-database linear regression using the CalcMatrix table operator

## Extensibility

Extensibility covers the mechanisms by which you, as a user or developer, can extend the functionality of the Teradata Database, for example with User Defined Functions (UDFs).

## Linear Regression

In statistics, linear regression is an approach to model the relationship between a scalar dependent variable y and one or more independent variables denoted x. Linear regression is one of the oldest and most fundamental types of analysis in statistics. The British scientist Sir Francis Galton originally developed it in the latter part of the 19th century. The term "regression" derives from the nature of his original study in which he found that the children of both tall and short parents tend to "revert" or "regress" toward average heights.

The case of one independent variable is called simple linear regression. With more than one independent variable, it is called multiple linear regression. This terminology should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted rather than a single dependent variable. Multiple linear regression analysis attempts to predict, or estimate, the value of a dependent variable as a linear combination of independent variables, usually with a constant term included. That is, it attempts to find the b-coefficients in the following equation that best predict the value of the dependent variable y based on the independent variables x1 to xn:

y' = b0 + b1x1 + … + bnxn

where y' denotes the predicted y value.

The best values of the b coefficients are defined to be the values that minimize the sum of squared errors Σ(y - y')^2 over all the observations. Note that this requires the actual value of y to be known for each observation, in order to contrast it with the predicted value y'. This technique is called "least squares" (or "least squared errors").
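As a minimal illustration of the least-squares criterion, the following plain-Python sketch (using a tiny made-up dataset, not the example data used later in this article) compares the sum of squared errors for two candidate coefficient pairs in the one-variable case y' = b0 + b1*x:

```python
# Tiny illustrative dataset (hypothetical): y is exactly 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

def sse(b0, b1, xs, ys):
    """Sum of squared errors sum((y - y')^2) for predictions y' = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

sse_exact = sse(1.0, 2.0, xs, ys)  # the true generating coefficients: SSE = 0
sse_other = sse(0.0, 2.5, xs, ys)  # an arbitrary competing fit: larger SSE
# least squares selects the coefficients with the smaller SSE
```

The fitted coefficients are exactly the values that drive this quantity to its minimum over all observations.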

Teradata 14.10 introduces a new style of function named table operators. The table operator feature was originally motivated by requirements driven by in-database advanced analytics use cases. In addition, table operators enable a simplified in-database Map/Reduce style programming model.

A table operator can be system defined or user defined. Teradata release 14.10 introduces new system-defined table operators, including:

• TD_UNPIVOT, which transforms columns into rows based on the syntax of the unpivot expression.

• CalcMatrix, which calculates a Correlation, Covariance, or Sums of Squares and Cross Products matrix.

The following sections explain how to use CalcMatrix to perform an in-database linear regression.

## Matrix Algebra

The b-coefficient values that minimize the sum of squared errors can be solved for using linear algebra. The following describes the technique of using an R client invoking CalcMatrix to perform the linear regression analysis. The discussion also introduces the concept of a cross products matrix and its relatives, the covariance matrix and the correlation matrix, which are important in multivariate statistical analysis. To minimize the sum of squared errors, the equation for the sum of squared errors is expanded using the equation for the estimated y value. This leads to n simultaneous equations in n unknowns, which are commonly referred to as the normal equations. For the two-independent-variable case used in the example below, they are:

Σy   = b0*n   + b1*Σx1   + b2*Σx2
Σx1y = b0*Σx1 + b1*Σx1^2 + b2*Σx1x2
Σx2y = b0*Σx2 + b1*Σx1x2 + b2*Σx2^2

These equations have been presented in a way that hints at how they can be solved using matrix algebra: by first computing the extended Sum-of-Squares-and-Cross-Products (SSCP) matrix for the constant 1 and the variables x1, x2 and y. Doing so yields all of the Σ terms in the normal equations above. The CalcMatrix table operator can be used to build the extended SSCP matrix directly in the Teradata database. See below for details on the extended SSCP matrix.

The linear regression equation can then be rewritten using a matrix algebra representation as

B = (X'X)^-1 X'Y

Where

• B -> vector of the b coefficients
• X'X -> X is the normal equation independent variable observation values matrix. X' is the transpose of the X matrix. X' is multiplied, using matrix multiplication, by X, and the result is then inverted, as denoted by the ^-1.
• X'Y -> X' is the transpose of the normal equation independent variable observation values matrix and Y is the normal equation dependent variable observation values matrix.
• Note in the R language %*% is used to represent matrix multiplication.

The elements of the above equation, X'X and X'Y, are contained within the extended SSCP matrix generated by CalcMatrix. Once the extended SSCP matrix is generated we can extract the X'X and X'Y matrices, invert X'X, and then matrix multiply it by X'Y. From a processing perspective, the part of a linear regression that is performed in-database, and that requires access to the detailed observation data, is the building of the extended SSCP matrix. The remaining matrix algebra is performed on an R "client" machine. It should be noted that from a compute cost perspective, generating the extended SSCP matrix dominates the cost. For example, in the case of 1,000,000,000 observations and 100 variables, the in-database data set size is approximately 800,000,000,000 bytes and the number of floating point calculations is approximately 10 trillion (10^9 * 100^2). On the other hand, the resulting client data set size is 80,000 bytes; that is, 100 rows of 100 floating point columns are returned to the client. In general, the size of the CalcMatrix result set returned to the client is N rows of N columns, where N is the number of variables. CalcMatrix supports up to the Teradata column limit of 2048 variables.

The following simple example will be used to describe the computation steps.

Input data set, 6 rows and 3 columns. x1 and x2 are the independent variables and y is the dependent variable.

| x1 | x2 | y |
|----|----|----|
| 1 | 2 | 5 |
| 2 | 7 | 14 |
| 3 | 6 | 15 |
| 4 | 15 | 20 |
| 5 | 10 | 25 |
| 6 | 12 | 30 |

The 6x3 normal equations Matrix X, including the constant 1 as the first variable b0, would be as follows:

| b0 | b1x1 | b2x2 | Y |
|----|------|------|---|
| 1 | 1 | 2 | 5 |
| 1 | 2 | 7 | 14 |
| 1 | 3 | 6 | 15 |
| 1 | 4 | 15 | 20 |
| 1 | 5 | 10 | 25 |
| 1 | 6 | 12 | 30 |

The transpose of X, X', would be as follows:

|      | 1 | 2 | 3 | 4 | 5 | 6 |
|------|---|---|---|---|---|---|
| b0   | 1 | 1 | 1 | 1 | 1 | 1 |
| b1x1 | 1 | 2 | 3 | 4 | 5 | 6 |
| b2x2 | 2 | 7 | 6 | 15 | 10 | 12 |

The result of the matrix multiplication of X' (3x6) by X (6x3) would be the following 3x3 matrix

|      | b0 | b1x1 | b2x2 |
|------|----|------|------|
| b0   | 6  | 21   | 52   |
| b1x1 | 21 | 91   | 216  |
| b2x2 | 52 | 216  | 558  |

The result of inverting the matrix X'X would be as follows:

|      | b0 | b1x1 | b2x2 |
|------|----|------|------|
| b0   | 0.95108445 | -0.11213659 | -0.04522381 |
| b1x1 | -0.11213659 | 0.14859252 | -0.04706968 |
| b2x2 | -0.04522381 | -0.04706968 | 0.02422704 |

The result of the matrix multiplication of X' (3x6) by Y (6x1) would be the following 3x1 matrix

|      | y |
|------|------|
| b0   | 109  |
| b1x1 | 463  |
| b2x2 | 1108 |

Finally, the result of the matrix multiplication of (X'X)^-1 (3x3) by X'Y (3x1) would be the following 3x1 matrix:

|    | Values |
|----|-----------|
| b0 | 1.6409783 |
| b1 | 4.4222427 |
| b2 | 0.1209045 |

Substituting the coefficients back into the original equation yields the following regression equation:

y' = 1.6409783 + 4.4222427x1 + 0.1209045x2
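As a cross-check of the worked example, the normal equations X'X·B = X'Y can be solved directly with a small Gaussian-elimination routine. This is a plain-Python sketch for illustration only; the article's actual client-side solution uses R's `solve` function, and a real application would use a linear algebra library:

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.
    a is an n x n list of lists, b a length-n list; returns x as a list."""
    n = len(a)
    # build the augmented matrix [a | b]
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        # pivot: pick the row with the largest magnitude in this column
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        # eliminate this column from the rows below
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

XX = [[6, 21, 52], [21, 91, 216], [52, 216, 558]]  # X'X from the example
XY = [109, 463, 1108]                              # X'Y from the example
b = solve_linear(XX, XY)
# b is approximately [1.6409783, 4.4222427, 0.1209045]
```

Substituting the solved coefficients back into any of the three normal equations reproduces the corresponding right-hand-side value.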

The sequence of executing these operations can be encapsulated in an R language function. Before going through the details of the R function, we first need to explain how X'X and X'Y are represented in the extended SSCP matrix. For the example data set above, the resulting extended SSCP matrix is as follows:

| rownum | rowname | c | s | x1 | x2 | y |
|--------|---------|---|-----|-----|------|------|
| 1 | X1 | 6 | 21 | 91 | 216 | 463 |
| 2 | X2 | 6 | 52 | 216 | 558 | 1108 |
| 3 | Y | 6 | 109 | 463 | 1108 | 2371 |

The c column represents the count of the number of observations (rows). The s column represents the summation of the variable in that row, x1, x2 or y.

X'X can be built by taking the x1 and x2 columns of the X1 and X2 rows of the extended SSCP matrix and prepending the matrix row {c, s[1], s[2]} (6, 21, 52).

X'Y can be built by taking the y column of the X1 and X2 rows and prepending the Y row's s value (109). This results in the following matrices, which represent X'X (the first three columns) and X'Y (the last column):

| s | x1 | x2 | Y |
|----|-----|-----|------|
| 6 | 21 | 52 | 109 |
| 21 | 91 | 216 | 463 |
| 52 | 216 | 558 | 1108 |
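The assembly steps above can be mirrored in a short plain-Python sketch (lists standing in for the R data frame, purely for illustration) that computes the extended SSCP terms from the six raw observations and then carves out X'X and X'Y:

```python
# Raw observations from the example: columns x1, x2, y
rows = [(1, 2, 5), (2, 7, 14), (3, 6, 15), (4, 15, 20), (5, 10, 25), (6, 12, 30)]

n = len(rows)                    # observation count (the "c" column)
cols = list(zip(*rows))          # column-wise view: x1, x2, y
sums = [sum(c) for c in cols]    # the "s" column: sum of each variable
# cross-product block: sum of v_i * v_j for every pair of variables
cross = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

# X'X: first row is {count, x sums}; remaining rows prepend each x sum
# to that variable's x-by-x cross products (the y row/column is excluded)
XX = [[n] + sums[:2]] + [[sums[i]] + cross[i][:2] for i in range(2)]
# X'Y: the y sum followed by the x-by-y cross products
XY = [sums[2]] + [cross[i][2] for i in range(2)]
```

The resulting `XX` and `XY` match the X'X and X'Y matrices shown in the worked example above.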

## R Integration

The following R language function "td_lm_cm" encapsulates all of the processing details. The inputs to the function are "indata", a string naming a Teradata table or view containing the independent and dependent variables, and "Y", a string naming the dependent variable column. The output of the function is the matrix of b coefficients. The function first executes a Map/Reduce style query invoking CalcMatrix to produce the extended SSCP matrix; the result is stored in an R data frame named rdf. The remaining code performs the following steps:

1. Extracts the normal equation matrix X'X from the extended SSCP matrix.
2. Inverts X'X, using the solve function, to create the matrix iXX.
3. Extracts the normal equation matrix X'Y from the extended SSCP matrix.
4. Matrix multiplies iXX %*% XY to create the coefficient matrix B.
5. Finally, adds the dimension names to B and returns B to the caller.

```r
td_lm_cm <- function(indata, Y)
{
  # formulate the query to invoke CalcMatrix in a Map/Reduce model
  query <- gettextf("
    SELECT * FROM CALCMATRIX
    (
      ON (SELECT SESSION AS ampkey, D1.* FROM CALCMATRIX (
            ON (SELECT * FROM %s)
            USING PHASE('LOCAL')
          ) AS D1
         )
      HASH BY ampkey
      USING PHASE('COMBINE') CALCTYPE('ESSCP')
    ) AS D2
    ", indata)

  # execute the query and create an R data frame of the extended SSCP matrix
  tdf <- try(tdQuery(query))
  rdf <- data.frame(tdf)

  # some basic error checking
  if (length(names(tdf)) == 0 || nrow(rdf) == 0)
    stop("Input table does not exist or contain any rows")
  if (!Y %in% colnames(rdf))
    stop("Input table does not contain specified dependent variable column ", Y)

  # determine the position of the independent variable columns
  xcols <- which(!names(rdf) %in% c('rownum', 'rowname', 'c', Y))
  # extract partial X'X
  pXX <- rdf[rdf[['rowname']] != Y, c(xcols)]
  # extract observation count
  obscount <- rdf[rdf[['rowname']] == Y, c('c')]
  # extract X variable summations
  Xsum <- rdf[rdf[['rowname']] != Y, c('s')]
  # build first row of matrix X'X
  XX <- data.matrix(obscount)
  XX <- cbind(XX, t(data.matrix(Xsum)))
  # append partial X'X
  XX <- rbind(XX, data.matrix(pXX))
  # invert X'X
  iXX <- solve(XX)

  # extract Y variable summations
  Ysum <- rdf[rdf[['rowname']] == Y, c('s')]
  # extract partial X'Y
  XY <- rdf[rdf[['rowname']] != Y, Y]
  # build matrix X'Y
  XY <- rbind(data.matrix(Ysum), data.matrix(XY))
  # multiply inverted X'X by X'Y to obtain the coefficients
  B <- iXX %*% XY

  # add dimension names to matrix B
  rn <- rdf[rdf[['rowname']] != Y, c('rowname')]
  CL <- c(list("Intercept"), levels(factor(rn)))
  dimnames(B) <- list(CL, c("Values"))
  return(B)
}
```

## CalcMatrix

CalcMatrix is a table operator and therefore follows the table operator invocation syntax. Specifically, for this use case it is invoked as nested queries to implement the Map/Reduce model. The general example syntax is:

```sql
SELECT * FROM CALCMATRIX
(
  ON (SELECT SESSION AS ampkey, D1.* FROM CALCMATRIX (
        ON (SELECT * FROM indata)
        USING PHASE('LOCAL')
      ) AS D1
     )
  HASH BY ampkey
  USING PHASE('COMBINE') CALCTYPE('ESSCP') OUTPUT('COLUMNS')
) AS D2
```

The inner table operator execution which generates derived table D1 is the Map phase of the calculation. It has the following inputs:

• ON -> Any valid Teradata query which produces a set of numeric columns which represent the N independent variables and the one dependent variable
• USING -> PHASE('LOCAL'), the map phase processes the input data that is generated by the ON clause query on each AMP separately and outputs rows of summarized data for that AMP.  The output from the LOCAL phase consists of an INTEGER column named rownum, a VARCHAR(128) column named rowname, a BIGINT column named c (for count), a FLOAT column named s (for sum), and a FLOAT column for each data column of input.
• USING -> The CALCTYPE and OUTPUT using values in the LOCAL phase are always set to 'SSCP' and 'COLUMNS' respectively. Attempts to use different values for these keys are ignored.

The outer table operator execution which generates derived table D2 is the Reduce phase of the calculation. It has the following inputs:

• ON -> Output from inner map phase CalcMatrix invocation.
• HASH BY -> constant value used to redistribute rows to a single AMP. In this example we used the built in function SESSION to generate a distribution key.
• USING -> PHASE('COMBINE').  The input to the COMBINE phase must be exactly the columns that are output from the LOCAL phase.
• USING -> CALCTYPE and OUTPUT control the output columns for the COMBINE phase. If OUTPUT is set to 'COLUMNS' (the default), then the columns output from the COMBINE phase are: an INTEGER rownum column, a VARCHAR(128) rowname column, and a FLOAT column for each data column of input. In addition, if the CALCTYPE is 'ESSCP', a BIGINT column named c (for count) and a FLOAT column named s (for summation) are appended to the output columns.
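The LOCAL/COMBINE split can be illustrated with a small plain-Python sketch. The two "AMP" partitions and the summary layout below are illustrative assumptions, not the operator's actual internal format; the point is that per-partition summaries are computed independently (map) and then added element-wise (reduce):

```python
# Hypothetical split of the example rows (x1, x2, y) across two "AMPs"
amp1 = [(1, 2, 5), (2, 7, 14), (3, 6, 15)]
amp2 = [(4, 15, 20), (5, 10, 25), (6, 12, 30)]

def local_phase(rows):
    """Per-AMP summary: count, per-variable sums, and cross-product sums."""
    cols = list(zip(*rows))
    return {
        "c": len(rows),
        "s": [sum(col) for col in cols],
        "xp": [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols],
    }

def combine_phase(parts):
    """Reduce step: element-wise addition of the per-AMP summaries."""
    nvars = len(parts[0]["s"])
    return {
        "c": sum(p["c"] for p in parts),
        "s": [sum(p["s"][i] for p in parts) for i in range(nvars)],
        "xp": [[sum(p["xp"][i][j] for p in parts) for j in range(nvars)]
               for i in range(nvars)],
    }

total = combine_phase([local_phase(amp1), local_phase(amp2)])
# the combined summary equals the summary computed over all rows at once
```

Because counts, sums, and cross-product sums are all associative, the COMBINE result is independent of how the rows were distributed across AMPs.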

See the "SQL Functions, Operators, Expressions, and Predicates Release 14.10 B035-1145-112A" chapter 26 for more details on the CalcMatrix syntax.

The following provides more detail on the CALCTYPE USING clause value. CALCTYPE can be one of the following:

• 'SSCP' = Sum-of-squares-and-cross-products matrix (the default)

• 'ESSCP' = Extended sum-of-squares-and-cross-products matrix

• 'CSSCP' = Corrected sum-of-squares-and-cross-products matrix

• 'COV' = Covariance matrix

• 'COR' = Correlation matrix

The SSCP matrix contains the sums-of-squares-and-cross-products value for each pairwise combination of columns within the selected table. For each pairwise combination of variables X and Y it is calculated as:

SSCP(X,Y) = ΣXY

The extended SSCP matrix contains the same pairwise sums-of-squares-and-cross-products values as the SSCP matrix. In addition, the extended matrix contains an observation count column, which is a constant for all rows, and an observation summation column.

The corrected SSCP matrix differs from the SSCP matrix in that each value is expressed as a deviation from the mean. For each pairwise combination of variables X and Y it is calculated as:

CSSCP(X,Y) = Σ(X - mean(X))(Y - mean(Y))

The covariance matrix contains the covariance value of each pairwise combination of columns within the selected table. For each pairwise combination of variables X and Y it is calculated as:

COV(X,Y) = Σ(X - mean(X))(Y - mean(Y)) / (n - 1)

The correlation matrix analysis allows you to build a Pearson product-moment correlation matrix. A Pearson product-moment correlation value is calculated for each pairwise combination of columns within the selected table. For each pairwise combination of variables X and Y it is calculated as:

COR(X,Y) = (nΣXY - ΣXΣY) / sqrt((nΣX^2 - (ΣX)^2)(nΣY^2 - (ΣY)^2))

where n is the total number of observations. It should be noted that there is an existing aggregate function CORR which performs the same calculation, limited to two variables. To perform a correlation of 100 variables would require 10,000 CORR function invocations, which would not be as performant, nor could it be expressed syntactically in one statement. The correlation matrix can also be used to solve a linear regression, but we will leave the details of that for another discussion.
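The Pearson correlation underlying the COR matrix can be checked with a plain-Python sketch, here applied to the x1 and y columns of the earlier example data (a standalone illustration, not the operator's implementation):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation via the n-based computational
    formula: (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

x1 = [1, 2, 3, 4, 5, 6]
y = [5, 14, 15, 20, 25, 30]
r = pearson(x1, y)  # strongly positive, since y grows almost linearly with x1
```

Note that every sum in the formula (ΣX, ΣY, ΣXY, ΣX^2, ΣY^2, and n) is already present in the extended SSCP matrix, which is why CalcMatrix can derive the correlation matrix from the same aggregates.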

## R invocation example

The following describes how to use R as a client to execute a multiple variable linear regression. It documents all of the required R commands and then shows the actual R console interactions as a screen capture. R version 2.11.0 was used, running on a Windows client platform. The teradataR support library can be obtained from http://downloads.teradata.com/download . The input data is the same as used in the detailed example above. The DDL and DML are in the examplecode.txt file attachment.

R Commands:

```r
# connect to a Teradata database
tdConnect(dsn="mwwl1410", uid="mww", pwd="mww")
# create function td_lm_cm; obtain it from the examplecode.txt file attachment
# usage: td_lm_cm(tablename, dependent variable name)
td_lm_cm("testdata", "Y")
# let's use the built-in R lm function for comparison
tdfr <- td.data.frame("testdata")
# convert to an R data frame
mydata <- as.data.frame.td.data.frame(tdfr, size=10000)
# display the data
mydata
# run the R linear regression function
lm(y ~ x1 + x2, data=mydata)
```

The following R commands can be used to create a 3 dimensional plot of the data and regression plane for the solved equation. "scatterplot3d" was installed from a CRAN mirror site.

```r
library(scatterplot3d)
x <- mydata$x1
y <- mydata$x2
z <- mydata$y
s3d <- scatterplot3d(x, y, z, highlight.3d=TRUE, col.axis="blue",
                     main="MultiVariable Regression", xlab="X1", ylab="X2",
                     zlab="Y", axis=TRUE, pch=16, type="h", box=FALSE, angle=60)
fit <- lm(z ~ x + y)
s3d$plane3d(fit, col="Black")
```
