Predictive Analytics with Teradata AsterR

Learn Data Science
N/A

Summary

In this post I will show an example of an analysis process using the new Teradata AsterR package. For more information about what Teradata AsterR is and how it works, please refer to this excellent post from Roger Fried: http://aster-community.teradata.com/community/learn-aster/blog/2015/06/18/how-asterr-is-used-in-the-...


Information about the data set used in this analysis is located here: http://groupware.les.inf.puc-rio.br/har


The main idea behind that data was to collect data from 6 different individuals and then use that to try to predict how well barbell lift exercises were performed. Each individual had accelerometers on the belt, forearm, arm, and dumbbell when performing the exercises, and were asked to perform the exercises several times in 5 different ways. Each of those different ways is a representative of how well the exercise is performed.


In this process we use Teradata AsterR, from a local R client, to connect to a remote Teradata Aster cluster and perform some data exploration and predictive analysis on that data set. During the analysis we will also show some aspects of the familiar R interface provided by AsterR, through basic examples:



Loading and Examining the Data

Here we load the data on a local R Data Frame. This is for demonstration purposes only. Usually, data sets in Aster are too large to fit into the local memory of an R client and would have been already loaded into Aster using Aster’s native data loading capabilities:


data <- read.csv('pml-training.csv', stringsAsFactors = F)


Now we load the AsterR package and connect to the remote Aster environment to begin working. The connection is performed through RODBC. Here we assume there is an ODBC identifier already defined in the local OS with the alias 'Aster':

library(TeradataAsterR)

ta.connect('Aster')


We will be using a series of AsterR functions to perform the analysis, trying to minimize the materialization of local R objects. That is, data will be represented as AsterR Virtual Data Frames and we will operate on that using Aster native analytical functions. Those Aster functions are wrapped with an R interface by the AsterR package. We also run some R functions using the R Runners from the AsterR package that are executed in-database on Aster.


If we don’t need to persist the table on Aster, we can coerce a Virtual Data Frame directly from a local Data Frame. In this case, a temporary table will be created on Aster. We also see that the local R object created is of class 'ta.data.frame', which represents the AsterR Virtual Data Frame:


ta_data <- as.ta.data.frame(data)

class(ta_data)

[1] "ta.data.frame"


If we are working with a table that already exists in Aster, lets say a table named 'har_data' in the schema 'har', we can also create a Virtual Data Frame from it:


ta_data <- ta.data.frame('har_data', schemaName = 'har')

Now we execute some basic AsterR functions, to show the familiarity with basic data manipulation in the native R environment.


Show the dimensions of the Virtual Data Frame (19622 rows and 160 columns):


ta.dim(ta_data)


[1] 19622   160

Show the column names of the Virtual Data Frame (only the first 16 columns):


ta.names(ta_data)[1:16]

[1]  "X"                   "user_name"         "raw_timestamp_part_1" "raw_timestamp_part_2"

[5]  "cvtd_timestamp"      "new_window"        "num_window"           "roll_belt"          

[9]  "pitch_belt"          "yaw_belt"          "total_accel_belt"     "kurtosis_roll_belt" 

[13]  "kurtosis_picth_belt" "kurtosis_yaw_belt" "skewness_roll_belt"   "skewness_roll_belt.1"


Accessing rows and columns of the Virtual Data Frame in the R way (rows 1 to 5 and columns 8 to 11):


ta_data[1:5, 8:11]


  roll_belt pitch_belt yaw_belt total_accel_belt

1      1.41       8.07    -94.4                3

2      1.41       8.07    -94.4                3

3      1.42       8.07    -94.4                3

4      1.48       8.05    -94.4                3

5      1.48       8.07    -94.4                3


Compute summaries for the Virtual Data Frame columns (example for columns 8 to 11):


ta.summary(ta_data[, 8:11])

  roll_belt       pitch_belt           yaw_belt      total_accel_belt

Min.   :-28.90   Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00  

1st Qu.:  1.10   1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00  

Median :113.00   Median :  5.2800   Median : -13.00   Median :17.00  

Mean   : 64.41   Mean   :  0.3053   Mean   : -11.21   Mean   :11.31  

3rd Qu.:123.00   3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00  

Max.   :162.00   Max.   : 60.3000   Max.   : 179.00   Max.   :29.00

Preparing the Data

Now we show the Split-Apply-Combine paradigm in AsterR, which allows the execution of arbitrary R code on the Virtual Data Frames, following the MapReduce model. This is accomplished by the Aster R functions like ta.apply and ta.tapply, which mimic the native R functions apply and tapply, respectively. The difference is that the R code is executed in-database on Aster, leveraging Aster’s parallel execution.

We will define a local R function to count the occurrences of NA values. Then we apply it in all columns of the Virtual Data Frame and plot the results:


# function to count NAs in a vector

count_na <- function(x) {

  length(which(is.na(x)))

}

# apply the function to the columns of the Virtual Data Frame and return the results in a vector

na_counts <- ta.apply(ta_data,

                      MARGIN = 2,

                      FUN = count_na,

                      FUN.upload = 'script',

                      FUN.result =  'vector')

# plot the results

plot(na_counts)

Rplot1.png

We also do something similar to count the occurrences of empty strings. Notice that in this case we assume the default values for parameters 'FUN.upload' and 'FUN.result':


count_es <- function(x) {

  length(which(x==''))

}

es_counts <- ta.apply(ta_data, MARGIN = 2, FUN = count_es)

plot(es_counts)

Rplot2.png

From the plots above, we notice that the columns that have ‘NA’ or empty strings, have the same counts and that count seems to be almost the totality of data samples. Therefore we remove those variables from the data set. We also remove some initial columns that we know to be only metadata information. We then end up with 1 identification column ('X'), 52 predictor columns, and 1 class column ('classe'):

pred_names <- intersect(names(which(ta_na_counts==0)), names(which(ta_es_counts==0)))

pred_names <- pred_names[c(1,8:length(pred_names))]

pred_names

[1]  "X"               "roll_belt"       "pitch_belt"      "yaw_belt"     
[5]  "total_accel_belt""gyros_belt_x"    "gyros_belt_y"    "gyros_belt_z" 
[9]  "accel_belt_x"    "accel_belt_y"    "accel_belt_z"    "magnet_belt_x"
[13] "magnet_belt_y"   "magnet_belt_z"   "roll_arm"        "pitch_arm"    
[17] "yaw_arm"         "total_accel_arm" "gyros_arm_x"     "gyros_arm_y"  
[21] "gyros_arm_z"     "accel_arm_x"     "accel_arm_y"     "accel_arm_z"  
[25] "magnet_arm_x"    "magnet_arm_y"    "magnet_arm_z"    "roll_dumbbell"
[29] "pitch_dumbbell"  "yaw_dumbbell"    "total_accel_dumbbell"   "gyros_dumbbell_x"
[33] "gyros_dumbbell_y""gyros_dumbbell_z""accel_dumbbell_x""accel_dumbbell_y"  
[37] "accel_dumbbell_z""magnet_dumbbell_x""magnet_dumbbell_y""magnet_dumbbell_z" 
[41] "roll_forearm"    "pitch_forearm"   "yaw_forearm"     "total_accel_forearm"
[45] "gyros_forearm_x" "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
[49] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x""magnet_forearm_y"  
[53] "magnet_forearm_z""classe"       

We now subset the Virtual Data Frame to keep only the columns of interest for the analysis:

ta_data <- ta_data[pred_names]

ta.dim(ta_data)


[1] 19622    54

We create an index vector from the id column that represents a sample of 80% of the original data:

id <- ta_data$X

id_train <- ta.sample(id, sample_fraction = 0.8)

id_train <- as.vector(id_train)

And we create new Virtual Data Frames for training and testing, with training being 80% of the available data, and the remaining 20% for testing. Note that AsterR allows filtering the data in the standard R way, although the actual processing will run on Aster as a SQL query:

ta_data_train <- ta.subset(ta_data, id %in% id_train)

ta_data_test <- ta.subset(ta_data, !id %in% id_train)

ta.dim(ta_data_train)


[1] 15631    54


ta.dim(ta_data_test)


[1] 3991   54

Train and Validate a Predictive Model

Now we will run a classification task using Random Forest. We will train a model using the training data created and assess its performance on the test data. That is, in the context of this analysis we will see how well the model is capable of predicting the correct class (that represents how well an exercise was performed) for each record in the test data, using the measurements from the accelerometers.


The AsterR package wraps the native Random Forest functions in Aster through local R functions that are more familiar for R users. In the case of Random Forest, as for other classifier functions, we define an R formula that tells the model what are the outcome and predictor variables:

fml <- paste0(pred_names[2:(length(pred_names)-1)], '+', collapse='')

fml <- strtrim(fml, nchar(fml)-1)

fml <- paste0(pred_names[length(pred_names)], '~', fml)

fml

[1] "classe~roll_belt+pitch_belt+yaw_belt+total_accel_belt+gyros_belt_x+gyros_belt_y+gyros_belt_z

+accel_belt_x+accel_belt_y+accel_belt_z+magnet_belt_x+magnet_belt_y+magnet_belt_z+roll_arm+pitch_arm

+yaw_arm+total_accel_arm+gyros_arm_x+gyros_arm_y+gyros_arm_z+accel_arm_x+accel_arm_y+accel_arm_z

+magnet_arm_x+magnet_arm_y+magnet_arm_z+roll_dumbbell+pitch_dumbbell+yaw_dumbbell+total_accel_dumbbell

+gyros_dumbbell_x+gyros_dumbbell_y+gyros_dumbbell_z+accel_dumbbell_x+accel_dumbbell_y+accel_dumbbell_z

+magnet_dumbbell_x+magnet_dumbbell_y+magnet_dumbbell_z+roll_forearm+pitch_forearm+yaw_forearm

+total_accel_forearm+gyros_forearm_x+gyros_forearm_y+gyros_forearm_z+accel_forearm_x+accel_forearm_y

+accel_forearm_z+magnet_forearm_x+magnet_forearm_y+magnet_forearm_z"

We finally train a Random Forest model in Aster:

ta_forest_model <- ta.forest(formula = fml,

                             data = ta_data_train,

                             type = 'classification',

                             ntree = 100,

                             maxdepth = 50)

We now run the predict function on Aster to score our model on the test data:

ta_forest_pred <- ta.forest.predict(object = ta_forest_model,

                                    newdata = ta_data_test,

                                    id.col = 'X')

Now we get the vectors for expected classes (from the test data) and predicted classes (from the model), so that we can asses the model performance:


expected <- as.vector(ta_data_test$classe)

predicted <- as.character(ta_forest_pred$retval$prediction)

With the expected and predicted values, we create a Virtual Data Frame for use with the Confusion Matrix function in Aster:

ta_cm <- as.ta.data.frame(data.frame(expected, predicted))

We then compute the confusion matrix in Aster and print it:

cm <- ta.confusionMatrix(ta_cm, expected.column = 'expected', predict.column = 'predicted')

print(cm)

$model

   expect predict count    fraction

1       D       D   700 0.988700565

2       D       E     5 0.007062147

3       D       B     1 0.001412429

4       D       C     2 0.002824859

5       A       D     2 0.001774623

6       A       A  1118 0.992014197

7       A       B     4 0.003549246

8       A       C     3 0.002661934

9       B       D     4 0.005354752

10      B       A    11 0.014725569

11      B       B   725 0.970548862

12      B       C     7 0.009370817

13      E       D     4 0.005471956

14      E       E   721 0.986320109

15      E       B     4 0.005471956

16      E       C     2 0.002735978

17      C       D     5 0.007374631

18      C       E     1 0.001474926

19      C       B     5 0.007374631

20      C       C   667 0.983775811

If one wants a confusion matrix in the R way, plus other useful statistical metrics about the model performance, the 'caret' package is a good choice:

library(caret)

confusionMatrix(table(data.frame(expected, predicted)))


Confusion Matrix and Statistics

        predicted

expected    A    B    C    D    E

       A 1118    4    3    2    0

       B   11  725    7    4    0

       C    0    5  667    5    1

       D    0    1    2  700    5

       E    0    4    2    4  721

Overall Statistics

                                         

               Accuracy : 0.985          

                 95% CI : (0.9807, 0.9885)

    No Information Rate : 0.2829         

    P-Value [Acc > NIR] : < 2.2e-16      

                                         

                  Kappa : 0.981          

Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E

Sensitivity            0.9903   0.9811   0.9794   0.9790   0.9917

Specificity            0.9969   0.9932   0.9967   0.9976   0.9969

Pos Pred Value         0.9920   0.9705   0.9838   0.9887   0.9863

Neg Pred Value         0.9962   0.9957   0.9958   0.9954   0.9982

Prevalence             0.2829   0.1852   0.1706   0.1792   0.1822

Detection Rate         0.2801   0.1817   0.1671   0.1754   0.1807

Detection Prevalence   0.2824   0.1872   0.1699   0.1774   0.1832

Balanced Accuracy      0.9936   0.9871   0.9881   0.9883   0.9943