Learn Data Science

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Teradata
- :
- Data Science Blog Posts
- :
- Learn Data Science
- :
- Running independent jobs in parallel on Aster with...

04-08-2016
09:50 AM

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

04-08-2016
09:50 AM

Hardly a surprise to anyone Teradata Aster runs each SQL, SQL-MR, and SQL-GR command in parallel on many clusters and distributed data. But when faced with the task of running several independent jobs at once we have to do extra work to parallelize them in Aster. When running a SQL script the next job has to wait for the previous to finish. This makes sense where results of the previous job are passed down to next one, but it does not when the jobs are independent. For example, cross-validation of linear regression or other models is divided into independent jobs each working with its respective partition (of total *K* in case of K-fold cross-validation). These jobs could run in parallel in Aster with little help from R. This post will illustrate how to run ** K** linear regression models

Cross-validation is important technique in machine learning that receives its own chapters in the textbooks (e.g. see Chapter 7 here). In our examples we implement a *K*-fold cross-validation method to demonstrate how to run parallel jobs in Aster with R. The implementation of K-fold cross-validation that will be given is neither exhaustive nor exemplary as it introduces certain bias (based on month of the year) into the models. But this approach could definitely lead to a general solution for cross-validation and other problems involving execution of many similar but independent tasks on Aster platform.

Further more, the examples will be concerned only with the step in *K*-fold cross-validation that creates * K* models on overlapping but different partitions of the training dataset. We will show how to construct

We will use Dallas Open Data data set available from here (including Aster load scripts).

To simplify examples we will also use R package **toaster** for Aster and several other packages - all available from CRAN:

install.packages("toaster", "parallel", "iterators", "doParallel", "foreach")

Dallas Open Data has information on building permits across city of Dallas for the period between January 2012 through May 2014 stored in the table dallasbuildingpermits. We can quickly analyze this table from R with toaster and see its numerical columns:

library(toaster)

close(conn)

conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};server=10.xx.xx.xx;port=2406;database=dallas;uid=beehive;pwd=beehive",

interpretDot=TRUE)

dallasPermitsTableInfo = getTableSummary(conn, "dallasbuildingpermits")

getNumericColumns(dallasPermitsTableInfo)

which results in:

[1] "area" "value" "lon" "lat"

These 4 fields will make up our simple linear model to determine the value of construction using its area and location. And now the same in R terms:

value ~ area + lon + lat

This problem is not beyond R memory limits but our goal is to execute linear regression in Aster. We enlist **toaster**'s *computeLm* function that returns R *lm* object:

fit.all = computeLm(conn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo)

Lastly, we need to define the folds (partitions) on the table to build linear regression model on each of them. Usually, this step performs equal and random division into partitions. Doing this with R and Aster is actually not extremely difficult but will take us beyond the scope of the main topic. For this reason alone we propose **quick and dirty** method of dividing building permits into 12 partitions (** K=12**) using issue date's month value (in SQL):

SELECT DATE_PART('month',issued)::int fold, COUNT(*) fold_size

FROM public.dallasbuildingpermits

GROUP BY 1

ORDER BY 1

Again, do not replicate this method in real cross-validation task but use it as a template or a prototype only.

To make each fold's compliment (used to train 12 models later) we simply exclude each month's data, e.g. selecting the compliment to the fold 6 in its entirety (in SQL):

SELECT COUNT(*)

FROM public.dallasbuildingpermits

WHERE DATE_PART('month',issued)::int != 6

Before we get to parallel execution with R we show how to script in R Aster cross-validation of linear regression. To begin we use standard R * for* loop and

fit.folds = list()

for(fold in 1:12) {

fit.folds[[fold]] = computeLm(conn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo,

where = paste("date_part('month',issued)::int !=", fold))

}

This results in the list * fit.folds* that contains 12 linear regression models for each fold respectively.

Next, we replace the * for *loop with the specialized

library(foreach)

# running cross-validation sequantially with foreach function

fit.folds = foreach(fold = 1:12, .packages=c('RODBC', 'toaster')) %do% {

fitted.model = computeLm(conn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo,

where = paste("date_part('month',issued)::int != ", fold))

}

*foreach* performs the same iterations from 1 to 12 as * for *loop and combines results into list by default.

Finally, we are ready to enable parallel execution in R. For more details on using package **doParallel** see here, but the following suffices to enable a parallel backend in R on Windows:

library(doParallel)

cl = makeCluster(6)

registerDoParallel(cl, cores=6)

After that *foreach* with operator *%dopar%* automatically recognizes parallel backend ** cl **and runs its iterations in parallel:

fit.folds = foreach(fold = 1:12, .packages=c('RODBC','toaster')) %dopar% {

parConn = odbcReConnect(conn)

fitted.model = computeLm(parConn, "dallasbuildingpermits", value ~ area + lon + lat, tableInfo = dallasPermitsTableInfo,

where = paste("date_part('month',issued)::int != ", fold))

close(parConn)

fitted.model

}

Comparing with *foreach* *%do%* earlier notice extra handling for ODBC connection inside *foreach %dopar%*. This is necessary due to inability of sharing the same database connection between parallel processes (or threads, depending on the backend implementation). Effectively, with each loop we reconnect to Aster with a brand new connection by reusing original connection's configuration in function *odbcReConnect*.

Lastly, let's see if the whole thing was worth the effort. Chart below contains elapsed times (in seconds) for all 3 types of loops: * for* loop in R,

You are welcome! :-)

Labels:

You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.