R-in-Aster: Basic Prototyping and Script Conversion

Learn Aster
Teradata Employee

R for Big Data:  Integrating R with Teradata Aster


Basic Prototyping and Script Conversion

Intro

The process of converting an existing R script or creating a script from scratch is pretty easy.  The main requirement is an awareness of R Input / Output (I/O) techniques including database connection databases.  Or, if like me you skipped over the chapter on R I/O you can just copy the boilerplate R I/O code structures and paste them into your code.  This demo will focus on pasting in the boilerplate structures and tweaking a few parameters to get you up and running with a minimum of focus on the I/O structures themselves.

Caveat

The discussion in this particular document involves a simple case where the full data set is read at the beginning of the process and the full data set is exported at the end of the process.  Different structures will be used for larger data sets.  Additionally, there are additional structures for importing non-data frame objects or exporting non-data frame results.  These topics will be covered in subsequent blogs.

Three (or four) basic types of R-in-Aster scripts

I tend to think of R-in-Aster scripting as consisting of three basic types of script.

  • Prototype script
  • Stream script
  • Management script

There is one additional type of script I keep, a set of boiler plate code structures that I use as a resource to copy from in order to either build the prototype or to convert the prototype into a stream script ready to load into Aster.

  • The Copy of Useful R Structures Script

The Prototype script

The Prototype script is where you test out the data that you have loaded into Aster, develop your primary R "function", and test that you can export the results as a data frame.  The prototype might be a script that you already have on hand or it might be where you begin to code from scratch.

Both the prototype and the subsequent stream script consist of three main components:

  • Import / Input Process
  • Function
  • Export /Output Process

The primary input for an R script running in Aster will be an Aster table, the results of a previous Aster function, or another data source on the network such as Hadoop or a data warehouse.  All three types of input will act as a SQL table and will be imported into R as a data frame.   The prototype import process will establish an ODBC connection to Aster using the RODBC package and will use a SQL query to read the data source.

##############################################################

###   1 Prototype Structure: Import from Table via ODBC

##############################################################

### Load RODBC and establish database connection

library(RODBC)  # RODBC for database connections

odbcCloseAll()

aster <- odbcConnect(dsn = "aster", uid = "XXXXXX",

                     pwd = "YYYYYY", believeNRows=FALSE)

### Load data from table

inputData <- sqlQuery(aster, "

SELECT price, carat, cut, clarity, color FROM data_source;

" , stringsAsFactors=TRUE)

The prototype function should do several things. First it should read the "inputData" data frame object created in the import process.  Second, it should assign new column names to the data. Third, it should execute the desired function.  Fourth, it should pack the results into a data frame for export, in this case an object called outputData.

During the prototyping process the user should confirm that the data formats are correct for the importation. One example is the decision whether use stringsAsFactor set equal to TRUE or FALSE.  This is one parameter which should be reviewed later when pasting in the stream code.

##############################################################

###   2 Prototype Structure: Establish function

##############################################################

# Establish column names

price   <- inputData[,1]

carat   <- inputData[,2]

cut     <- ordered(inputData[,3], levels = c('Fair', 'Good', 'Very Good', 'Premium', 'Ideal'))

clarity <- ordered(inputData[,4], levels = c( 'I1' , 'SI2' , 'SI1' , 'VS2'  , 'VS1' , 'VVS2' , 'VVS1' , 'IF'))

color   <- ordered(inputData[,5], levels = c('D' , 'E' , 'F' , 'G' , 'H' ,  'I' , 'J'))

## Execute function

regr <- lm(price ~ carat + cut + clarity + color)

## Pack results as outputData dataframe

outputData <- cbind ( data.frame(price), data.frame(regr$fitted.values),
data.frame(regr$residuals))

The prototype export process is responsible for two things. First, this is where users will review the data structure to be exported using the R str() function.  Second, this is where we verify that the data can be exported using the write.table() function. The data for export should have already been packed into a data frame as the last step in the function creation process.

  • The column names will not be exported in the stream process, so do not export them in the prototype process.  Use row.names=FALSE
  • Use str() function to discover and note the R data types in the results.  Knowing the R data types will be necessary during the creation of the management script when the data types for the output are declared.
  • To use the write.table() function you will need to update the code with a valid file path. Please note that in R the slashes are forward slashes "/" and not the normal backward slash "\" that Windows uses for file paths.
  • Export with the parameter na="".  This will prevent R from exporting nulls or empty cells as "NA" which Aster would read as text data.

##############################################################

###  3 Prototype Structure: Export to Text File

##############################################################

### Check output structure for later conversion to SQL data types

str(outputData)

### Output to file or to screen

  1. write.table(outputData, file =   
    "C:/Users/RF/Documents/Aster/R/prototype/proto_test.txt",

   row.names=FALSE, col.names=FALSE, na="" , sep="\t" )

The Stream Script

If the prototype is working well then the conversion to a functional stream script is trivial. First, copy the boilerplate code for the stream import process and paste it over the prototype import code. Second, copy the boilerplate code for the stream export process and paste it over the prototype export code.

If the object naming is consistent between the prototype code and the stream code then only a basic review of the import and export parameters is necessary.

Note: If you used stringsAsFactors one way in your prototype then make sure you have it set the same way in the Stream version.


##############################################################

###   1 Stream Structure: Import from STDIN

##############################################################

stdin <- file(description="stdin",open="r")

inputData <- try(read.table(file('stdin'), header = FALSE, stringsAsFactor=TRUE, comment.char = "",

               sep = '\t', quote="", nrows= -1 ),  silent=TRUE  )

##############################################################

###   3 Stream Structure: Export STDOUT

##############################################################

write.table(outputData,stdout(), col.names=FALSE, row.names=FALSE, quote=FALSE , sep="\t")

Subsequent Blogs

Subsequent blogs will cover more complex cases and the following topics:

  • Management scripts
  • Partitioning and map-reduce
  • R data type to SQL data type conversion
  • Data chunking structures
    • While(True){}
    • End-of-data management
    • nrows = 1
  • Parameterization
    • commandArgs()
    • getopt()
  • Importing non-data frame objects
    • serialize
    • unserialize
  • Exporting non-data frame objects
  • NA versus NULL