Building Data Science Pipelines with R and Aster. Part 2: Environment and Data Load


Previously in this series

In my first post I explained the driving force (value), defined the goals (a repeatable, consistent, testable, deployable, robust, scalable and functional process), and outlined the structure (workflow) behind data science pipelines. In this post we'll begin examining data science pipelines through concrete examples with Teradata Aster by building a workflow in R. Here I focus on connecting to the data store and analytical engine (Aster), loading data, and test-driven development of workflow steps to achieve production-grade code.

Teradata Aster R Philosophy

Before we dive into the details, let me clarify the general approach underlying R development with Aster R. Any aspiring data scientist practicing R can immediately take advantage of Aster's powerful data store with RODBC and a client-server style of programming. Having said that, we should realize that such an approach would require:

  1. learning and practicing SQL with its Aster extensions like SQL/MR and SQL/GR;
  2. and/or moving data from Aster into R memory to utilize R functions and programming.

With Aster R, Teradata aims to overcome both by introducing functions that look and behave like familiar R but act on and execute inside the Aster data store, translating each call to native SQL and its extensions. Teradata designates such functions with the prefixes ta. and aa. It went even further by expanding R's apply functionality with functions like ta.apply() and ta.tapply(), which push R code to execute inside Aster nodes, taking full advantage of its capabilities but demanding additional configuration and advanced skills.
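For readers less familiar with R's apply family, base tapply() groups a vector by a factor and applies a function to each group; ta.tapply() offers the same grouped-apply semantics, except the computation runs inside the Aster database rather than in R memory. A plain-R illustration (no Aster involved), using the built-in mtcars data set:

```r
# Base R tapply(): mean miles-per-gallon grouped by cylinder count.
# ta.tapply() mirrors this pattern, but executes inside Aster nodes.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(mpg_by_cyl, 2)
#     4     6     8
# 26.66 19.74 15.10
```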

Teradata Aster R Environment

R programming with Aster is made possible by the TeradataAsterR package which, unlike other R packages available from CRAN, is downloaded directly from Teradata here. Hardly surprising, it requires the ODBC driver for Teradata Aster to be installed (also available from Teradata here as part of the Client Tools for your platform). For the record, installing TeradataAsterR will also install its dependency - the RODBC package - which maintains the data access layer with the Aster database. The following diagram illustrates both the data science pipeline execution environment and the Teradata Aster database:

Assuming we have already configured an ODBC data source Aster_ODBC_dsn that points to an Aster instance with the database my_Aster_db, initializing and connecting to Aster takes two lines of code:

library(TeradataAsterR)

ta.connect(dsn="Aster_ODBC_dsn", database="my_Aster_db")

From this point on TeradataAsterR implicitly maintains the connection, reviving and reconnecting to the database as necessary. For that reason no explicit connection object needs to be created, even though ta.connect() does return one to hold on to. Upon success Aster R reports basic information about the established connection:

RODBC Connection 1
Details:
   case=nochange
   DSN=Aster_ODBC_dsn
   DATABASE=my_Aster_db

Loading Data

With Aster R, data stored in the Aster Database is accessed using the concept of virtual data frames (primarily - there are other types of virtual objects in Aster R that we won't discuss here). The goal of the virtual data frame concept is transparent handling of data stored in Aster, using Aster R functions, just as if it were in R memory.

In our examples we'll use various baseball datasets from the excellent Lahman package. Our first task will be loading the datasets into the Aster database, for example the Master data frame with all baseball players:

library(Lahman)

ta.create(Master, table = "master", schemaName = "baseball",
          tableType = "fact", partitionKey = "playerID")

This example uses the function ta.create() to create a table named master in the baseball schema of the Aster Database. The Lahman package contains more datasets we'll use later, so we added a new function createBaseballData() to save us some typing:

createBaseballData <- function(data, table, schema, 
                               partitionKey, analyzeFlag) {
  t <- proc.time()
  # Make sure the table doesn't already exist
  ta.dropTable(tableName = table, schemaName = schema)
  ta.create(data, table = table, schemaName = schema,
            tableType = "fact",
            partitionKey = partitionKey,
            row.names = TRUE, analyze = analyzeFlag)
  # Return the elapsed time it took to create the table
  runtime <- proc.time() - t
  runtime
}

# Master table
createBaseballData(Master, "master", "public", "playerID", TRUE)

# Batting table
createBaseballData(Batting, "batting", "public", "playerID", TRUE)

# Fielding table
createBaseballData(Fielding, "fielding", "public", "playerID", TRUE)

Note that createBaseballData() is more involved: it records the time it took to create each table, it makes sure the table didn't exist before, and it runs the SQL ANALYZE command at the end if desired. Finally, it creates an implicit row name column, which will be a critical factor in letting Aster R work with virtual data frames. With R we take advantage of its rich programming features and environment to create functional, robust and testable code as part of the overall workflow.

Worth mentioning are the remaining functions that load data into the Aster Database:

  • ta.push() to load data from R into an existing Aster virtual data frame
  • ta.load.csv() to load data from a .csv file into an Aster virtual data frame
  • ta.reify() to materialize a virtual data frame into a temporary schema
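As a rough sketch of how these might fit together - the argument names below are assumptions modeled on the ta.create() call above, and a live Aster connection is required, so treat this as illustrative rather than runnable as-is:

```r
# Hypothetical usage sketch -- requires a live Aster connection.
# Append rows from an R data frame to an existing virtual data frame:
batting.ta <- ta.data.frame("batting")
ta.push(batting.ta, Batting)

# Bulk-load a CSV file into a virtual data frame (file path illustrative):
ta.load.csv("new_players.csv", table = "new_players", schemaName = "baseball")

# Materialize the (lazily evaluated) virtual data frame into a temporary schema:
ta.reify(batting.ta)
```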

Test, Test, Test

How do we assure that ta.create() performed as intended, that is, created an exact copy of the Lahman data set in the Aster Database? The same way any program should - by running a test. For example, by comparing the dimensions of the original data frame with the newly created table:

# R and Lahman package
dim(Batting); dim(Master)

[1] 101332     22

[1] 18846    26

# Aster R
batting.ta = ta.data.frame("batting")
master.ta = ta.data.frame("master")

ta.dim(batting.ta); ta.dim(master.ta)

[1] 101332     23

[1] 18846    27

ta.data.frame() is our principal gateway to Aster R goodies - it creates a virtual data frame from an existing table, view or SELECT query in the Aster Database - in this case we referenced the tables batting and master. ta.dim() is to Aster R virtual data frames what dim() is to R data frames. Each database table contains one additional column, row_name, created by Aster R to maintain functionality compatible with R data frames when using other data frame-like functions:

  • ta.head()
  • ta.tail()
  • ta.show()
  • ta.length()
  • ta.names()
  • ta.ncol()
  • ta.nrow()
  • ta.colnames()
  • ta.dimnames()
  • ta.merge()
  • ta.cbind()
  • ta.rbind()
  • ta.order()
  • and more...

Our test compares dimensions between a data source (one of the Lahman data sets) and its counterpart table in Aster. Such a rudimentary test may work while prototyping or on unimportant tasks, and it minimizes network traffic between R and Aster. But it is inadequate for the production environment we target with the pipelines. One way to make the workflow more bullet-proof is to bring the data back from Aster into R, transform it to a data frame and compare with the original:

library(dplyr)

compareDataInRandAster <- function(df, ta, key, columns=names(df)) {

  # Exclude columns stored as factors in the Lahman data sets
  columns = setdiff(unique(c(key, columns)), c("lgID","teamID","bats","throws"))

  # Order the in-memory copy by the key
  data_in_R = df[, columns] %>% arrange_(.dots=setNames(key,key))

  # as.data.frame() moves the data from Aster into R memory
  data_from_Aster = as.data.frame(ta[ , columns],
    stringsAsFactors = FALSE) %>% arrange_(.dots=setNames(key,key))

  all.equal(data_in_R, data_from_Aster)
}

compareDataInRandAster(Batting, batting.ta,
  key = c("playerID", "yearID", "stint"))

The function compareDataInRandAster() is a near-production-quality test that validates any Lahman data set against its image loaded into an Aster table. It utilizes:

  • package dplyr to order the data (for convenience; we could use standard order() instead, of course)
  • Aster R as.data.frame() to convert a virtual data frame to an R data frame by moving data from Aster into R memory
  • all.equal() to validate that the original data set and its image brought back from the database are exactly the same.
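As noted, base order() works too. A minimal dplyr-free sketch of the same idea - order both copies by a key, then compare - shown here with two in-memory data frames standing in for the R and Aster copies:

```r
# Compare two data frames irrespective of row order: sort both by the
# key column(s), reset row names, then test for exact equality --
# the same idea compareDataInRandAster() applies to Aster data.
compareByKey <- function(df1, df2, key) {
  d1 <- df1[do.call(order, df1[key]), , drop = FALSE]
  d2 <- df2[do.call(order, df2[key]), , drop = FALSE]
  rownames(d1) <- rownames(d2) <- NULL
  all.equal(d1, d2)
}

# Toy stand-ins: same rows, different order.
a <- data.frame(playerID = c("p1", "p2", "p3"), HR = c(10, 5, 7),
                stringsAsFactors = FALSE)
b <- a[c(3, 1, 2), ]
compareByKey(a, b, key = "playerID")  # TRUE
```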

With the R programming language, building tests (using a package like testthat) should be an integral part of your development process, and with data science pipelines we seize this opportunity. In particular, replace all.equal() with expect_equal() to make compareDataInRandAster() compatible with the rest of a testthat suite.

In the next post we will review the data manipulation techniques available with Aster R as part of the data science pipeline and compare them with equivalent functions from the dplyr package.

Resources

Image source: Red Hat Developer Program