How Is AsterR Used in the "Data Discovery Process"?

Teradata Employee

What is AsterR?

AsterR is a Teradata-produced package installed within the R client application.  This package is distinct from, but complements, the installation of R within Aster.  Together, the AsterR package and the R installation in Aster create a rich environment that gives the R user the normal look and feel of R while retaining the power and speed of Aster.  Much of the new functionality in AsterR mirrors standard R functions while carrying out the operations and data storage within Aster.  All the Aster analytic functions may be executed from R using SQL, but many of them, such as nPath, cFilter, Minhash, and Random Forest, have been "translated" into a pure R look and feel.  In addition, AsterR provides pathways called the "R Runners" to push R code into Aster for execution.

How is AsterR Used? :  Database Integration

At its most basic level, AsterR provides R with database integration.  Simple DB integration is something R users try to achieve in a variety of ways because it begins to address some of R's most important weaknesses.  Typically, data is passed into and out of R via flat files; ODBC integration is awkward, R's file system is not open, and it even struggles with Excel files.  R also suffers from important limitations on the size of the data it can manage, and simple DB integration provides workarounds.

AsterR establishes a connection to Aster that uses ODBC for small data exchanges and mule copy for large data exchanges.  Using Aster as a simple database provides enterprise-quality security for accessing and subsequently landing data.  Aster is also a component of the larger Teradata Unified Data Architecture, so data can easily be sourced from Hadoop, an EDW, or other enterprise systems.

Example

Create Connection to Aster


AsterR <- ta.connect("aster", uid = "bob", pwd = "open_seseme",
                     database = "r", dType = "odbc")

Create an AsterR Virtual Data Frame from an Aster Table or a View to Hadoop or Teradata EDW


salaries <- ta.data.frame("salaries", schemaName = "baseball")


The AsterR functions listed below establish and control the simple database relationships between R and Aster.


How is AsterR Used? :  Data Exploration and Simple Analysis

Simple database integration solves some of R's problems, but it does not address the main problem.  R simply does not scale well.  On its own, R is not a big data tool.

AsterR makes data exploration with the full dataset easy. With AsterR the R user explores the whole table with typical R-like functions, but the data and any calculations remain on Aster, not the limited R client. Where possible (under the hood) some of these processes involve MPP processing, but what matters is that they deliver results on a “big data” dataset.

The AsterR functions listed below explore the data and perform simple statistical analysis.

Example

Correlation Analysis Optimized for Map-Reduce Execution


ta.cor(ta_frame = salaries,
       column_pairs = 'yearid:salary',
       print_query = FALSE)
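
For readers without an Aster connection, the analogous pairwise correlation can be sketched on a small in-memory data frame with base R's cor().  This is only a local illustration: the toy values are made up and are not taken from the baseball salaries table, and the data frame name toy_salaries is hypothetical.

```r
# Toy data (made-up values) with the same two column names as the example above.
toy_salaries <- data.frame(
  yearid = c(2001, 2002, 2003, 2004, 2005),
  salary = c(100, 130, 125, 170, 180)
)

# Pearson correlation between the two columns, computed in the R client.
r <- cor(toy_salaries$yearid, toy_salaries$salary)
print(r)
```

The difference in the AsterR version is where the work happens: the table and the calculation stay on Aster, so the R client never has to hold the full dataset.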

How is AsterR Used? :  Data Transformation

The next step for most users is to restructure the data into the right formats for the more interesting analytic processes.  This data transformation is quite easy and common in R, but as the data size grows these operations can quickly get bogged down.

In AsterR, merging billion row tables, adding new columns, transposing the data, and recoding the information becomes extremely easy and fast.
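
As a rough sketch of the kinds of operations meant here, the base R snippet below merges two tables, adds a column, recodes a value, and transposes a summary.  It runs entirely in the R client on made-up toy data; the table and column names are hypothetical, and it does not use the AsterR API, which would carry out the equivalent steps inside Aster.

```r
# Toy tables (made-up values) standing in for much larger Aster tables.
players <- data.frame(playerid = c("p1", "p2", "p3"),
                      team     = c("NYA", "BOS", "NYA"))
pay     <- data.frame(playerid = c("p1", "p2", "p3"),
                      salary   = c(500000, 1200000, 3000000))

# Merge the two tables on their shared key.
merged <- merge(players, pay, by = "playerid")

# Add a derived column.
merged$salary_millions <- merged$salary / 1e6

# Recode the numeric salary into two labelled bands.
merged$pay_band <- cut(merged$salary,
                       breaks = c(0, 1e6, Inf),
                       labels = c("up_to_1m", "over_1m"))

# Summarise salary per team, then transpose so that teams become columns.
by_team <- aggregate(salary ~ team, data = merged, FUN = sum)
wide <- t(by_team$salary)
colnames(wide) <- by_team$team

print(merged)
print(wide)
```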

How is AsterR Used? :  Data Mining

Now we come to the higher-level analysis. Here we typically want to describe, classify, or cluster the data. AsterR's functions allow us to apply many of the most useful methods directly to the data while it is in Aster.

The data mining functions are native Aster SQL-MR functions, but they have been repackaged into a pure R look and feel for easy execution within AsterR.

How is AsterR Used? :  Passing R Functions and Code for Aster Execution

While AsterR has many different analytic functions, R has more. To make use of these existing R functions (which may include R code built by users), AsterR has a set of functions that take math, logic, or R functions and apply them within Aster.

Many of the passing functions below make use of the way the apply family of functions works within R.  As is typical of R, they use vectorized operations where the desired function is executed along the vector.  The difference in AsterR is that, with these functions, the execution of the R function occurs in Aster rather than in R itself.

P_Value Example:


Define the R function

t.test.no.factor <- function(y) {
  p_value <- t.test(y[, 1], y[, 2])$p.value
  return(p_value)
}

Pass R Function to Aster and execute as vectorized operation


t.test.result <- ta.tapply(tadf[, 1:2], INDEX = tadf[, 3],
                           FUN = t.test.no.factor)
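
To see what the split-apply-combine pattern does without an Aster connection, here is a minimal base R sketch of the same idea.  It reuses the t.test.no.factor function from above, but the data frame df, its column names, and its values are made up for illustration, and split() plus vapply() stand in for ta.tapply.  Everything here runs in the local R client rather than in Aster.

```r
# Made-up data: two numeric columns and a grouping column in position 3.
set.seed(42)
df <- data.frame(
  before = rnorm(60, mean = 10),
  after  = rnorm(60, mean = 10.5),
  group  = rep(c("a", "b", "c"), each = 20)
)

# Same function as in the example above: p-value of a two-sample t-test.
t.test.no.factor <- function(y) {
  t.test(y[, 1], y[, 2])$p.value
}

# Split: one two-column data frame per group.
pieces <- split(df[, 1:2], df$group)

# Apply and combine: run the function on each piece, collect a named vector.
p_values <- vapply(pieces, t.test.no.factor, numeric(1))
print(p_values)
```

The result is one p-value per group.  In the AsterR version, the splitting and the function calls are carried out on Aster.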



Not all R code/functions will work in a parallel manner, but in most cases this will not matter. Aster can process non-parallel operations on a single "vWorker", which on its own has the size and capacity of a typical server. In most cases, R users will not be processing billions of rows, but merely millions or hundreds of thousands. In these normal cases, the difference between the full MPP optimal power and the single-threaded vWorker power is a difference of seconds.

Diagram:  The "R Runners" are a set of the passing functions that use R's own split-apply-combine paradigm to execute R code in Aster and, where possible, use Map-Reduce execution allocating processes to multiple Aster vWorkers.

How is AsterR Used? :  Power Users Blending R and SQL

Another broad use case is the fluid integration of SQL and R (within the R client) for those users who know both.  R has certain strengths, Aster itself has analytic strengths, but adding Aster’s basic SQL foundation can take things to the next level, especially for data transformation.

Diagram of AsterR Architecture