The R and Python programming languages are very popular for statistical and mathematical tasks, and notable examples of open source software for analytics. Both are enjoying continuing development by means of new packages that add to specific functionality, and both have been gaining significant popularity and influence outside their traditionally academic boundaries of application.
In particular with regards to R, Teradata has recognized the popularity that the language was gaining in the late 2000’s by creating their first package for R, known as teradataR (see, for example, topics here and here). teradataR was built with refactored technology from Teradata Warehouse Miner that pushed popular R functions into the database for scalable processing against large volumes of data. Teradata continued to invest in scalable versions of R through its partnership with Revolution Analytics who produce the Revolution R Enterprise™ solution, and Fuzzy Logix who provide AdapteR™ to be used in conjunction with their DB Lytix™ software. Realizing that not all our customers wanted a commercial license of R, Teradata developed generic script processing operators for both the Teradata and Aster Databases and provide even more R processing options in 2015 for the Teradata Unified Data Architecture™.
As of the Teradata Database version 15.00, a new table operator (TO) is available that is called the SCRIPT Table Operator (STO). The STO can be used to execute user-installed scripts inside the Teradata Database. Essentially, this enables direct execution of external language scripts, such as R and Python, for in-database analytics with data fed from the Teradata Database.
To this end, as of April 2015 the Orange Book "R and Python Analytics with the SCRIPT Table Operator" is available as an introductory guide on the topic. Specifically, this Orange Book explores the capabilities of the STO and in-database analytics options with R and Python. The book explains how to install R and Python on Teradata Database systems of version 15.00 or higher; it contains selected examples of STO use cases of row-independent, partition-independent, and system-wide parallelization operations by using R and Python scripts; and it investigates in a step-wise manner how to use the STO with Teradata Warehouse Miner. The book is composed from a practitioner's perspective of performing tasks through sequential steps, and contains numerous notes and tips about running R and Python scripts on the database. The data and code for the examples are bundled in a compressed file pointed to by the book.
This Orange Book is available for Teradata associates at the Sharepoint website:
whereas Teradata customers and partners with "Knowledge Search" permission can request this Orange Book through Teradata @ Your Service at:
As of today 4/22/2015, an updated version B02 of the above Orange Book is available at the above sites that replaces the previous version A03. If you downloaded any previous version, it is recommended that you replace your previous document with the new one that contains more current information and links to updated example scripts, too.
It's out, it's new, it's bigger and better... You can visit the links in the main post above to obtain the C02 update version of this Orange Book. A brief changelog for the new version follows:
1. Contains fixes to several typos, and improves text clarity across the document. Some additional installation-related details are provided in Chapter 3.
2. Enriched examples in Chapter 6 "Advanced Topics":
trying to create a little more robust example of R invoked through teradata using SCRIPT
The following query creates a result set successfully:
SELECT DISTINCT T1,V1,V2,V3 FROM SCRIPT (SCRIPT_COMMAND('/opt/teradata/Rpackages/R-3.1.3/bin/R \ --vanilla --slave \ -e "write.table( data.frame( type=as.factor(sample(c(\"foo\",\"Bar\",\"BAZ\"), 10, replace=TRUE)), value1=1:10, \ value2=((1:20)/3), value3=pi), \ row.names=FALSE,col.names=FALSE,sep=\"\\t\")"') RETURNS('T1 varchar(10)', 'V1 Int', 'V2 Float', 'V3 Float') );
However when I try to use the R function
to create a column with random float data, I get an empty result set and an error.
SELECT DISTINCT T1,V1,V2,V3 FROM SCRIPT (SCRIPT_COMMAND('/opt/teradata/Rpackages/R-3.1.3/bin/R \ --vanilla --slave \ -e "write.table( data.frame( type=as.factor(sample(c(\"foo\",\"Bar\",\"BAZ\"), 10, replace=TRUE)), value1=1:10, \ value2=((1:20)/3), value3=runif(10)), \ row.names=FALSE,col.names=FALSE,sep=\"\\t\")"') RETURNS('T1 varchar(10)', 'V1 Int', 'V2 Float', 'V3 Float') );
Executed as Single statement. Failed [9134 : HY000] Error in function SCRIPT: SCRIPT_COMMAND returned exit value 1 - errmsg "Execution halted" Elapsed time = 00:00:00.468 STATEMENT 1: Select Statement failed.
I'm trying to understand what is different about the runif function and pi or the other float value (1:20)/3.
Any suggestions would be appreciated.
Other than using the new Teradata Analytics Platform, is this still the best and most efficient method to run R and/or Python with Teradata?
Python, yes, and perhaps simple R scripts too. Script is fine for scoring, but aggregation requires extra effort: for example, using an average() function in Script gives you the average for each AMP, but computing the overall average requires a sum() and a count().
Check out the ExecR() table operator, which enables running an R script as a table operator processing data on all AMPs in parallel. The External Routine Programming manual shows an example of computing K Means using ExecR: table operator functionality enables redistributing data across AMPs (in this case to just one AMP). In this way aggregations can be done on the whole data set, not just one AMP.