
Learn Data Science

Explore and Discover

The latest articles, videos, and blog posts speak to those interested in analytics-based decision making. Browse all the content in the Teradata Data Science Community to gain valuable insights.


Data science can be an adventure in every possible way - just ask any employee who has been tasked with solving data science problems. Did you know that the whole zen of data science thrives on Trial/Error AND a culture of failing fast?

If you talk to people in a company's analytics department, you will find that they approach business problems in one of two ways:

  1. I will try stuff that I know and can probably produce a visual. I've done it many times before, so it should work. I know I could try this new stuff, but I'm not sure what would come out of it, so I'd skip that crazy idea. Just by looking at the data I can tell the visual insight will suck or I will fail miserably.
  2. I don't know what the visual will look like when I try this. I'm willing to go for it anyway! I want to see what happens and don't want to guess. The data doesn't look that interesting to me at the outset, but I'm willing to create a visual just for the fun of it, even if it comes out boring. What if it turns out to be something useful?

Approach #2 is what makes up the Trial/Error and Fail Fast culture. Imagine putting a super-fast iterative tool (Teradata Aster Analytics or Apache Spark) in the hands of the person who practices approach #2 above!

The Trial/Error and Fail Fast culture doesn't mean data scientists are unwilling to try time-tested methods. It only means they are willing to take a lot of 'quickfire' risks for better results!

Add a bit of luck, and it's off to the next iteration: fail fast and keep building!

A bit more on Trial/Error and Fail Fast. What exactly is it?

The Trial/Error and Fail Fast approach means trying things with very few expectations about the outcome. Of course, the business outcome should be the eventual goal. We are willing to try something quickly and expect not to be rewarded immediately, and we don't give up just because the first attempt failed to produce an interesting result. Nine times out of ten we are fumbling, but we're willing to get lucky once without giving up, and that one time often proves to be the most valuable and actually works. Most successful data scientists will choose a fail-fast tool for their trial-and-error pursuits. The more we allow ourselves to fail quickly, the sooner we stumble into something incredibly useful.


Causality Detection vs Quantitative Modeling

From a 10,000-foot point of view, most data science problems have two aspects:

  • Causality Detection - find the root cause of the problem or situation.
  • Quantitative Modeling - try to predict an outcome after learning from a bunch of data. You don't need to know the cause of the problem to make a prediction, just a model built from different variables. Done correctly, the algorithms take care of mapping inputs to the outcome and will handle the prediction automatically.

Both of the above require a bit of creativity. Causality Detection is by far the harder of the two, as it requires a lot of domain knowledge and some cleverness. It's great to know that I can predict a part failure 8 out of 10 times, but knowing why and getting to the root cause is a completely different animal. You can get away with not being a domain expert in Quantitative Modeling; with Causality Detection, only a domain expert can say definitively that A leads to B.

Applying the Trial/Error and Fail Fast approach to Quantitative Modeling means iteratively trying different algorithms, model parameters, features in the data, and new data sources until we reach our accuracy goal *REALLY QUICKLY*. There is now a systematic method to some of these techniques, but it still requires TRYING many things before something works.

Causality Detection, as mentioned earlier, is a bit different. We can try and fail fast on a few A/B testing approaches, but it requires careful navigation through multiple steps, with each step taken ever so carefully and surely. Causality Detection is about eliminating uncertainty as we get closer and closer to the root cause.

Working in an uncertain environment

In unknown situations or problems, most folks want a cookie-cutter approach - unfortunately, data science brings a lot of uncertainty to the table. Even with techniques like Deep Learning, which can work out of the box from a random starting configuration, getting to the next level often proves challenging and tricky. As architectures become more complex, the science often depends on the trial-and-error art form, relying on the creative data scientist's efforts in addition to best practices developed over time.

Hope you enjoyed the blog post.

  • Aster Blog

Using an agile approach, a cross-functional team of doctors, cancer researchers, data scientists, data visualization experts, and technologists set out on a mission to understand over 1,000 genetic patterns of cancer in order to develop personalized medical treatments aligned to the genetic makeup of humans.

Decoding the human genome is the next frontier of science and discovery in medicine. Today, the combination of data, analytics, and visualization tools represents cutting-edge innovation in life sciences. View the video below.

Genome World Window - Stephen Brobst and Andrew Cardno

Video Link : 1092

Art of Analytics - Genome World Window - YouTube 

________________

Related Links:

Data Scientist

Data Visualization


Combining the collaborative expertise of data scientists, geophysicists, and data visualization experts, an integrated oil company developed new understandings of complex reservoir management with data and analytics. This business case easily extends to other industries focused on asset utilization and optimization.

The Sailor - Duncan Irving 

Video Link : 1091

Art of Analytics: The Sailor - YouTube 

______

Related Links

Data Visualization


Fusing business acumen, data science, and creative visualization, the Burning Leaf of Spending enabled a major bank to detect anomalies in customer spending patterns that indicate major life events, and provided artful insights into the personalized service required to enhance the customer experience, improving lifetime value.

Burning Leaf of Spending - Christopher Hillman 

Video Link : 1090

________________

Related Link:

Detecting Anomalies


Advanced analytic techniques, combined with innovative data visualization, identify suspected claims fraud, enabling a property and casualty insurer to automate new business rules that improve the fraud investigation process and mitigate risk.

Text Mining

Graph Analysis

Cluster Analysis

Predictive Analytics

These are the techniques behind Fraud Invaders, delivering business value in risk mitigation to any industry that needs to identify exposure across the enterprise - giving companies the ability to detect suspicious fraud at higher rates with new information sources and increased automation, while avoiding costs and liability.

Art of Analytics: Fraud Invaders - Christopher Hillman 

Video Link : 1089

Other Content by Christopher Hillman 

________________

Related Links:

Text Mining 

Graph Analysis

Cluster Analysis

Predictive Analysis

Detecting Fraud


Adapting analytic techniques that normally follow the "Customer Journey," Teradata Think Big consultants and data scientists use data and analytics to visualize and identify the "Human Journey," allowing Buttle UK to identify and fulfill the needs of those at risk.

Video Link : 1088


Aster's nPath function is a powerful tool for uncovering patterns in sequential data.  There are many examples here on the Aster Community for writing those queries and visualizing the output.  But that's not all Aster has to offer for path analysis.  There are more steps that you can take to create even more value.  I'm going to show you a few easy queries that will take the output from nPath and start to build out a model that may be used to predict the next step on a path.

I'll use the demo data set that ships with Aster Express, bank_web_clicks, though really any output from your own nPath analysis will work just as well.  Here's a simple Tree diagram that I've created from this data, showing a handful of paths that customers have taken on this web site.
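
(If you need to build the bank_paths table that feeds the queries below, a query along these lines will generate it from the raw clicks. Treat it as a sketch: the column names customer_id, page, and datestamp, and the catch-all pattern, are assumptions about the bank_web_clicks demo schema, so adjust them to your own data.)

create table bank_paths
   distribute by hash(path)
as
select path
from nPath(
   on bank_web_clicks
   partition by customer_id   -- one path per customer (assumed key column)
   order by datestamp         -- assumed click timestamp column
   mode(nonoverlapping)
   pattern('P+')              -- any sequence of one or more page views
   symbols(true as P)
   result(accumulate(page of any(P)) as path)  -- e.g. [ACCOUNT SUMMARY, FAQ]
);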

These Tree diagrams are incredibly useful for visualizing paths and helping to uncover patterns for deeper analysis. Additionally, having an underlying table with metrics around parent nodes, child nodes, counts, and percentages could form the foundation for a model that predicts likely outcomes for other customers on similar journeys, allowing you to answer questions such as "what's the next likely step my customer will take?" In a few steps, I'll show how to build out a table with rows like this, showing parent paths and next steps with a calculated probability:

Parent | Next | Probability
ACCOUNT SUMMARY, BILL MANAGER FORM | BILL MANAGER ENROLLMENT | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT | FAQ | .59
ACCOUNT SUMMARY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | .54
ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT | ACCOUNT HISTORY | .42

Introducing the Aster path_analyzer function. It takes the output from your nPath query, with 'group by' counts for each path, and breaks out all the paths and counts from each parent node, working its way down the full depth of the tree. Getting the "top paths and counts" from nPath is a standard part of the normal nPath workflow and is done easily with a simple query. Let's build a new table called 'top_bank_paths'. Here I'll limit it to the top 1,000 paths. (Since these tables are small, I'm using 'replication' as my distribution.)

create table top_bank_paths
   distribute by replication
as
select path,count(*) as cnt
from bank_paths
group by 1
order by 2 desc
limit 1000
;

With results:

path | cnt
[ACCOUNT SUMMARY, FAQ] | 2854
[ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS] | 2830
[ACCOUNT SUMMARY, FUNDS TRANSFER] | 2746
[ACCOUNT SUMMARY, ACCOUNT SUMMARY] | 2715

We will now use these as input to path_analyzer:

SELECT * FROM Path_Analyzer
(
    ON (SELECT 1)
    PARTITION BY 1
    InputTable ('top_bank_paths')
    OutputTable ('bank_path_analyzer_output')
    --SeqColumn ('path')  Aster v7.0
    --CountColumn ('cnt') Aster v7.0
    Seq('path')  --Aster v6.20
    Cnt('cnt')   --Aster v6.20
);

The syntax changed slightly for the Seq and Cnt parameters with Aster Analytics 7.0, so I've included both above; simply comment/uncomment based on your library version. Also, if you want to rerun path_analyzer, you'll need to first drop your output table. In this example, that would be:

     drop table bank_path_analyzer_output;

The output is a bit verbose, so here are the important fields:

parent | children | subpath_cnt | depth
^,[ACCOUNT SUMMARY | [(^,[ACCOUNT SUMMARY, BILL MANAGER, ACCOUNT HISTORY]), (^,[ACCOUNT SUMMARY, BILL MANAGER, ACCOUNT SUMMARY]), (^,[ACCOUNT SUMMARY, BILL MANAGER, FAQ]), (^,[ACCOUNT SUMMARY, BILL MANAGER, FUNDS TRANSFER]), (^,[ACCOUNT SUMMARY, BILL MANAGER, VIEW DEPOSIT DETAILS])] | 70 | 2

 

With this 'not too complex' SQL, we can now build a table with each parent node, its corresponding 'next child node', and a percentage calculation. I'm doing a little string manipulation: skipping the leading '^,[' prefix in the parent format (and likewise in the 'next_node' field) and trimming the trailing ']' character, which represents 'end of path'. I'm also using a window function, row_number(), to rank each child by percentage so that we can later filter by the top 1 or 2 child nodes:

create table top_bank_next_path
    distribute by replication
as
select substring(N.parent from 4) as parent
     ,P.depth
     ,replace(substring(N.node from (3 + length(N.parent))),']','') as next_node
     ,P.subpath_cnt as parent_cnt
     ,N.subpath_cnt as next_cnt
     ,(1.0 * N.subpath_cnt / P.subpath_cnt)::numeric(3,2) as probability
     ,row_number() over (partition by N.parent order by N.subpath_cnt desc) as rank
from bank_path_analyzer_output N
    ,bank_path_analyzer_output P
where P.depth > 0
and   N.parent = P.node
;

parent | depth | next_node | parent_cnt | next_cnt | probability | rank
ACCOUNT SUMMARY, ACCOUNT SUMMARY, FAQ, ONLINE STATEMENT ENROLLMENT | 4 | VIEW DEPOSIT DETAILS | 11 | 11 | 1 | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | 4 | ACCOUNT HISTORY | 32 | 18 | 0.56 | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | 4 | ACCOUNT SUMMARY | 32 | 14 | 0.44 | 2
ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS, FUNDS TRANSFER, ACCOUNT HISTORY | 4 | FUNDS TRANSFER | 48 | 19 | 0.4 | 1
ACCOUNT SUMMARY, FAQ, FAQ, FAQ | 4 | ACCOUNT SUMMARY | 58 | 22 | 0.38 | 1

And finally, drum roll, here are some answers to the question "what is my customer likely to do next based on where they are in their journey?" In this next query, I'm using parameters like 'probability greater than 30%', 'more than 20 similar paths have occurred', and, perhaps most importantly, 'where rank equals 1', which means it's the most likely next step for each parent path.

select parent,next_node
     ,parent_cnt,probability
from top_bank_next_path
where rank = 1
and probability > .3
and parent_cnt > 20
order by probability desc
limit 10
;

parent | next_node | parent_cnt | probability
ACCOUNT SUMMARY, BILL MANAGER FORM | BILL MANAGER ENROLLMENT | 33 | 1.0
ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT | FAQ | 29 | 0.59
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | ACCOUNT HISTORY | 32 | 0.56
ACCOUNT SUMMARY, ACCOUNT SUMMARY, ONLINE STATEMENT ENROLLMENT | FUNDS TRANSFER | 27 | 0.56
ACCOUNT SUMMARY, CUSTOMER SUPPORT, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | 27 | 0.56
ACCOUNT SUMMARY, PROFILE UPDATE, CUSTOMER SUPPORT | ACCOUNT HISTORY | 27 | 0.56
ACCOUNT SUMMARY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | 24 | 0.54
ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT | ACCOUNT HISTORY | 43 | 0.42
ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS, FUNDS TRANSFER, ACCOUNT HISTORY | FUNDS TRANSFER | 48 | 0.4
ACCOUNT SUMMARY, PROFILE UPDATE, FUNDS TRANSFER | FUNDS TRANSFER | 64 | 0.39

There it is: a pretty straightforward, and I think very powerful, new step to add to your Aster nPath workflows, creating a very useful new table that can answer many questions from your pathing data.
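
For example, to look up the most likely next step for a customer whose journey so far matches a given parent path (the literal path below is just for illustration), the lookup is a simple filter against the new table:

select parent, next_node, probability
from top_bank_next_path
where parent = 'ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT'
and rank = 1
;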

Enjoy!


Did you miss the live webinar? You can listen to the replay featuring Matt Mazzarell, Data Scientist, Teradata.

Video Link : 1086

Unable to View Video here?  Use this link:  Combining Text and Graph Analytics for More Context in Your Networks featuring Matt Mazzarell - YouT... 


Video Details:

    Adding more techniques in text analytics can better inform you of the working relationships in a network. Topic analysis, sentiment extraction, and predictive analytics have helped large organizations solve complex problems using large volumes of communication data from various sources. This webinar session explores some examples of how customers have used these techniques to add more context to their networks.

 

Matt Mazzarell is a Data Scientist in the Teradata Aster Solutions team with experience implementing analytic solutions for many clients in the Fortune 1000 who seek to solve business problems using large and varied data sets. He has experience collecting customer requirements and prescribing specific analytic techniques across all industry verticals. Matt is responsible for several pre-packaged solutions including Communications Compliance, Product Recommender, and Healthcare Fraud. He is dedicated to enabling the non-technical user to perform analytics at scale that drive significant value for the business.

 

 

 


Organization Network Analysis (ONA) In Action -  Connected Commons Webinar Series


Some of you at Partners 2016 wanted to play with the QGRAM function for text matching, in addition to the IdentityMatch() function. Here I explain how to invoke it so you can add it to your environment and try it out.

Algorithm Background

Read more about QGRAMs, Cosine Similarity etc., in my previous blog post to understand different text similarity techniques.

https://aster-community.teradata.com/community/learn-aster/blog/2015/06/21/data-science--fuzzy-match...

In this blog post, I only explain a simple use case of matching short text, especially merchant names in transactions.

Overview

If a business has transaction data containing merchant names, it's a common problem to 'disambiguate' those names and map them to a yellow-page/registered business name. If you have millions of business names that mutate and morph over time in the data, this is nearly impossible with traditional techniques such as string matching, regular expressions, SQL LIKEs, etc. This post explains how we solved it using a technique called 'Weighted QGRAMS with Cosine Similarity'. The QGRAM SQL/MR function was created by Aster Professional Services and is combined with the 6.10 analytic function VectorDistance(), which computes the cosine similarity metric. The approach uses a probabilistic technique to break the merchant names down into Q-grams (letter n-grams), weight each Q-gram based on its position in the string, and then apply cosine similarity to find the best match.
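
To get an intuition for the intermediate representation, you can run the function over a single literal name once it is installed. This is only a sketch based on the invocation used in the workflow below; it assumes qgram() accepts an inline one-row input and operates on the varchar column it finds, so treat it as illustrative rather than definitive:

select qgram, weight
from qgram(
   on (select 1 as id, 'Walmart Super M' as bizname)  -- hypothetical one-row input
   decay(0.95)   -- positional decay: earlier letter 3-grams get higher weight
   gram(3)       -- break the name into letter 3-grams
)
order by weight desc;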

 

SQL/MR Functions Used:

QGRAM() must be installed in your cluster schema. Download the SQL/MR function QGRAM and install it in your schema, or request your DBA to install it in your cluster using:

\install qgram.zip

from act. 

VectorDistance() is available in the later Analytic foundation libraries and should be used in conjunction with QGRAM.

Inputs:

Yellow Page Dictionary (standardized names you want to rollup to):

 

id | merchant
1 | Ace Mortgage Inc
2 | Ace Credit Card Inc
3 | Ace Credit Bank
4 | Ceasar Frozen Yogurt LLC
5 | 8/12 Convienence Store
6 | Teradata Corp
7 | Walmart Super Market
8 | Apple
9 | Wholefoods
10 | San Mateo City

 

Merchant names that need to be 'disambiguated' or fuzzy joined with above table:

id | bizname
1 | Ace MG
2 | ACC
3 | ACB
4 | Ceasar Frozen Yogurt
5 | 8/12 Store
6 | Teradata Corporation
7 | Walmart Super M
8 | apple
9 | Whole Foods Market
10 | San Mateo City

Final output (after running the weighted QGRAM/cosine similarity workflow in the next section):

bizname | merchant | cos_score (or fuzzy score/confidence)
Ace MG | Ace Mortgage Inc | 0.435152
ACC | Ace Credit Card Inc | 0.0549411
ACB | Ace Credit Bank | 0.0149411
Ceasar Frozen Yogurt | Ceasar Frozen Yogurt LLC | 0.960521
8/12 Store | 8/12 Convienence Store | 0.566962
Teradata Corporation | Teradata Corp | 0.844375
Walmart Super M | Walmart Super Market | 0.890216
apple | Apple | 0.757017
Whole Foods Market | Wholefoods | 0.545401
San Mateo City | San Mateo City | 1.0

SQL, SQL/MR Workflow code to try in your Teradata Studio:

drop table if exists bizname_yellow_pages;
create table bizname_yellow_pages
(
 id bigint,
 merchant varchar
) distribute by hash(id);

insert into bizname_yellow_pages values (1,'Ace Mortgage Inc');
insert into bizname_yellow_pages values (2,'Ace Credit Card Inc');
insert into bizname_yellow_pages values (3,'Ace Credit Bank');
insert into bizname_yellow_pages values (4,'Ceasar Frozen Yogurt LLC');
insert into bizname_yellow_pages values (5,'8/12 Convienence Store');
insert into bizname_yellow_pages values (6,'Teradata Corp');
insert into bizname_yellow_pages values (7,'Walmart Super Market');
insert into bizname_yellow_pages values (8,'Apple');
insert into bizname_yellow_pages values (9,'Wholefoods');
insert into bizname_yellow_pages values (10,'San Mateo City');

drop table if exists operational_biznames;
create table operational_biznames
(
 id bigint,
 bizname varchar
) distribute by hash(id);

insert into operational_biznames values(1,'Ace MG');
insert into operational_biznames values(2,'ACC');
insert into operational_biznames values(3,'ACB');
insert into operational_biznames values(4,'Ceasar Frozen Yogurt');
insert into operational_biznames values(5,'8/12 Store');
insert into operational_biznames values(6,'Teradata Corporation');
insert into operational_biznames values(7,'Walmart Super M');
insert into operational_biznames values(8,'apple');
insert into operational_biznames values(9,'Whole Foods Market');
insert into operational_biznames values(10,'San Mateo City');

drop table if exists qgram_yp;
create table qgram_yp distribute by hash(qgram)
as
select id, qgram, weight
   from qgram(
       on bizname_yellow_pages
       decay(0.95)
       gram(3)
);

drop table if exists qgram_ops;
create table qgram_ops distribute by hash(qgram)
as
select id, qgram, weight
 from qgram(
     on operational_biznames
     decay(0.95)
     gram(3)
);

drop table if exists cosine_similarity_match;
create table cosine_similarity_match distribute by hash(target_id)
as
select * from VectorDistance(
   on qgram_ops as target partition by id
   on qgram_yp as ref dimension
   targetIdColumns('id')
   targetFeatureColumn('qgram')
   targetValueColumn('weight')
   measure('Cosine')
   topK(1)
);

select b.id id, b.bizname, c.merchant, 1- a.distance as cosine_score
 from
cosine_similarity_match a
   inner join operational_biznames b on (a.target_id = b.id)
   inner join bizname_yellow_pages c on (a.ref_id = c.id)
order by id
;
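
In practice you may want to keep only matches above a confidence cutoff before using them downstream. A simple variation of the final query does that (the 0.4 threshold is purely an illustrative choice):

select *
from (
   select b.id as id, b.bizname, c.merchant, 1 - a.distance as cosine_score
   from cosine_similarity_match a
      inner join operational_biznames b on (a.target_id = b.id)
      inner join bizname_yellow_pages c on (a.ref_id = c.id)
) m
where cosine_score > 0.4
order by cosine_score desc;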

 



Previously in this series

In my last post I started exploring a concrete implementation of data science pipelines with Aster and R. We covered the programming environment, loading data into Aster, and the importance and ease of testing every step in the pipeline. In this post we'll focus on perhaps the most prevalent part of every workflow: data manipulation. To make the examples more translatable, we will embrace a dual approach, illustrating each technique with two equivalent examples: one using TeradataAsterR and another using the dplyr package. Finally, in the spirit of "test everything", each example will end by comparing the two for equivalency.

The Grammar of Data Manipulation

dplyr's function design follows a clear separation of concerns based on the split-apply-combine process and the grammar of data manipulation:

  • filter: keep rows with matching conditions;
  • select: select, drop or rename attributes by name;
  • arrange: order rows by attributes;
  • mutate: add new attributes;
  • summarise: reduce multiple attribute values to a single value;
  • group_by: group data by one or more variables to apply a function.

Graph demonstrating Split-Apply-Combine.

TeradataAsterR functions are a mix of two approaches: classic R and the split-apply-combine process embraced by dplyr. So having examples for both adds clarity about how things get done in each case.

But before jumping to the grammar of data manipulation functions, we begin with a complementary operation that combines data from two data sets into one.

Joining Tables

Aster is a relational data store (sorry if I never made this important point before), so its support for joining tables comes naturally via a fully SQL-compliant SELECT statement that includes:

  • Inner join
  • Left, right, and full outer joins
  • Semi- and anti-join (same as filtering joins in dplyr)
  • Cross join (Cartesian product)

In Aster R this translates into two functions that both perform joins, both result in a virtual data frame of query type, and both act on two virtual data frames:

  • ta.merge with R style merge parameters that will be translated to SQL
  • ta.join that offers SQL-like parameters including choice of join type

I suggest using the latter as it offers full functionality with better transparency and compatibility with actual operations taking place in Aster.

For our examples we need to combine batting stats in Batting with player attributes from Master. This requires an inner join, and for additional clarity we demonstrate how to accomplish the same operation using three methods: the R base function merge(), the dplyr function inner_join(), and the Aster R ta.join():

# R
big.df = merge(Batting, Master, by="playerID")

# dplyr
big.tbl = Batting %>%
  inner_join(Master, by="playerID")

# Aster R
big.ta = ta.join(batting.ta, master.ta, by="playerID")
ta.colnames(big.ta)[c('x.row_names','x.playerID')] = c('row_names','playerID')

Aster R required some post-processing to remove prefixes in the attribute names (ta.join() prefixes only those attributes found in both tables). Other than that, all three look boringly similar (if you ignore the syntactic sugar of magrittr pipes - yes, I had the option of not using them, but the goal was to make each example as clear as possible).

As always, we test that all three methods produced the same results (notice that we ran two comparisons, taking advantage of the transitive property of equivalency):

compareDataInRandAster(big.df, big.tbl, 
  key = c("playerID", "yearID", "stint"))

compareDataInRandAster(big.tbl, big.ta,
  key = c("playerID", "yearID", "stint"))

Note that playerID is both a primary key in Master and a foreign key in Batting, which was sufficient for the join. But the resulting data set inherited the key from Batting, which consists of three attributes: playerID, yearID, and stint.

We'll use the big.xxx data sets to illustrate the grammar of data manipulation functionality as part of the pipeline workflow (from now on we will limit the examples to dplyr and TeradataAsterR functions).

Transform

Feature engineering is one of the top reasons new attributes get introduced in data science pipelines and is often a key ingredient of successful models. For this purpose the grammar of data manipulation contains a mutate function that adds new attributes or changes existing ones: dplyr calls it mutate(), while Aster R has ta.transform().

The following example creates two attributes:

  • key containing concatenated elements of composite table key that uniquely references each row,
  • age with player's age calculated using year of birth and current year:
# dplyr
big.prepped.1.tbl = big.tbl %>%
  mutate(key = paste(playerID, yearID, stint, sep="-"),
         age = yearID - birthYear)

# Aster R
big.prepped.1.ta = ta.transform(big.ta,
    key = paste(playerID, yearID, stint, sep="-"),
    age = yearID - birthYear)

Again, the examples contain strikingly similar code. Still, underneath there are some differences to be aware of: for example, while mutate() can immediately use newly introduced variables inside the call, ta.transform() can't.

As before (and after as well), we test the results of both functions for equivalency:

compareDataInRandAster(big.prepped.1.tbl, big.prepped.1.ta,
    key = c("key"))

Note how the new attribute key, being a unique reference, was immediately put to work in the test.

Subset

Arguably the most common data manipulation operation is reducing a data set to a smaller one, which may take two forms:

  • selecting attributes (vertical subset);
  • selecting rows (horizontal subset).

While dplyr uses the functions select() and filter() respectively, Aster R combines conventional R syntax for selecting attributes with the function ta.subset() for selecting rows.

We continue our example by selecting the batting metrics identified by their names and stored in the variable batting_metrics, while removing all rows before the year 1995 (yearID >= 1995) and players without home runs (!is.na(HR), which Aster translates to an IS NOT NULL operator):

batting_metrics = c("G", "AB", "R", "H", "HR", "BB", "SO")

# dplyr
big.prepped.2.tbl = big.prepped.1.tbl %>%
  select_(quote(c(key, playerID, yearID, stint, age)),
          .dots=batting_metrics) %>%
  filter(!is.na(HR) & yearID >= 1995)

# Aster R
big.prepped.2.ta = ta.subset(big.prepped.1.ta[,c("key","playerID","yearID","stint","age",
    batting_metrics, "row_names")],
    !is.na(HR) & yearID >= 1995)

One distinct feature of dplyr is its use of non-standard evaluation (NSE, see vignette("nse") for details), which is great for translating R to SQL and for interactive use but is hard to program with. For that reason dplyr offers SE versions of its functions designated with the suffix _. With Aster R we use NSE inside ta.subset() but not with attribute selection, which uses data frame syntax.

And our standard test comparing results in dplyr and Aster R completes this example:

compareDataInRandAster(big.prepped.2.tbl, big.prepped.2.ta, key = "key")

Summarization

A data manipulation function that reduces multiple attribute values to a single value is usually referred to as summarization. With Aster R we have at least three approaches to data summarization:

  • aggregates using SQL GROUP BY, e.g. MAX, MIN, AVG, etc. with ta.summarise() or ta.summarise.each(),
  • SQL window functions available with ta.transform(),
  • and in-database R using ta.tapply() or ta.by() that offers the most flexible  option of pushing arbitrary R code to run in parallel over partitioned data inside the Aster database.

Note that summarizing doesn't necessarily imply collapsing the data, as the reduced values can be attached to the original rows (the case for window functions). What summarizing always includes is dividing the operation into two steps:

  1. a grouping of data based on the values of certain attribute(s) (GROUP BY clause in SQL or INDEX argument in tapply() and ta.tapply())
  2. a computation step that computes or reduces data with aggregation, window or arbitrary R functions within each group.

In fact, dplyr embraces this concept fully by always invoking a group_by() function first, followed by the appropriate value reduction function.

Using SQL GROUP BY

The SQL GROUP BY clause should always come to mind first. Aster R implements it using ta.summarise(), which encompasses both grouping and computation, while dplyr, as mentioned before, invokes these steps explicitly by composing group_by() and summarise():

# dplyr
summary.1.tbl = big.prepped.2.tbl %>%
     group_by(playerID, yearID, age) %>%
     summarise_each_(funs(sum), batting_metrics)

# Aster R
summary.1.ta = ta.summarise(big.prepped.2.ta,
     group.by = c("playerID", "yearID", "age"),
     G=sum(G), AB=sum(AB), R=sum(R), H=sum(H), HR=sum(HR),
     BB=sum(BB), SO=sum(SO)) 

compareDataInRandAster(summary.1.tbl, summary.1.ta,
     key = c("playerID","yearID"))

In this example we consolidated the same player's records within a single year into one record for simplicity. While it's not common for players to change teams within the same season, it does happen, and Lahman designates a record for each player's team stint, which for our purpose is unnecessary detail. So aggregates were used to consolidate stats to one record per player per year by summing up the player's stats from all teams he played for that season. Again, with dplyr we used the SE version summarise_each_() to take advantage of the existing list of all metrics in batting_metrics. With Aster R we included all aggregates explicitly in ta.summarise().

Using Window Functions

We could devote a whole blog post (or a series, or a book) to window functions. But let me instead quote three sources: one from Aster, one from PostgreSQL, and one from the dplyr vignette:

Window functions allow the calculation of aggregates and other values on a per-row basis as opposed to a per-set or per-group basis... For example, a window function may be used to compute a running sum, a moving average, a delta between values in neighboring rows, or to apply ranking and row numbering to a table.

- Teradata Aster Database User Guide

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

- PostgreSQL 9.1 Manual

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().

- Window functions vignette

Besides its per-row application, another important feature of window functions is support for ordering within groups, which allows for enumerating rows, computing lags, and calculating cumulative and other order-dependent functions. Thus, the syntax supports both grouping and ordering options depending on the specific window function.

For our example we will compute a total-seasons count plus the following five values:

  • current season number for each player with row_number();
  • current career HR value (up to current year) for each player with cummax();
  • career HR for each player with max() (we could compute same with summarise() but not without collapsing groups into single row unlike with window function when each row is preserved);
  • player's career start year with min();
  • player's career end year with max().
# dplyr
summary.2.tbl = summary.1.tbl %>% group_by(playerID) %>%
  mutate(seasonsTotal=n(),
    currentSeason=with_order(order_by = yearID, fun = row_number, x = yearID),
    currentTopHR=with_order(order_by = yearID, fun = cummax, x = HR),
    topHR=max(HR), startCareerYear = min(yearID), endCareerYear = max(yearID))

# Aster R
summary.2.ta = ta.transform(summary.1.ta,
  seasonsTotal=n(partition = playerID),
  currentSeason=row_number(order = yearID, partition = playerID),
  currentTopHR=cummax(HR, order = yearID, partition = playerID),
  topHR=max(HR, partition = playerID),
  startCareerYear = min(yearID, partition=playerID),
  endCareerYear = max(yearID, partition=playerID))

compareDataInRandAster(summary.2.tbl, summary.2.ta,
  key = c("playerID", "yearID"))

Again, let's go over the differences and similarities between the two implementations. As noted before, dplyr always divides the operation into a composition of group_by() and mutate(), where the latter invokes window and aggregate functions, inheriting the grouping and retaining each row. Aster R uses ta.transform() and applies window and aggregate functions inside the call.

Using In-Database R

At last we arrive at the most powerful option available with Aster R: in-database R execution. Looking at the previous examples, one may think that R programming really played a limited role, as there was not much R code. Indeed, our Aster R constructs closely resembled SQL or the grammar of data manipulation, with both SQL GROUP BY and window functions inside the grouping and aggregation (data reduction) steps. While the same constructs preside over the in-database R option, the logic transitions to an R-centric paradigm based on the R tapply() function and the functional features of R.

With dplyr not much changes, as its functionality naturally supports functional R and doesn't leave the client R environment (at least without involving extensions). On the other hand, Aster R provides the function ta.tapply(), which does two important things:

  1. it expands the reach of tapply() onto Aster tables by enabling parallel and distributed execution within the Aster database;
  2. and it transports R code for execution from the Aster R client to the Aster R sandbox environments that are configured and available with a proper installation on the Aster database.

The availability and version of R on the Aster database can easily be checked with these commands:

> ta.is.R.installed()
[1] TRUE
> ta.R.Version()

Please consult with Teradata Aster R documentation for more details on how to configure and test R in-database environment.

For our example we want to answer the following question: for each player, find his age when he had his career HR season (i.e. the season when he hit the most HRs in his career). For simplicity we'll use the topHR attribute from the previous example when calculating the player's age in his top HR season:

# dplyr
summary.3.tbl = summary.2.tbl %>%
  group_by(playerID) %>%
  summarize(
    seasons_total = seasonsTotal[[1]],
    max_hr = topHR[[1]],
    top_hr_age = age[HR == max_hr][[1]])

# Aster R
summary.3.ta = ta.tapply(summary.2.ta, INDEX=summary.2.ta$playerID,
  FUN=function(x){c(x$seasonsTotal[[1]], # seasons_total
                    x$topHR[[1]],        # max_hr
                    x$age[x$HR == x$topHR[[1]]][[1]])}, # top_hr_age
  out.tadf = list(columns=c("playerID", "seasons_total",
                            "max_hr", "top_hr_age")))

compareDataInRandAster(summary.3.tbl, summary.3.ta, key = "playerID")

Because the Aster R version does things almost letter for letter like dplyr, it is advisable to review the latter first. Our goal is to calculate player stats, so summarization will reduce the number of rows for each player to one; hence, we use the group_by() with summarize() construct. Next, seasons_total and max_hr have the same values within each player's group, so we always pick the very first value of each. Things become R-stylish in the line that assigns a value to top_hr_age: given the vector (array) age, we pick the element where the corresponding HR value equals the career maximum (i.e. equals max_hr). With Aster R we repeat the same logic but within a tapply-style function call: each player's rows are designated by the variable x in the anonymous function assigned to the argument FUN. Lastly, the argument out.tadf controls how Aster saves the results in the database: in this case we explicitly rename the columns using the argument columns.

What really happened when we called ta.tapply() is radically different from all the previous examples: the anonymous function assigned to FUN was serialized and sent over the network to the Aster worker nodes running R sandboxes, where it was deserialized and loaded for execution. After that (you may think of that as just the prep), Aster read the specification of how to run it from two arguments: the virtual data frame (the first argument) referred to the underlying table, and INDEX identified the column whose values partition the data. At this point Aster had all the ingredients to execute the anonymous R function in-database, on distributed data, and in parallel. It saved the results into a new table and returned control to the Aster R client that was waiting on ta.tapply(), which in turn returned a virtual data frame with the results. Lastly, see compareDataInRandAster(), which tests that dplyr and Aster R produced the same results.

Note that we could pass any R function (including references to third-party packages, provided they are installed on Aster as well) with FUN. This method elevates Aster R to both a framework and a platform for executing R code in parallel across distributed data in a manner consistent with the R tapply() function. With proper configuration and algorithm design, running parallel R should become a standard part of the daily data scientist routine with Aster.

In the next post we shift gears to show other elements of the data science workflow, such as exploratory analysis, principal component analysis (PCA), and predictive modeling.

Image sources:

Red Hat Developer Program

The Programming Historian


Data Science Informative Articles and Blog Posts

Our blogs allow customers, prospects, partners, third-party influencers, and analysts to share thoughts on a range of product and industry topics. We also have lots of content in the community, allowing you to gain valuable insights from Teradata data scientists.