Learn Data Science


Explore and Discover

The latest articles, videos and blog posts speak to those interested in analytics-based decision making. Browse all the content in the Teradata Data Science Community to gain valuable insights.


The blogosphere is full of sound bites, anecdotes, and clichés on Machine Learning/AI, and it's important to discern their place in the larger picture of Advanced Analytics/Data Science, especially for those in business and IT who are estranged from the 'religious experience' of using the different tools available.

What is Advanced Analytics?


Machine Learning and AI methods like Deep Learning fall into a larger 'analytic' library of things you can do with your data, alongside access methods such as SQL and languages like Python and R. The best analogy is a bar or kitchen stocked with the most exotic ingredients. As a bartender or chef, you get to make the best cocktails or entrées, drawing from an assemblage of options. You may have your own favorite, but it's also important to fine-tune to the actual need someone brings you. Here's the thing: you may have the most expensive single malt or ingredient in your bar or kitchen, and that doesn't mean everyone will want it! So variety is the key to delivering precisely what the business or end user wants.

Some Examples of Advanced Analytics Choices:

I need to find the top 10 key phrases in product or show reviews that tell me about characters and powerful emotions. There are several options:

  1. Start NGRAMing with 1, 2, and 3 terms at a time. Weight the keywords with TF/IDF and show the top 10 1-, 2-, and 3-grams with the highest TF/IDF values. We can do this with some clever SQL (see the sketch right after this list) :)
  2. Run a CRF model trained on a big corpus from a POS (parts of speech) output, weight it and sort it. You get the benefit of interesting verb and noun phrases (intelligent)
  3. Run a Word2Vec (Deep Learning/GPU) pass on the data and try to construct a neural embedding model to discover the phrases.
  4. <Add your own recipe/cocktail>
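
For option 1, here is a minimal sketch of the TF/IDF weighting step in plain SQL. It assumes a hypothetical table review_ngrams(review_id, ngram) that already holds one row per 1-, 2- or 3-gram occurrence extracted from each review; the table and column names are illustrative, not part of any shipped schema.

select tf.ngram,
       sum(tf.term_freq * ln(d.total_docs::float / df.doc_freq)) as tfidf_score
from (select review_id, ngram, count(*) as term_freq
      from review_ngrams group by 1, 2) tf
join (select ngram, count(distinct review_id) as doc_freq
      from review_ngrams group by 1) df
  on tf.ngram = df.ngram
cross join (select count(distinct review_id) as total_docs
            from review_ngrams) d
group by tf.ngram
order by tfidf_score desc
limit 10;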

I want to group all my call center text into clusters, so I can see what they are talking about:

  1. Do term weighting with a distance metric like Euclidean or Cosine, run Graph Modularity to carve out clusters, and run phrase detection on each of those. Use a percentile technique to decide the number of significant clusters.
  2. Run an LSI (Latent Semantic Indexing) dimension reduction and run K-Means. Decide the number of clusters after finding the "elbow of significance".
  3. Run an LDA model and specify the number of topics. Iterate on the number of topic clusters until it makes sense.
  4. <add your own ingredients/mixology>

With some cleverness, the iterative and grouping techniques above avoid mainstream Machine Learning completely and get to 80% of the answer with simple sophistication. Of course, using advanced ML techniques will increasingly get us to 95% of the answer, especially when it comes to Fraud and other mission-critical "fine-grained" use cases. Let's put that aside for a moment.

For a lot of simple use cases or basic hypothesis testing, an 80% answer may be directionally "good enough" for the business, and that's OK! The key is to have many options, all the way from simple to complex, with well-understood tradeoffs such as performance and tuning complexity.

How to get everything in one place? See Sri's blog:

Does your advanced analytics platform has what it takes to access the same data and combine all of t... - Check out Sri Raghavan's Blog

Here are some things users can do with the above:

  • Come up with creative solutions that use the best of breed. Use SQL for what it's good at and run R in-database to do scoring or modeling using, say, ARIMA.
  • Or run LSTM deep learning for time series forecasting on TensorFlow while using SQL to curate and organize the data in the same transaction.
  • Run large-scale PCA using SQL on Aster and do a logistic regression in a Spark cluster packaged in the platform. Put the results in a table for Tableau users to see the significant variables in a dashboard.
  • Run Aster's XGBoost on a churn analytic data set to create a model and score users on their propensity to churn. Then run Aster's Hidden Markov Model (Graph implementation) on the data set to find the latent transition matrix and emission probabilities.

Hope you enjoyed this blog post on the postmodern definition and sample usage of Advanced Analytics ;). More to come in my future blog posts on the science and art of applying advanced analytics to business problems.


Thanks to everyone who attended my presentation at the Teradata Partners Innovation Forum. Here's the video of the deck I presented. I will post some code here soon for you to play with ...

Video Link : 1099


The Teradata Certified Professional Program is excited to announce the release of the first Teradata Aster Certification Exam – Teradata Aster Basics 6.10. Passing this exam will earn you the Teradata Aster Certified Professional Credential. Start preparing today and you’ll be one of the first to get Teradata Aster Certified. Click here for details, exam objectives and recommended training courses.


 

Introducing the New Teradata Aster Basics Certification Study Guide

 

The Teradata Aster Basics Certification Study Guide has been released!  Through simplified examples and explanations, this new guide helps certification candidates prepare for the Teradata Aster Basics 6.10 exam and achieve the Teradata Aster Certified Professional Certification.  The Teradata Certification Study Guides are the only Teradata-authorized study publications. This guide is designed to complement Teradata training and deepen your knowledge of the defined certification objectives. All Certification Study Guides can be purchased at Amazon.com.

 

Pursue Teradata Certification with Confidence


Aster's nPath function is a powerful tool for uncovering patterns in sequential data.  There are many examples here on the Aster Community for writing those queries and visualizing the output.  But that's not all Aster has to offer for path analysis.  There are more steps that you can take to create even more value.  I'm going to show you a few easy queries that will take the output from nPath and start to build out a model that may be used to predict the next step on a path.

I'll use the demo data set that ships with Aster Express, bank_web_clicks, though really any output from your own nPath analysis will work just as well.  Here's a simple Tree diagram that I've created from this data, showing a handful of paths that customers have taken on this web site.

These Tree diagrams are incredibly useful for visualizing paths and helping to uncover patterns for deeper analysis.  Additionally, having an underlying table with metrics around parent nodes, child nodes, counts and percentages could begin to form the foundation for a model to predict likely outcomes for other customers on similar journeys, allowing you to answer such questions as "what's the next likely step my customer will take?".  In a few steps, I'll show how to build out a table with rows like this, showing parent paths and next steps with calculated probabilities:

Parent | Next | Probability
ACCOUNT SUMMARY, BILL MANAGER FORM | BILL MANAGER ENROLLMENT | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT | FAQ | .59
ACCOUNT SUMMARY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | .54
ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT | ACCOUNT HISTORY | .42

Introducing the Aster path_analyzer function.  This takes the output from your nPath query, with 'group by' counts for each path, and then breaks out all the paths and counts from each parent node, working its way down the full depth of the tree.  Getting the "top paths and counts" from nPath is a pretty standard part of our normal nPath workflow and done easily with a simple query.  Let's build a new table, called 'top_bank_paths'.  Here I'll limit this to the top 1000 paths.  (Since these tables are small, I'm using 'replication' as my distribution).

create table top_bank_paths
   distribute by replication
as
select path,count(*) as cnt
from bank_paths
group by 1
order by 2 desc
limit 1000
;

With results:

path | cnt
[ACCOUNT SUMMARY, FAQ] | 2854
[ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS] | 2830
[ACCOUNT SUMMARY, FUNDS TRANSFER] | 2746
[ACCOUNT SUMMARY, ACCOUNT SUMMARY] | 2715

We will now use these as input to path_analyzer:

SELECT * FROM Path_Analyzer
(
    ON (SELECT 1)
    PARTITION BY 1
    InputTable ('top_bank_paths')
    OutputTable ('bank_path_analyzer_output')
    --SeqColumn ('path')  Aster v7.0
    --CountColumn ('cnt') Aster v7.0
    Seq('path')  --Aster v6.20
    Cnt('cnt')   --Aster v6.20
);

The syntax did change slightly for the Seq and Cnt parameters with Aster Analytics 7.0, so I've included both above. Simply comment/uncomment based on your library version.  Also, if you want to rerun path_analyzer, you'll need to first drop your output table.  In this example, that would be:

     drop table bank_path_analyzer_output;

The output is a bit verbose, so here are the important fields:

parent | children | subpath_cnt | depth
^,[ACCOUNT SUMMARY | [(^,[ACCOUNT SUMMARY, BILL MANAGER, ACCOUNT HISTORY]), (^,[ACCOUNT SUMMARY, BILL MANAGER, ACCOUNT SUMMARY]), (^,[ACCOUNT SUMMARY, BILL MANAGER, FAQ]), (^,[ACCOUNT SUMMARY, BILL MANAGER, FUNDS TRANSFER]), (^,[ACCOUNT SUMMARY, BILL MANAGER, VIEW DEPOSIT DETAILS])] | 70 | 2

 

With this 'not too complex' SQL, we can now build a table for each parent node with corresponding 'next child node' and percentage calculation.  I'm doing a little string manipulation; skipping those first 4 characters in the parent format and likewise in the 'next_node' field, and trimming the trailing ']' character, which represents 'end of path'.  I'm also using a window function, row_number(),  to rank each child by percentage so that we can later filter by top 1 or 2 child nodes:

create table top_bank_next_path
    distribute by replication
as
select substring(N.parent from 4) as parent
     ,P.depth
     ,replace(substring(N.node from (3 + length(N.parent))),']','') as next_node
     ,P.subpath_cnt as parent_cnt
     ,N.subpath_cnt as next_cnt
     ,(1.0 * N.subpath_cnt / P.subpath_cnt)::numeric(3,2) as probability
     ,row_number() over (partition by N.parent order by N.subpath_cnt desc) as rank
from bank_path_analyzer_output N
    ,bank_path_analyzer_output P
where P.depth > 0
and   N.parent = P.node
;

parent | depth | next_node | parent_cnt | next_cnt | probability | rank
ACCOUNT SUMMARY, ACCOUNT SUMMARY, FAQ, ONLINE STATEMENT ENROLLMENT | 4 | VIEW DEPOSIT DETAILS | 11 | 11 | 1 | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | 4 | ACCOUNT HISTORY | 32 | 18 | 0.56 | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | 4 | ACCOUNT SUMMARY | 32 | 14 | 0.44 | 2
ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS, FUNDS TRANSFER, ACCOUNT HISTORY | 4 | FUNDS TRANSFER | 48 | 19 | 0.4 | 1
ACCOUNT SUMMARY, FAQ, FAQ, FAQ | 4 | ACCOUNT SUMMARY | 58 | 22 | 0.38 | 1

And finally, drum roll, here are some answers to the question "what is my customer likely to do next based on where they are in their journey?". In this next query, I'm using some parameters like 'probability greater than 30%', 'more than 20 similar paths have occurred' and, perhaps most importantly, 'where rank equals 1', which means it's the most likely next step for each parent path.

select parent,next_node
     ,parent_cnt,probability
from demo.top_bank_next_path
where rank = 1
and probability > .3
and parent_cnt > 20
order by probability desc
limit 10
;

parent | next_node | parent_cnt | probability
ACCOUNT SUMMARY, BILL MANAGER FORM | BILL MANAGER ENROLLMENT | 33 | 1.0
ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT | FAQ | 29 | 0.59
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | ACCOUNT HISTORY | 32 | 0.56
ACCOUNT SUMMARY, ACCOUNT SUMMARY, ONLINE STATEMENT ENROLLMENT | FUNDS TRANSFER | 27 | 0.56
ACCOUNT SUMMARY, CUSTOMER SUPPORT, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | 27 | 0.56
ACCOUNT SUMMARY, PROFILE UPDATE, CUSTOMER SUPPORT | ACCOUNT HISTORY | 27 | 0.56
ACCOUNT SUMMARY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | 24 | 0.54
ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT | ACCOUNT HISTORY | 43 | 0.42
ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS, FUNDS TRANSFER, ACCOUNT HISTORY | FUNDS TRANSFER | 48 | 0.4
ACCOUNT SUMMARY, PROFILE UPDATE, FUNDS TRANSFER | FUNDS TRANSFER | 64 | 0.39

There it is: a pretty straightforward, and I think very powerful, new step to add to all your Aster nPath workflows, creating a very useful new table that answers many questions from your pathing data.

Enjoy!


Some of you at Partners 2016 wanted to play with the QGRAM function for text matching beyond the IdentityMatch() function. Here I explain how to invoke it so you can add it to your environment and try it out.

Algorithm Background

Read more about QGRAMs, Cosine Similarity etc., in my previous blog post to understand different text similarity techniques.

https://aster-community.teradata.com/community/learn-aster/blog/2015/06/21/data-science--fuzzy-match...

In this blog post, I only explain a simple use case of matching short text, especially merchant names in transactions.

Overview

If a business has transaction data containing merchant names, it's a common problem to 'disambiguate' those names and map them to a yellow-page/registered business name. If you have millions of business names that mutate and morph over time in the data, this is next to impossible with traditional techniques such as string matching, regular expressions, or SQL LIKEs. This post explains how we solved it using a technique called 'Weighted QGRAMs with Cosine Similarity'. The QGRAM SQL/MR function was created by Aster PS (Professional Services) and is combined with the 6.10 analytic function VectorDistance(), which computes the cosine similarity metric. The approach uses a probabilistic technique to break the merchant names down into q-grams (letter ngrams), weights them based on their position in the string, and then applies Cosine Similarity to find the best match.

 

SQL/MR Functions Used:

QGRAM() must be installed in your cluster schema.  Download the SQL MR: QGRAM and install it in your schema, or ask your DBA to install it in your cluster using:

\install qgram.zip

from act. 

VectorDistance() is available in the later Analytic foundation libraries and should be used in conjunction with QGRAM.

Inputs:

Yellow Page Dictionary (the standardized names you want to roll up to):

id | merchant
1 | Ace Mortgage Inc
2 | Ace Credit Card Inc
3 | Ace Credit Bank
4 | Ceasar Frozen Yogurt LLC
5 | 8/12 Convienence Store
6 | Teradata Corp
7 | Walmart Super Market
8 | Apple
9 | Wholefoods
10 | San Mateo City

Merchant names that need to be 'disambiguated' or fuzzy-joined with the above table:

id | bizname
1 | Ace MG
2 | ACC
3 | ACB
4 | Ceasar Frozen Yogurt
5 | 8/12 Store
6 | Teradata Corporation
7 | Walmart Super M
8 | apple
9 | Whole Foods Market
10 | San Mateo City

Final Output (after the Weighted QGRAM/Cosine Similarity workflow in the next section):

bizname | merchant | cos_score (fuzzy score/confidence)
Ace MG | Ace Mortgage Inc | 0.435152
ACC | Ace Credit Card Inc | 0.0549411
ACB | Ace Credit Bank | 0.0149411
Ceasar Frozen Yogurt | Ceasar Frozen Yogurt LLC | 0.960521
8/12 Store | 8/12 Convienence Store | 0.566962
Teradata Corporation | Teradata Corp | 0.844375
Walmart Super M | Walmart Super Market | 0.890216
apple | Apple | 0.757017
Whole Foods Market | Wholefoods | 0.545401
San Mateo City | San Mateo City | 1.0

SQL, SQL/MR Workflow code to try in your Teradata Studio:

drop table if exists bizname_yellow_pages;
create table bizname_yellow_pages
(
 id bigint,
 merchant varchar
) distribute by hash(id);

insert into bizname_yellow_pages values (1,'Ace Mortgage Inc');
insert into bizname_yellow_pages values (2,'Ace Credit Card Inc');
insert into bizname_yellow_pages values (3,'Ace Bank Inc');
insert into bizname_yellow_pages values (4,'Ceasar Frozen Yogurt LLC');
insert into bizname_yellow_pages values (5,'8/12 Convienence Store');
insert into bizname_yellow_pages values (6,'Teradata Corp');
insert into bizname_yellow_pages values (7,'Walmart Super Market');
insert into bizname_yellow_pages values (8,'Apple');
insert into bizname_yellow_pages values (9,'Wholefoods');
insert into bizname_yellow_pages values (10,'San Mateo City');

drop table if exists operational_biznames;
create table operational_biznames
(
 id bigint,
 bizname varchar
) distribute by hash(id);

insert into operational_biznames values(1,'Ace MG');
insert into operational_biznames values(2,'ACC');
insert into operational_biznames values(3,'ACB');
insert into operational_biznames values(4,'Ceasar Frozen Yogurt');
insert into operational_biznames values(5,'8/12 Store');
insert into operational_biznames values(6,'Teradata Corporation');
insert into operational_biznames values(7,'Walmart Super M');
insert into operational_biznames values(8,'apple');
insert into operational_biznames values(9,'Whole Foods Market');
insert into operational_biznames values(10,'San Mateo City');

drop table if exists qgram_yp;
create table qgram_yp distribute by hash(qgram)
as
select id, qgram, weight
   from qgram(
       on bizname_yellow_pages
       decay(0.95)
       gram(3)
);

drop table if exists qgram_ops;
create table qgram_ops distribute by hash(qgram)
as
select id, qgram, weight
 from qgram(
     on operational_biznames
     decay(0.95)
     gram(3)
);

drop table if exists cosine_similarity_match;
create table cosine_similarity_match distribute by hash(target_id)
as
select * from VectorDistance(
   on qgram_ops as target partition by id
   on qgram_yp as ref dimension
   targetIdColumns('id')
   targetFeatureColumn('qgram')
   targetValueColumn('weight')
   measure('Cosine')
   topK(1)
);

select b.id id, b.bizname, c.merchant, 1- a.distance as cosine_score
 from
cosine_similarity_match a
   inner join operational_biznames b on (a.target_id = b.id)
   inner join bizname_yellow_pages c on (a.ref_id = c.id)
order by id
;

 



Previously in this series

In my last post I started exploring a concrete implementation of data science pipelines with Aster and R. We covered the programming environment, loading data into Aster, and the importance and ease of testing every step in the pipeline. In this post we'll focus on perhaps the most prevalent part of every workflow: data manipulation. To make the examples more translatable we will embrace a dual approach, illustrating each technique with two equivalent examples: one using TeradataAsterR and another using the dplyr package. Finally, in the spirit of "test everything", each example will end by comparing the two results for equivalency.

The Grammar of Data Manipulation

dplyr's function design follows a clear separation of concerns based on the split-apply-combine process and the grammar of data manipulation:

  • filter: keep rows with matching conditions;
  • select: select, drop or rename attributes by name;
  • arrange: order rows by attributes;
  • mutate: add new attributes;
  • summarise: reduce multiple attribute values to a single value;
  • group_by: group data by one or more variables to apply a function.

Graph demonstrating Split-Apply-Combine.

TeradataAsterR functions are a mix of two approaches: classic R and the split-apply-combine process embraced by dplyr. So having examples for both will add clarity and understanding of how things get done in each case.

But before jumping to the grammar of data manipulation functions we begin with a complementary operation that combines data from two data sets into one.

Joining Tables

Aster is a relational data store (sorry if I never made this important point before), so its support for joining tables comes naturally with a fully SQL-compliant SELECT statement that includes:

  • Inner join
  • Left, right, and full outer joins
  • Semi- and anti-join (same as filtering joins in dplyr)
  • Cross join (Cartesian product)

In Aster R this translates into two functions, both of which act on two virtual data frames and perform joins resulting in a virtual data frame of query type:

  • ta.merge with R style merge parameters that will be translated to SQL
  • ta.join that offers SQL-like parameters including choice of join type

I suggest using the latter as it offers full functionality with better transparency and compatibility with actual operations taking place in Aster.

For our examples we need to combine the batting stats in Batting with player attributes from Master. This requires an inner join, and for additional clarity we demonstrate how to accomplish this same operation using three methods: the R base function merge(), the dplyr function inner_join(), and Aster R's ta.join():

# R
big.df = merge(Batting, Master, by="playerID")

# dplyr
big.tbl = Batting %>%
  inner_join(Master, by="playerID")

# Aster R
big.ta = ta.join(batting.ta, master.ta, by="playerID")
ta.colnames(big.ta)[c('x.row_names','x.playerID')] = c('row_names','playerID')

Aster R required some post-processing to remove prefixes from the attribute names (ta.join() prefixes only those attributes found in both tables). Other than that, all three look boringly similar (if you ignore the syntactic sugar of magrittr pipes; yes, I had the option of not using them, but the goal was to have each example as clear as possible).

As always, we test that all three methods produced the same results (notice that we run two comparisons, taking advantage of the transitive property of equivalency):

compareDataInRandAster(big.df, big.tbl, 
  key = c("playerID", "yearID", "stint"))

compareDataInRandAster(big.tbl, big.ta,
  key = c("playerID", "yearID", "stint"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Note that playerID is both a primary key in Master and a foreign key in Batting, which was sufficient for the join. But the resulting data set inherited the key from Batting, which consists of three attributes: playerID, yearID, and stint.

We'll use big.xxx data sets to illustrate the grammar of data manipulation functionality as part of the pipeline workflow (we will limit examples to dplyr and TeradataAsterR functions from now on).

Transform

Feature engineering is one of the top reasons new attributes get introduced into data science pipelines and is often a key ingredient of successful models. For this purpose the grammar of data manipulation contains a mutate function that adds new attributes or changes existing ones: dplyr calls it mutate(), while Aster R has ta.transform().

The following example creates two attributes:

  • key containing concatenated elements of composite table key that uniquely references each row,
  • age with player's age calculated using year of birth and current year:
# dplyr
big.prepped.1.tbl = big.tbl %>%
  mutate(key = paste(playerID, yearID, stint, sep="-"),
         age = yearID - birthYear)

# Aster R
big.prepped.1.ta = ta.transform(big.ta,
    key = paste(playerID, yearID, stint, sep="-"),
    age = yearID - birthYear)

Again, the examples contain strikingly similar code. Still, underneath there are some differences to be aware of. For example, while mutate() can immediately use newly introduced variables inside the same call, ta.transform() can't.

As before (and after as well) we test results of both functions for equivalency:

compareDataInRandAster(big.prepped.1.tbl, big.prepped.1.ta,
    key = c("key"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Note how the new attribute key, being a unique reference, was immediately put to work in the test.

Subset

Arguably the most common data manipulation operation is reducing a data set to a smaller one, which may take two forms:

  • selecting attributes (vertical subset);
  • selecting rows (horizontal subset).

While dplyr uses the functions select() and filter() respectively, Aster R combines conventional R indexing to select attributes with the function ta.subset() to select rows.

We continue our example by selecting the batting metrics identified by their names and stored in the variable batting_metrics, while removing all rows before 1995 (yearID >= 1995) and players without home runs (!is.na(HR), which Aster translates to the IS NOT NULL operator):

batting_metrics = c("G", "AB", "R", "H", "HR", "BB", "SO")

# dplyr
big.prepped.2.tbl = big.prepped.1.tbl %>%
  select_(quote(c(key, playerID, yearID, stint, age)),
          .dots=batting_metrics) %>%
  filter(!is.na(HR) & yearID >= 1995)

# Aster R
big.prepped.2.ta = ta.subset(big.prepped.1.ta[,c("key","playerID","yearID","stint","age",
    batting_metrics, "row_names")],
    !is.na(HR) & yearID >= 1995)

One distinct feature of dplyr is its use of non-standard evaluation (NSE, see vignette("nse") for details), which is great for translating R to SQL and for interactive use but is hard to program with. For that reason dplyr offers SE versions of its functions, designated with the suffix _. With Aster R we use NSE inside ta.subset() but not with attribute selection, which uses data frame syntax.

And our standard test comparing results in dplyr and Aster R completes this example:

compareDataInRandAster(big.prepped.2.tbl, big.prepped.2.ta, key = "key")

Summarization

A data manipulation function that reduces multiple attribute values to a single value is usually referred to as summarization. With Aster R we have at least three approaches to data summarization:

  • aggregates using SQL GROUP BY, e.g. MAX, MIN, AVG, etc. with ta.summarise() or ta.summarise.each(),
  • SQL window functions available with ta.transform(),
  • and in-database R using ta.tapply() or ta.by() that offers the most flexible  option of pushing arbitrary R code to run in parallel over partitioned data inside the Aster database.

Note that summarizing doesn't necessarily imply collapsing the data, as the reduced values can be attached to the original rows (the case for window functions). What summarizing always involves is dividing the operation into two steps:

  1. a grouping of data based on the values of certain attribute(s) (GROUP BY clause in SQL or INDEX argument in tapply() and ta.tapply())
  2. a computation step that computes or reduces data with aggregation, window or arbitrary R functions within each group.

In fact, dplyr embraces this concept fully by always invoking a group_by() function first followed by appropriate value reduction function.

Using SQL GROUP BY

The SQL GROUP BY clause should always come to mind first. Aster R implements it using ta.summarise(), which encompasses both grouping and computation, while dplyr, as mentioned before, invokes these steps explicitly by composing group_by() and summarise():

# dplyr
summary.1.tbl = big.prepped.2.tbl %>%
     group_by(playerID, yearID, age) %>%
     summarise_each_(funs(sum), batting_metrics)

# Aster R
summary.1.ta = ta.summarise(big.prepped.2.ta,
     group.by = c("playerID", "yearID", "age"),
     G=sum(G), AB=sum(AB), R=sum(R), H=sum(H), HR=sum(HR),
     BB=sum(BB), SO=sum(SO)) 

compareDataInRandAster(summary.1.tbl, summary.1.ta,
     key = c("playerID","yearID"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

In this example we consolidated each player's records within a single year into one record for simplicity. While it's not common for players to change teams within the same season, it does happen, and Lahman designates a record for each player's team stint, which for our purpose is unnecessary detail. So aggregates were used to consolidate stats to one row per player per year by summing the player's stats from all the teams he played for that season. Again, with dplyr we used the SE version summarise_each_() to take advantage of the existing list of metrics in batting_metrics. With Aster R we included all aggregates explicitly in ta.summarise().

Using Window Functions

We could devote a whole blog post (or a series, or a book, easily) to window functions. But let me instead quote three sources - one from Aster, one from PostgreSQL, and lastly from dplyr vignette:

Window functions allow the calculation of aggregates and other values on a per-row basis as opposed to a per-set or per-group basis... For example, a window function may be used to compute a running sum, a moving average, a delta between values in neighboring rows, or to apply ranking and row numbering to a table.

- Teradata Aster Database User Guide

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

- PostgreSQL 9.1 Manual

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().

- Window functions vignette

Besides its per-row application, another important feature of window functions is support for order within groups, which allows for enumerating rows, computing lags, cumulative values, and other order-dependent functions. Thus the syntax supports both grouping and ordering options, depending on the specific window function.

For our example we will compute the following five values:

  • current season number for each player with row_number();
  • current career HR value (up to current year) for each player with cummax();
  • career HR for each player with max() (we could compute same with summarise() but not without collapsing groups into single row unlike with window function when each row is preserved);
  • player's career start year with min();
  • player's career end year with max().
# dplyr
summary.2.tbl = summary.1.tbl %>% group_by(playerID) %>%
  mutate(seasonsTotal=n(),
    currentSeason=with_order(order_by = yearID, fun = row_number, x = yearID),
    currentTopHR=with_order(order_by = yearID, fun = cummax, x = HR),
    topHR=max(HR), startCareerYear = min(yearID), endCareerYear = max(yearID))

# Aster R
summary.2.ta = ta.transform(summary.1.ta,
  seasonsTotal=n(partition = playerID),
  currentSeason=row_number(order = yearID, partition = playerID),
  currentTopHR=cummax(HR, order = yearID, partition = playerID),
  topHR=max(HR, partition = playerID),
  startCareerYear = min(yearID, partition=playerID),
  endCareerYear = max(yearID, partition=playerID))

compareDataInRandAster(summary.2.tbl, summary.2.ta,
  key = c("playerID", "yearID"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Again, let's go over the differences and similarities between the two implementations. As noted before, dplyr always divides the operation into a composition of group_by() and mutate(), where the latter invokes window and aggregate functions, inheriting the grouping and retaining each row. Aster R uses ta.transform() and applies window and aggregate functions inside the call.

Using In-Database R

At last we arrive at the most powerful option available with Aster R: in-database R execution. Looking at the previous examples, one may think that R programming played a limited role, as there was not much R code. Indeed, our Aster R constructs closely resembled SQL or the grammar of data manipulation, with both SQL GROUP BY and window functions inside the grouping and aggregation (data reduction) steps. While the same constructs preside over the in-database R option, the logic transitions to an R-centric paradigm based on the R tapply() function and R's functional features.

With dplyr not much changes, as its functionality naturally supports functional R and doesn't leave the client R environment (at least without involving extensions). On the other hand, Aster R provides the function ta.tapply(), which does two important things:

  1. it expands the reach of tapply() onto Aster tables by enabling parallel and distributed execution within the Aster database;
  2. and it transports R code for execution from the Aster R client to the Aster R sandbox environments configured and available with a proper installation on the Aster database.

Availability and version of R on the Aster database could be easily checked with commands:

> ta.is.R.installed()
[1] TRUE
> ta.R.Version()

Please consult the Teradata Aster R documentation for more details on how to configure and test the in-database R environment.

For our example we want to answer the following question: for each player, find his age in his career HR season (i.e., the season when the player hit the most HRs of his career). For simplicity we'll use the topHR attribute from the previous example when calculating the player's age in the season he hit the most HRs:

# dplyr
summary.3.tbl = summary.2.tbl %>%
  group_by(playerID) %>%
  summarize(
    seasons_total = seasonsTotal[[1]],
    max_hr = topHR[[1]],
    top_hr_age = age[HR == max_hr][[1]])

# Aster R
summary.3.ta = ta.tapply(summary.2.ta, INDEX=summary.2.ta$playerID,
  FUN=function(x){c(x$seasonsTotal[[1]], # seasons_total
                    x$topHR[[1]],        # max_hr
                    x$age[x$HR == x$topHR[[1]]][[1]])}, # top_hr_age
  out.tadf = list(columns=c("playerID", "seasons_total",
                            "max_hr", "top_hr_age")))

compareDataInRandAster(summary.3.tbl, summary.3.ta, key = "playerID")

Because the Aster R version does things almost letter for letter like dplyr, it is advisable to review the latter first. Our goal is to calculate per-player stats, so summarization will reduce the number of rows for each player to 1; hence we use the group_by() with summarize() construct. Next, seasons_total and max_hr have the same values within each player's group, so we always pick the very first value of each. Things become R-stylish in line 7, which assigns the value of top_hr_age: given the vector age, we pick the element whose corresponding HR value equals the career maximum (i.e., equals max_hr). With Aster R we repeat the same logic but within a tapply-style function call: each player's rows are designated by the variable x in the anonymous function (line 11) assigned to the argument FUN. Lastly, the argument out.tadf (line 14) controls how Aster saves results in the database: in this case we explicitly rename the columns using the argument columns.

What really happens when we call ta.tapply() is radically different from all the previous examples: the anonymous function assigned to FUN is serialized and sent over the network to the Aster worker nodes running R sandboxes, where it is deserialized and loaded for execution. After that (you may think that was just the prep), Aster reads the specification of how to run it from two arguments: the virtual data frame (1st argument) refers to the underlying table, and INDEX identifies the column whose values partition the data (line 10). At this point Aster has all the ingredients to execute the anonymous R function in-database, on distributed data and in parallel. It saves the results into a new table and returns control to Aster R, which was waiting on ta.tapply(); once execution in Aster completes, it returns a virtual data frame with the results. Lastly, see compareDataInRandAster(), which tests that dplyr and Aster R produced the same results.

Note that we could pass any R function (including references to 3rd-party packages, given that they are installed on Aster as well) with FUN. This method elevates Aster R to both a framework and a platform for executing R code in parallel across distributed data, in a manner consistent with the R tapply() function. With proper configuration and algorithm design, running parallel R should become a standard part of the daily data scientist routine with Aster.

In the next post we shift gears to show elements of the data science workflow such as exploratory analysis, principal component analysis (PCA) and predictive modeling.

Image sources:

Red Hat Developer Program

The Programming Historian



Previously in this series

In my first post I explained the driving force (value), defined the goals (a repeatable, consistent, testable, deployable, robust, scalable and functional process) and outlined the structure (workflow) behind data science pipelines. In this post we'll begin examining data science pipelines using concrete examples with Teradata Aster by building a workflow in R. Here I focus on connecting to the data store and analytical engine (Aster), loading data, and test-driven development of workflow steps to achieve production-grade code.

Teradata Aster R Philosophy

Before we dive into the details let me clarify the general approach underlying R development with Aster R. Any aspiring data scientist practicing R can immediately take advantage of Aster's powerful data store with RODBC and a client-server style of programming. Having said that, we should realize that such an approach would require:

  1. learning and practicing SQL with its Aster extensions like SQL/MR and SQL/GR;
  2. and/or moving data from Aster to R memory to utilize R functions and programming

With Aster R, Teradata aims to overcome these by introducing R functions that look and behave like familiar R functions but act on and execute inside the Aster data store by translating each call to native SQL and its extensions. Such functions are designated with the prefixes ta. and aa. by Teradata. Teradata went even further by expanding R's apply functionality with functions like ta.apply() and ta.tapply() that push and execute R code inside Aster nodes, taking full advantage of its capabilities but demanding additional configuration and advanced skills.

Teradata Aster R Environment

R programming with Aster is made possible by the TeradataAsterR package which, unlike other R packages available from CRAN, is downloaded directly from Teradata here. Hardly surprising, it requires the ODBC driver for Teradata Aster to be installed (also available from Teradata here as part of the Client Tools for your platform). For the record, installing TeradataAsterR will also install its dependency, the RODBC package, which maintains the data access layer with the Aster database. The following diagram illustrates both the data science pipeline execution environment and the Teradata Aster database:

Assuming we have already configured an ODBC data source Aster_ODBC_dsn that points to an Aster instance with the database my_Aster_db, initializing and connecting to Aster takes two lines of code:

library(TeradataAsterR)

ta.connect(dsn="Aster_ODBC_dsn", database="my_Aster_db")

From this point on, TeradataAsterR implicitly maintains the connection, reviving and reconnecting to the database as necessary. For that reason no explicit connection object is needed, even though ta.connect() does return one to hold on to. Upon success Aster R reports basic information about the established connection:

RODBC Connection 1
Details:
   case=nochange
   DSN=Aster_ODBC_dsn
   DATABASE=my_Aster_db

Loading Data

With Aster R, data stored in the Aster Database is accessed primarily using the concept of virtual data frames (there are other types of virtual objects in Aster R that we won't discuss here). The goal of the virtual data frame concept is transparent handling of data stored in Aster with Aster R functions, just as if it were in R memory.

In our examples we'll use various baseball datasets from the excellent Lahman package. Our first task will be loading the datasets into the Aster database, for example the Master data frame with all baseball players:

library(Lahman)

ta.create(Master, table = "master", schemaName = "baseball",
          tableType = "fact", partitionKey = "playerID")‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

This example uses the function ta.create() to create a table named master in the baseball schema of the Aster Database. The Lahman package contains more datasets we'll use later, so we add a new function createBaseballData() to save us some typing:

createBaseballData <- function(data, table, schema, 
                               partitionKey, analyzeFlag) {
  t = proc.time()
  ta.dropTable(tableName = table, schemaName = schema)
  ta.create(data, table = table, schemaName = schema,
            tableType = "fact",
            partitionKey = partitionKey,
            row.names = TRUE, analyze = analyzeFlag)
  runtime = proc.time() - t
}

# Master table
createBaseballData(Master, "master", "public", "playerID", TRUE)

# Batting table
createBaseballData(Batting, "batting", "public", "playerID", TRUE)

# Fielding table
createBaseballData(Fielding, "fielding", "public", "playerID", TRUE)

Note that createBaseballData() is more involved: it records the time it took to create each table, it makes sure the table didn't already exist, and it runs the SQL ANALYZE command at the end if desired. Finally, it creates an implicit row-name column, which is critical for letting Aster R work with virtual data frames. With R we take advantage of its rich programming features and environment to create functional, robust and testable code as part of the overall workflow.

Worth mentioning are the rest of the functions for loading data into the Aster Database:

  • ta.push() to load data from R into an existing Aster virtual data frame
  • ta.load.csv() to load data from a .csv file to an Aster virtual data frame
  • ta.reify() to materialize virtual data frame into temporary schema

Test, Test, Test

How do we assure that ta.create() performed as intended, that is, that it created an exact copy of the Lahman data set in the Aster Database? The same way any program should: by running a test. For example, by comparing the dimensions of the original data frame with the newly created table:

# R and Lahman package
dim(Batting); dim(Master)

[1] 101332     22

[1] 18846    26

# Aster R
batting.ta = ta.data.frame("batting")
master.ta = ta.data.frame("master")

ta.dim(batting.ta); ta.dim(master.ta)

[1] 101332     23

[1] 18846    27

ta.data.frame() is our principal gateway to Aster R goodies: it creates a virtual data frame from an existing table, view or SELECT query in the Aster Database; in this case we referenced the tables batting and master. ta.dim() is to Aster R virtual data frames what dim() is to R data frames. Each database table contains one additional column, row_name, created by Aster R to maintain functionality compatible with R data frames when using other data frame-like functions:

  • ta.head()
  • ta.tail()
  • ta.show()
  • ta.length()
  • ta.names()
  • ta.ncol()
  • ta.nrow()
  • ta.colnames()
  • ta.dimnames()
  • ta.merge()
  • ta.cbind()
  • ta.rbind()
  • ta.order()
  • and more...

Our test compares dimensions between a data source (one of the Lahman data sets) and its counterpart table in Aster. Such a rudimentary test may work OK for a prototype or unimportant tasks, and it minimizes network traffic between R and Aster. But it is inadequate for the production environment we target with the pipelines. One way to make the workflow more bullet-proof is to bring the data back from Aster into R, transform it to a data frame and compare it with the original:

compareDataInRandAster <- function(df, ta, key, columns=names(df)) {

  columns = setdiff(unique(c(key, columns)), c("lgID","teamID","bats","throws"))

  data_in_R = df[, columns] %>% arrange_(.dots=setNames(key,key))

  data_from_Aster = as.data.frame(ta[ , columns],
    stringsAsFactors = FALSE) %>% arrange_(.dots=setNames(key,key))

  all.equal(data_in_R, data_from_Aster)
}

compareDataInRandAster(Batting, batting.ta,
  key = c("playerID", "yearID", "stint"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The function compareDataInRandAster() is a near-production-quality test that validates any Lahman data set against its image loaded into an Aster table. It utilizes:

  • package dplyr to order data (for convenience, we could use standard order() instead, of course) 
  • Aster R as.data.frame() to convert virtual data frame to R data frame by moving data from Aster into R memory
  • all.equal() to validate that original data set and its image brought back from the database are exactly the same.

With the R programming language, building tests (using a package like testthat) should be an integral part of your development process, and with data science pipelines we seize this opportunity. In particular, replace all.equal() with expect_equal() to make compareDataInRandAster() compatible with the rest of a testthat suite.

In the next post we will review data manipulation techniques available with Aster R as part of data science pipeline and compare them with equivalent functions from dplyr package.

Resources

Image source: Red Hat Developer Program


data science is a team sport - from finance to analytics

Just recently Ron Powell, independent analyst and expert with the BeyeNETWORK, interviewed a Senior Data Scientist on the Aster Strategy and Adoption Team at Teradata.  They discussed the importance of IT, business and data science working together to meet the needs of the business.

In this podcast, you'll hear a discussion about the transition from finance to analytics.  From a data scientist's perspective, you'll hear about the challenges modern organizational silos have in working together.  You'll also hear about the integration of open source R with Teradata Aster and how the different silos show up in this area.

If you can spare 8 minutes, listen in on the conversation:

  Video Link : 1080

 

Leave your comments and questions below and we can continue this conversation.

 - Aster Community Team


Related Links:

Podcast:  Data Science is a Team Sport - From Finance to Analytics 

Podcast Script: The Transition from Finance to Analytics by Ron Powell - BeyeNETWORK

Read more Aster Community Blog Posts:

Blogs in the Learn Aster Community 

  • Aster 101

Centrality as a concept is largely subjective because it answers the question of what nodes are most important in a network.  Importance is heavily dependent on what question you're trying to answer: if you want to know who has the most connections in a network, you need to only count the degrees.  If you're trying to understand information transmission, Betweenness is great.  For cases where you want to understand the ability to transmit information over time and given specific nodal properties, however, you need something like Percolation Centrality (PC) - which is today's topic of discussion!

I want to talk about Betweenness first, as it's the basis for PC.  Betweenness as a concept assumes that information will travel over the "path of least resistance" - If your mother-in-law knows a joke, you're more likely to hear it from her than you are from the Pope - who may have told his friend, who tells his gardener, who tells her cousin, who tells her accountant, who finally tells you.  Likewise, Betweenness is calculated as the proportion of shortest paths that go through any given node.  If we make the assumption discussed above regarding shortest paths, then the proportion should describe the relative control a node has over the transmission of information, in this case, the telling of a joke.

The question then becomes: how does this control change over time and given nodal states?  If your mother-in-law doesn't know the joke, she can't tell you the joke.  Therefore, her ability to transmit information (the joke) in this case would be 0.  This ability is something I'll refer to as a nodal state.  This is one of the key additions that PC makes over Betweenness Centrality.  At each point in time, PC is computed as the proportion of percolated paths that go through a given node (where percolated paths refer to paths that originate from a percolated source).  If you're interested in learning more about the mathematical foundation of Percolation Centrality, please refer to the PLoS One publication by Piraveenan et al. found here.
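
For reference, here is a rough sketch of the two definitions (the betweenness formula is standard; the percolation form follows the Piraveenan et al. paper cited above, so treat the exact normalization as an approximation of the published definition). With \sigma_{sr} the number of shortest paths from s to r, \sigma_{sr}(v) the number of those passing through v, and x_s^t the percolation state of node s at time t:

C_B(v) = \sum_{s \neq v \neq r} \frac{\sigma_{sr}(v)}{\sigma_{sr}}

PC^t(v) = \frac{1}{N-2} \sum_{s \neq v \neq r} \frac{\sigma_{sr}(v)}{\sigma_{sr}} \cdot \frac{x_s^t}{\sum_i x_i^t - x_v^t}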

As part of my M.S. thesis, I developed logic to parallelize the computation of Percolation Centrality and extended Aster's Betweenness code to accommodate the necessary changes.  The key files are attached to this post; however, if dependencies are needed, please contact Mike Rote (Mike.Rote@Teradata.com) for the complete software package.

The Percolation Centrality function takes the same arguments as the Betweenness function, with the exception that the "source" input is mandatory and should be a table of "percolated" nodes (node ID is the only required column).  More information regarding the Aster implementation can be found in my thesis document ("Bulk Synchronous Parallel Implementation of Percolation Centrality for Large Scale Graphs"), which will be published through OhioLink.

Percolation Centrality can be used in applications from diverse fields.  Examples include:

  • Epidemiology (identifying high-risk contagion points in a social network)
  • Marketing (identifying potential recruiters in a network)
  • Transportation (issue tracking)

Cheers!


When analyzing a time series data set we sometimes want to detect those points in time where there is a significant and abrupt change.  

Aster offers a ChangePointDetection function that does exactly that. The function looks back at the available data points and applies a binary segmentation search method. The algorithm executes these key steps:   

  1. Find the first change point in our time series.
  2. From that point, split the data into two parts.
  3. In each part find the change point with the minimum loss (as calculated by a cost function).
  4. Repeat until we have found all the change points.

Before we can learn more about this function we need a data set to explore. We can download the Online Retail Data Set from the UCI Machine Learning repository (link). 

Let's load the csv data into a new Aster table "retail_sales_cpd" and review an example.
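
A minimal sketch of that table, assuming the standard UCI column layout (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country); the column types and the loading route are illustrative only, so adjust them to your environment and favorite bulk loader.

drop table if exists retail_sales_cpd;
create table retail_sales_cpd
(
 invoice_no   varchar,
 stock_code   varchar,
 description  varchar,
 quantity     integer,
 invoice_date timestamp,
 unit_price   numeric(10,2),
 customer_id  bigint,
 country      varchar
) distribute by hash(customer_id);

-- load the csv into retail_sales_cpd with your preferred bulk loader or ETL tool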

Our data set includes 541,909 rows. We pick one sample customer and product:
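
Something like the following pulls that single customer/product order history; the customer_id and stock_code values below are placeholders, so substitute the Netherlands customer and the vintage spaceboy lunch box product from your own load.

select invoice_date, quantity, unit_price
from retail_sales_cpd
where customer_id = 12345        -- placeholder: the Netherlands customer
  and stock_code = 'XXXXX'       -- placeholder: vintage spaceboy lunch box
order by invoice_date;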

In the output we see that a customer from the Netherlands tends to place very large orders for vintage spaceboy lunch boxes. The price is very static, except for one order.  

The quantity varies wildly. We see significant up and down changes (red boxes) throughout the order history.

Of course with large data sets we do not have time to manually sift through the data and create visual plots. Let's review what the ChangePointDetection function can do for us. 

Function Syntax:
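
The original post showed the syntax as an image; the general shape of the call, reconstructed from the parameters discussed below, looks roughly like this (check the Aster Analytics Foundation guide for the exact spelling in your library version):

select * from ChangePointDetection (
    on <input_table>
       partition by <id columns>
       order by <time column>
    ValueColumn ('<numeric column to monitor>')
    Accumulate ('<id and time columns to carry through>')
    SegmentationMethod ('normal_distribution')  -- or 'linear_regression'
    MaxChangeNum (10)
    Penalty ('BIC')                             -- or 'AIC', or a static threshold
    OutputOption ('CHANGEPOINT')                -- or 'VERBOSE', or 'SEGMENT'
);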

Besides the normal function parameters there are a few additional parameters that we need to study more carefully:

  • We will partition the input data by customer and product and sort using the invoice date.  
  • The ValueColumn is the key field of interest where we want to detect changes. For our dataset we can pick qty, price or qty *price.  Note that we can only specify one single column.
  • Accumulate is where we specify the identifying columns that we used in the partition and order by clause (such as customer_id, product_id and invoice_date). These extra output columns will help us interpret the results.
  • SegmentationMethod allows us to choose normal_distribution or linear_regression. (default = normal_distribution) 
  • SearchMethod is always set to binary. This is the only option for the function. We do not have to explicitly specify this parameter for that reason.
  • MaxChangeNum specifies the maximum number of changepoints to detect (default = 10)
  • Penalty can be BIC, AIC or a specific static threshold.  The BIC and AIC criteria are used to evaluate the differences between the chosen change points and the original data. The penalty is included in the cost function as a guard against overfitting.  BIC is the default option.
  • OutputOption can be CHANGEPOINT, VERBOSE or SEGMENT. This option allows us to output only the
    • changepoints (cptid column). This is the default option.
    • changepoints and calculated differences  (between the estimations for the H1 and H0 hypotheses)
    • specific segments that have been detected. 

We invoke the ChangePointDetection function and use linear regression to perform the segmentation:
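
A sketch of that call against the table loaded earlier (column names follow the hypothetical retail_sales_cpd layout above, so adjust them to your schema):

drop table if exists retail_sales_changepoints;
create table retail_sales_changepoints distribute by hash(customer_id)
as
select * from ChangePointDetection (
    on retail_sales_cpd
       partition by customer_id, stock_code
       order by invoice_date
    ValueColumn ('quantity')
    Accumulate ('customer_id', 'stock_code', 'invoice_date')
    SegmentationMethod ('linear_regression')
    Penalty ('BIC')
    OutputOption ('CHANGEPOINT')
);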

Note that while we can use the ACCUMULATE feature to output additional columns, I prefer to join with the source table to get a full picture.

Reviewing our basic line chart again: if we circle higher-qty change points in red and lower-qty change points in green, we get this result:

Obviously the change points do not always correspond with straightforward highs and lows. If they did, we would not need to have the function do all the calculations; a simple SQL windowing approach could accomplish the same.

Change detection on retail data can highlight those customers that have unique requirements and shopping habits. Possibly this group of customers is at higher risk of churn or lower satisfaction and it is a good idea to perform further analysis using other techniques.

To quickly find those customers of interest and products with a higher number of change points we can aggregate our results. 
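
One way to do that aggregation, again against the hypothetical retail_sales_changepoints output table with one cptid value per detected change point:

select customer_id, stock_code, count(cptid) as change_points
from retail_sales_changepoints
group by customer_id, stock_code
order by change_points desc
limit 20;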

Since our example is using a retail data set one question comes to mind: does seasonality impact the results? Yes, change detection algorithms do have a harder time with time series that include seasonality. It is recommended to remove the seasonal component if your results are below expectations.

So what is the value then?

In our example we reviewed retail sales using the quantity sold.  We can apply the same technique to averages, counts, and standard deviations. This opens the door to various use cases such as fraud, intruder, and anomaly detection, where a tangible corrective action is possible.

Another example could be a rise in call center complaints. A change detection analysis can pinpoint the time where one or more events triggered the increase in call volume.  

In manufacturing the strength of a part is affected by a change in the input materials.  


Change point detection can go back in time, run through the historical sensor data, and highlight the time stamps where changes occurred. Those time stamps can potentially be linked to a supplier switch, a different batch of input materials, or a change in the operating environment.

References

Online Retail Data Set (link)

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Change Point Detection: a powerful new tool for detecting changes (link)

Change Point Detection with seasonal time series (link)


Data Science Informative Articles and Blog Posts

Our blogs allow customers, prospects, partners, third-party influencers and analysts to share thoughts on a range of product and industry topics. We also have lots of content in the community, allowing you to gain valuable insights from Teradata data scientists.