Using Aster Express: Act 3, On the Road to nPath

Aster
Teradata Aster is an analytic platform that embeds MapReduce analytic processing with data stores. •Embedded MapReduce analytic processing and unique SQL-MapReduce® framework •Massively parallel data store for multistructured data •Intuitive tools and SQL-MapReduce libraries for rapid analytic development
Teradata Employee

Using Aster Express: Act 3, On the Road to nPath

The foundations of the Aster platform are a parallel database with all the features you'd expect from a SQL platform.  However the power of the Aster platform really shines once you start using its library of SQL-MR analytic components.  With these components, complex data analysis that is difficult to code in SQL becomes much, much easier.  A clear example of this can be seen with Aster's nPath module.

Consider the example dataset, bank_web_clicks, that we loaded in our prior article, Using Aster Express: Act 2.  This fairly simple data set contains web log event data of users' web page visits to a fictional bank web site.  The table fields in this data are:

customer_id
session_id
page
datestamp

For those unfamiliar with web logs, 'sessionizing' refers to grouping a customer’s web clicks into distinct viewing sessions based upon the time of the page view.  Page visits that are close together in time are considered a session.  When there is a sizeable break in time, for example a break greater than an hour, a new session is created.  Think of your own browsing habits.  You may visit a site regularly during the day, but your morning visit is considered a separate session from your afternoon or evening visits.

In our data table, bank_web_clicks, the web log records have already been parsed, and the sessions have already been defined with the field ‘session_id’.  For example, let's look at the data for a single customer with customer_id = 32.

SELECT *
FROM bank_web_clicks
WHERE customer_id = 32;

From the results of this SQL query we can see that customer 32 has visited our site during 8 individual sessions (only 5 of the sessions appear in this screen shot). 

A query like this is a simple metric that we can use to capture information about our customers; for example,  to see who are the most frequent visitors to our website.

(note the LIMIT 10 syntax that keeps the output manageable - there are a million rows in this table!):

SELECT customer_id, count(session_id)
FROM bank_web_clicks
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

OK, so far these are pretty standard SQL queries.  Yes, there is a lot of customer viewing behavior that we can derive from our web logs using these types of SQL queries.  But for more complex analytics, its time to head toward Aster’s powerful analytics libraries and dive into SQL-MapReduce using Aster’s nPath module.

nPath – MapReduce Analytics using SQL Queries

With nPath we can start asking more complex questions like "what are the common paths that our customers take on our web site", or even focus it more to key web transactions like "what path did the customers take that lead them to making a purchase on-line".  These queries need to find patterns or relationships between the rows in the table.  These types of queries, especially Time Series types of queries, become difficult with standard SQL.  Relational databases are great tools when the data model maps the relationships between objects or tables.  But when the relationships are between rows in the same table, well, that's another story.

Let's take a look at our bank_web_clicks table for an example.  So how do we find the customer's path through the web site?  For starters, we'd need to order the page visits by time.  We can do that with the page visit timestamp field in our table (which is what was used to create the session groupings).  But what would that SQL look like?  It would certianly have quite a few 'MAX(datestamp) < datestamp' iterations for each level deep that we wanted to query.  Getting the 2nd page wouldn't be too bad, but the 3rd in the series, or the 4th or 5th - now that would be some complex SQL query!

Now with Aster's nPath module, it is much more straight-forward and simple query to write.  Let's build one here so that we can take a closer look.

First, we start with a SELECT statement that calls nPath.  The parameters that we'll pass to nPath tell it which table to use to pull data, as well as how to partition that data and which patterns we want to analyze.

We’ll start again by looking at the paths taken by our favorite customer (customer_ID=32). With nPath, we are analyzing the data to find common paths.  If we know the end point, its easy to understand the query by thinking of looking backward at the steps that led up to that end point.  In this example, as in many nPath examples, we'll call the end point our 'conversion event', which in this case is the BILL MANAGER ENROLLMENT page. 

While the nPath SQL below does have a little bit of a learning curve, the first thing that should be easily recognized is that this does look like a 'standard' SQL statement.  Take a look at the nPath query below and at the inputs that we are passing to it.  You can see that we are passing the table bank_web_clicks as the input records.  We also are definining an alias named 'CONVERSION' as our endpoint, the 'BILL MANAGER ENROLLMENT' page.  All other pages are aliased as 'PAGE'.  Another key syntax to take a closer look at is the 'PATTERN' statement.  Here we are telling nPath that we are looking for all pages ('PAGE+') that lead to our CONVERSION step; PATTERN('PAGE+.CONVERSION').

SELECT customer_id, session_id, path 
FROM nPath (
ON bank_web_clicks
PARTITION BY customer_id, session_id
ORDER BY datestamp
MODE (NONOVERLAPPING)
PATTERN ('PAGE+.CONVERSION')
SYMBOLS (
page <> 'BILL MANAGER ENROLLMENT' AS PAGE,
page = 'BILL MANAGER ENROLLMENT' AS CONVERSION
)
RESULT (
FIRST (customer_id OF PAGE) AS customer_id,
FIRST (session_id OF PAGE) AS session_id,
ACCUMULATE (page OF ANY (PAGE, CONVERSION)) AS path
)
) n
WHERE customer_id = 32
ORDER BY n.customer_id, n.session_id
LIMIT 20;

In these results we see that customer_32 did enroll in the Bill Manager program during his 4th web session.  And the path that he/she took was first the Account Summary page, then the Bill Manager Form, and finally the Bill Manager Enrollment page.  Cool!

Now what about the other customers?  Well, its "just SQL", go ahead and remove the WHERE clause with customer_id = 32 and rerun the query.  That will show all customers who have joined the Bill Manager program, along with the paths that they took through the web site that lead to that action/conversion.

Or what if we wanted to know the most common paths that customer were taking through our web site?  This might be good information to help us design the site better to encourage customer behaviour.  Let's make another small change to our "just SQL" query:

SELECT DISTINCT(path),COUNT(*) 
FROM nPath (
ON bank_web_clicks
PARTITION BY customer_id, session_id
ORDER BY datestamp
MODE (NONOVERLAPPING)
PATTERN ('PAGE+.CONVERSION')
SYMBOLS (
page <> 'BILL MANAGER ENROLLMENT' AS PAGE,
page = 'BILL MANAGER ENROLLMENT' AS CONVERSION
)
RESULT (
FIRST (customer_id OF PAGE) AS customer_id,
FIRST (session_id OF PAGE) AS session_id,
ACCUMULATE (page OF ANY (PAGE, CONVERSION)) AS path
)
) n
GROUP BY 1
ORDER by 2 DESC
LIMIT 20;

With the addition to our SQL query of a simple DISTINCT and count/GROUP BY, we have some great info on how customers are using the web site.  Now this Aster-nPath is starting to look like a pretty powerful tool.

Hopefully that gives you a good peek into the power of Aster's SQL-MR library, especially the nPath module.  More examples will be coming soon.

In the meanwhile, have fun experimenting with these examples.  We've also published a guide to the Aster nPath syntax that should be very helpful as you start building your own queries against your data, Aster nPath Guide.

Have fun!

Tags (2)
4 REPLIES
Teradata Employee

Re: Using Aster Express: Act 3, On the Road to nPath

Thanks! I like it!
Eager to get more examples to work on. Copy/paste on the last one gave an error though :-)
ERROR: syntax error at or near "DESCLIMIT"
But it was easy to fix...
Teradata Employee

Re: Using Aster Express: Act 3, On the Road to nPath

Hi All,
I got the follow error in using "load_from_teradata" function:

My SQL statement:

CREATE FACT TABLE empty(
UserId int
)
DISTRIBUTE BY HASH(UserId);

SELECT *
FROM load_from_teradata(
ON empty
TDPID('dbccop1') -- Teradata hostname
USERNAME('dbc')
PASSWORD('dbc')
QUERY('SELECT csum(1,1) as UserId, AccountName, RowType from Accounts')
);

error output:
*******************************************************
5/23/2012 16:27:35
SQLA Version: 13.10.0.3
System.Data.Odbc.OdbcException
ERROR [HY000] [AsterData][nCluster] ERROR: load_from_teradata must be invoked on an empty, partitioned table ()
? System.Data.Odbc.OdbcConnection.HandleError(OdbcHandle hrHandle, RetCode retcode)
? System.Data.Odbc.OdbcDataReader.Read()
? System.Data.Odbc.OdbcDataReader.get_HasRows()
? Teradata.SQLA.RunQry.DisplayResults(DbDataReader rdr, String stmt, Int32& rowCnt) ??
F:\ttu1310_efix_snap\tdcli\qman\sqla\RunQry.vb:?? 465
*******************************************************

Anyone can help me? Thanks.

N/A

Re: Using Aster Express: Act 3, On the Road to nPath

I am using the downloaded image on vmware.Can we increase the number of v-workers on it for the node? Right now i can see only one v-worker for my worker node.

I also tried adding one more node.Its state is passive.I am trying to check the performance improvements by increasing v-workers and/or worker nodes.

Can anyone help me?

Waiting for ACT-4/5...

Thanks.
Teradata Employee

Re: Using Aster Express: Act 3, On the Road to nPath

We just posted an article to show how to increase the number of v-workers: http://developer.teradata.com/node/11054.
This is also a required step in order to add more Worker nodes and also to use the Teradata-Aster connector.