An Analytic Workflow in Aster R


Introduction

Are you here because you're wondering what's the buzz (about Aster R) and you want somebody to tell you what's happening? Or maybe you're here because you need to run analytics on billions of rows stored in an Aster database. "I want to use R", you say, "but R is not built for Big Data! The volume of data I can process is limited by memory and R is single threaded! On top of that, complex analytics on large datasets can run slow!", you pause as a tear runs down your cheek, "or not at all". You curse your misfortune. "I DON'T WANT TO USE SQL!!!", you shout into the void. But this time, the void replies. "There", it whispers, "is an R package. It lets you use R code to process your data and run analytics on an Aster database. You can also use it to pull data from and push data to an Aster Database. It's called Aster R. Read Roger Fried's excellent post to learn more about what Aster R is and how it works."

In this post we'll be using the k-means clustering algorithm to group Texas zip codes by average energy usage (in kwh) per month. To learn more about the k-means algorithm, see this video from John Thuma.

The goal of this demonstration is to layer the k-means clusters on top of a map of Texas (geography). One possible application of this would be in marketing and churn models. The consumption-based zip code feature would allow a business to create zip-code-based marketing campaigns targeting usage-based products. From the churn perspective, a business could determine whether certain zip codes with particular usage patterns have a high churn rate and then diagnose the problem further by layering in other features.

The Aster R package makes it easy to write parallelized R code, making it possible for our entire analytic workflow to be done in R - from loading data to making visualizations.

Workflow

We'll start by loading our Aster tables into R as virtual data frames (which are just references to the tables - we aren't moving any data). Then we'll use some functions from the Aster R package to prepare our tables for k-means clustering. Once our tables are ready, we'll copy our (much smaller) processed table into R as a regular data frame and run k-means on it. Finally, we will visualize our results as a map of Texas where the zip codes are colored by cluster.

workflow.PNG

Aster R Example

Required R Packages

require("TeradataAsterR")

install.packages("rgdal")

install.packages("dplyr")

require("rgdal")

require("dplyr")

Connecting To The Aster Database

Check out Roger Fried's post again to see how to connect to your Aster Database through Aster R.
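Once your ODBC DSN is set up, the connection itself is a single call. Here's a minimal sketch - the DSN name "AsterDemo" is just a placeholder for your own, and ?ta.connect lists the additional arguments your environment may need:

# Connect to the Aster database through a previously configured ODBC DSN
ta.connect("AsterDemo")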

Loading Sample Data Into Aster

You can find the data here.

FYI: This is a sample dataset - not actual customer data. It was randomly generated based on general industry usage trends. The point of this demonstration is the process of using Aster R, not gaining any specific insights from these datasets.

Create a new schema called 'energy_demo_tables' in the Aster database you're using.

We're going to load our data through Aster R. First, we'll create empty target tables to load the data into. These tables will be called energy_demo_tables.meter_data and energy_demo_tables.accounts.

Let's start with the accounts table.

ta.dropTable("accounts", schemaName = "energy_demo_tables")

meter_id <- character()

zip_code <- character()

city <- character()

state <- character()

ta.create(data.frame(meter_id, zip_code, city, state), table="accounts", schema="energy_demo_tables", tableType="fact", partitionKey="meter_id")

The ta.dropTable() function is equivalent to DROP TABLE IF EXISTS in SQL - if the table already exists, it gets dropped. This helps us avoid errors if we already created the table but need to backtrack. The ta.create() function creates a table on the Aster database. If you want to know more about what any function I'm using does, use the ? command (?ta.dropTable, ?ta.create, etc.) to pull up its documentation.

Next, we'll do the meter_data table.

ta.dropTable("meter_data", schemaName = "energy_demo_tables")

meter_id <- character()

reading_timestamp <- character()

kwh <- numeric()

ta.create(data.frame(meter_id, reading_timestamp, kwh), table="meter_data", schema="energy_demo_tables", tableType="fact", partitionKey="meter_id")

Here is the SQL equivalent of what we just did:

DROP TABLE IF EXISTS energy_demo_tables.accounts;

CREATE TABLE energy_demo_tables.accounts (
    meter_id varchar
    ,zip_code varchar
    ,city varchar
    ,state varchar
) DISTRIBUTE BY HASH(meter_id);

DROP TABLE IF EXISTS energy_demo_tables.meter_data;

CREATE TABLE energy_demo_tables.meter_data (
    meter_id varchar
    ,reading_timestamp varchar
    ,kwh decimal
) DISTRIBUTE BY HASH(meter_id);

Compare both methods - it will help you understand the loading process a lot better!

Now, we will load the accounts.csv file into the accounts table and the meter_data.csv file into the meter_data table, and create tadf references to each. Depending on your connection, you may want to plug in your Ethernet cable and kick back with a cup of tea and your favorite Aster manual.

Make sure you update the path to reflect where you downloaded the data. Remember, R paths use '/' (or an escaped '\\'), not a single '\'.
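If you'd rather not repeat the full path twice, one optional approach is to build the file names once (data_dir below is just the example path from this post - point it at your own download folder):

# Directory where you saved the two CSV files - edit this for your machine
data_dir <- "C:/Users/TD186015/Documents/Aster Demo Data Files"
accounts_csv <- file.path(data_dir, "accounts.csv")
meter_data_csv <- file.path(data_dir, "meter_data.csv")

You could then pass accounts_csv and meter_data_csv to the ta.read.csv() calls below in place of the literal paths.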

tadf_accounts <- ta.read.csv(ta.data.frame("accounts", schemaName = "energy_demo_tables"), "C:/Users/TD186015/Documents/Aster Demo Data Files/accounts.csv")

tadf_meter_data <- ta.read.csv(ta.data.frame("meter_data", schemaName = "energy_demo_tables"), "C:/Users/TD186015/Documents/Aster Demo Data Files/meter_data.csv")

Sit tight! This will take some time.

Let's warm up and get a feel for what our data looks like with a few basic TeradataAsterR commands!

ta.head(tadf_accounts, n = 2)

ta.head(tadf_meter_data, n = 2)

ta.colMeans(tadf_meter_data$kwh)

ta.colMaxs(tadf_meter_data$kwh)

ta.colMins(tadf_meter_data$kwh)

Notice that we can use the $ symbol, just like with regular data frames, to access specific variables (attributes, columns, call them what you want!) within a virtual data frame.

If you see this output then your data probably loaded correctly:

  meter_id zip_code    city state
1  2391896    79951 El Paso    TX
2  4578095    77092 Houston    TX

  meter_id reading_timestamp      kwh
1  2489659        2015-10-14 140.0326
2  2085400        2015-10-14 120.2228

Preprocessing

One of the neat features of Aster R is that you can use it to execute SQL queries on tables stored in Aster.

Remember that our goal is to cluster zip codes by average power consumption per month and then plot the clusters on a map. Right now our data lives in two separate tables and isn't aggregated by zip code. The goal of this stage is to join our two tables together (the common key is meter_id) and aggregate the result so we end up with a table containing the month's average power consumption (kwh) for each zip code.

First, we use an inner join to join both tables on meter_id and store the results in the energy_demo_tables.joined_all_rows table.

ta.dropTable("joined_all_rows", schemaName = "energy_demo_tables")

ta.create(ta.join(tadf_accounts, tadf_meter_data, by = c("meter_id" = "meter_id")), table = "joined_all_rows", schemaName = "energy_demo_tables", tableType = "fact", partitionKey = "zip_code")

tadf_joined_all_rows <- ta.data.frame("joined_all_rows", sourceType = "table", schemaName = "energy_demo_tables")

Are you disappointed by how straightforward this is? Too bad! You'll just have to get used to it!

This is progress! We only have one table - but our data is still too granular. Let's aggregate! It will make life easier for us when we run k-means. Since we are most interested in power usage by zip code, how about we create a new tadf containing just the zip_code and kwh columns? Seems like a good place to start.

tadf_zip_kwh_to_transform <- tadf_joined_all_rows[,c("zip_code", "kwh")]

We'll use the ta.transform() and ta.unique() functions to aggregate. Yes, it looks clunkier than SQL - but this is a quick way to aggregate large datasets in Aster R. The ta.transform() call in the nested Aster R code below is equivalent to the following window function in SQL:

AVG(x) OVER (
  PARTITION BY partition_column
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

The ta.unique() call then returns only the unique rows.
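If it helps to see the same idea on a small scale, here is what the transform-then-unique combination does to an ordinary (local) R data frame - a toy sketch using base R, not part of the Aster workflow:

# Toy data: three meter readings across two zip codes
toy <- data.frame(zip_code = c("75001", "75001", "77092"),
                  kwh = c(100, 140, 90))

# Replace each kwh with the mean kwh of its zip code (the windowed-average step)
toy$kwh <- ave(toy$kwh, toy$zip_code, FUN = mean)

# Collapse duplicate rows, leaving one row per zip code (the unique step)
unique(toy)

Now back to the real thing on the Aster side: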

ta.dropTable("zip_kwh_transformed", schemaName = "energy_demo_tables")

ta.create(ta.unique(ta.transform(tadf_zip_kwh_to_transform, kwh = mean(kwh, partition = zip_code))), table = "zip_kwh_transformed", schemaName = "energy_demo_tables", tableType = "fact", partitionKey = "zip_code")

tadf_zip_kwh_transformed <- ta.data.frame("zip_kwh_transformed", sourceType = "table", schemaName = "energy_demo_tables")

Check out Alexandre Vilchek's article for a more advanced method of writing parallel aggregation code.

....And that's it for preprocessing!

k-means Clustering

Remember when we had millions of those rows? Seems like a dream now...

Well, now our data set has only as many rows as there are zip codes in Texas, which is not much - relatively speaking. We can quickly move our data out of Aster into an R data frame and do some k-means clustering.

zip_kwh <- data.frame(tadf_zip_kwh_transformed)

Look how nicely Aster R data structures play with R data structures!!

Now that we've made our big data much, much smaller, we can use the regular R kmeans() function to do our clustering.

This code runs the kmeans() function and creates a new data frame called my_clusters where each zip code is assigned a cluster. We also rename the cluster column to clusterid.

# Run k-means with 4 clusters on the zip_code / kwh data
pivot_centroid <- kmeans(zip_kwh, centers = 4, iter.max = 10)

# Attach each zip code's cluster assignment and give the column a friendlier name
my_clusters <- data.frame(zip_kwh, pivot_centroid$cluster)
colnames(my_clusters)[3] <- "clusterid"
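A few optional sanity checks before moving on (these just inspect the objects we created above):

head(my_clusters)              # each zip code with its kwh and assigned clusterid
pivot_centroid$centers         # the centroid of each of the four clusters
table(my_clusters$clusterid)   # how many zip codes landed in each cluster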

Wow. Let's take our sweaty, clammy hands off the keyboard and reflect on the time when massive amounts of data were an impediment to doing data analysis in R!

Visualizing

We have zip codes, we have clusters. How would these zip codes look if we plotted them on a map?

Well.... first we would need a map.

Downloading the Geospatial Data

You can download a map of USA zip codes here.

Click "Download the 2014 ZIP Code Tabulation Areas (ZCTAs) Boundary File"

Now, drag the file into the R project directory you created earlier. Run the following command to unzip the file:

unzip("cb_2014_us_zcta510_500k.zip", exdir = "zip_codes")

This will unzip the file into a new directory in your workspace called zip_codes.
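If you want to confirm the extraction worked, listing the new directory should show the pieces of the shapefile (.shp, .dbf, .prj, .shx, and friends):

list.files("zip_codes")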

Loading Geospatial Data

To manipulate maps in R, we need to first find a way to load spatial data. In this document, we will load our geographic data from a shapefile using the readOGR() function from the rgdal package.

Note that the dsn argument here is different from the DSN you pass to ta.connect(): here it simply refers to the directory the shapefile is in, and layer refers to the name of the shapefile (without its extension).

zip_map <- readOGR(dsn = "zip_codes", layer = "cb_2014_us_zcta510_500k")

The zip_map object is an instance of the "SpatialPolygonsDataFrame" class. Let's get a feel for what this object is like by running the following commands one by one:

head(zip_map, n = 1)

head(zip_map@data, n = 1)

plot(zip_map)
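A few more optional commands that help reveal the structure of the object (all base R / methods functions):

class(zip_map)       # "SpatialPolygonsDataFrame"
slotNames(zip_map)   # its slots: data, polygons, plotOrder, bbox, proj4string
nrow(zip_map@data)   # one row of attributes per zip code polygon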

The ZCTA5CE10 column contains the zip codes. Let's rename the ZCTA5CE10 column to something a little easier on the fingers.

colnames(zip_map@data)[1] <- "zip_code"

Preparing Shapefile For Visualization

As we saw earlier, our friendly neighborhood object, zip_map, is a map of every zip code in the country. We only want zip codes in Texas, so let's subset this bad boy. Since (almost all) Texas zip codes have three-digit prefixes from 750 through 799, we only care about zip codes with those prefixes.

texas <- zip_map[as.numeric(substring(as.character(zip_map$zip_code), 1, 3)) > 749 &
                 as.numeric(substring(as.character(zip_map$zip_code), 1, 3)) < 800, ]

Let's make sure we subsetted Texas correctly.

plot(texas)

Looks like Texas to me.

Preparing Data For Visualization

To plot our clusters on a map, we need to join the output from running k-means to our shiny new object, texas. More specifically, we want to see which zip codes are in which clusters. inner_join() is a function from the dplyr package. Make sure you installed dplyr and loaded the package like I showed you above.

texas@data <- inner_join(texas@data, my_clusters)

Drat!!! An error! But don't worry, Aster fans, I haven't led you astray. If you check the class of texas$zip_code, you'll see that it's a factor. Factors are R's implementation of categorical variables - but the zip_code column in my_clusters is numeric, and dplyr won't join columns of incompatible types. For our purposes we don't want our zip codes to be categorical variables; we want them to be numeric. We can convert them with the following code:

texas$zip_code <- as.numeric(as.character(texas$zip_code))

Ok, let's run that code again:

texas@data <- inner_join(texas@data, my_clusters)

Use the head() command like we did above to check that the data joined properly. Remember to reference the correct slot.
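For example (using the @data slot, as we did with zip_map earlier):

head(texas@data, n = 2)   # the zip code attributes plus the new kwh and clusterid columns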

Plot

Now, all we have left to do is plot!

# Draw the Texas outline first, then overlay each cluster in its own color
# (we used centers = 4 in kmeans(), so there are four clusters to color)
plot(texas)
plot(texas[texas$clusterid == 1, ], col = "#e41a1c", add = TRUE)
plot(texas[texas$clusterid == 2, ], col = "#377eb8", add = TRUE)
plot(texas[texas$clusterid == 3, ], col = "#4daf4a", add = TRUE)
plot(texas[texas$clusterid == 4, ], col = "#984ea3", add = TRUE)
title(main = "Zip Codes Clustered By Energy Usage")

The resulting map:

picture_for_blog.PNG

You can see how each zip code is assigned a color based on the cluster it belongs to. Note that since k-means places its initial centroids randomly, your picture may differ from mine. That is OK!
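If you want your clusters (and therefore your colors) to come out the same on every run, one option is to fix R's random seed right before the kmeans() call - the seed value itself is arbitrary:

set.seed(1234)   # makes the random centroid initialization repeatable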

Conclusion

We saw how using R and Aster together lets you do scalable analytics on all of your data - whether it is big, small, relational, or geographical. In this case, we used Aster R to find patterns in usage-based relational data and layer the results on a map. We showed earlier how these results could be used to build marketing and churn models. It's not hard to imagine the value you could gain with other types of data and methods. For example, suppose you grouped transcripts of customer service calls to your company by zip code. Instead of clustering by usage, you could perform sentiment analysis to find the zip codes where customers are satisfied or unsatisfied. From a churn perspective, you could see if 'unhappy' zip codes have high churn rates, or whether 'happy' zip codes have low churn rates.

Thanks to the scalability of Aster R, slight adjustments to the method in this post are all it takes to tackle an entirely different kind of problem with completely different data.