What Is This Data Set?

Teradata Employee

The fable The Blind Men and the Elephant teaches us that looking at something from only one perspective is not conducive to understanding the big picture.

The same is true for your data sets. In this post we will look at clickstream data through the lens of several genres of analytics to understand the big picture. This thought experiment is an early phase of the Multi-Genre Analytics approach: we look at the data set through multiple genres but do not yet combine them to solve a specific business problem. Although there are many genres of analytics, we will focus on five that are easily visualized and good at providing quick insights about the big picture.

Data as a Table

[Table: sample rows with Session, Web Page, and Time Stamp columns. Time Stamp values:]
4/1/09 1:29 PM
4/1/09 1:32 PM
4/1/09 1:34 PM
4/1/09 1:38 PM
4/1/09 3:41 PM
4/1/09 3:43 PM
4/2/09 5:43 AM
4/2/09 5:47 AM
4/2/09 5:50 AM

We are using a data set of roughly 3M rows about customer browsing behavior. Each row has a session, a web page, and a time stamp. By viewing the data as a table we can easily see the type of data we're working with and begin to do some basic business reporting. By using aggregates we can answer simple statistical questions like 'How many sessions were there per month?' and 'What are the most popular web pages?'
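As a rough illustration of those two aggregate questions, here is a minimal Python sketch. The rows, the page names, and the helper functions `sessions_per_month` and `page_popularity` are all made up for this example; the only things taken from the data set are the three columns and the time stamp format. It also assumes that a session belongs to the month of its first click.

```python
from collections import Counter
from datetime import datetime

# Made-up sample rows: (session id, web page, time stamp).
rows = [
    (1, "home", "4/1/09 1:29 PM"),
    (1, "search_results", "4/1/09 1:32 PM"),
    (1, "view_product", "4/1/09 1:34 PM"),
    (2, "view_product", "4/1/09 3:41 PM"),
    (2, "checkout", "4/1/09 3:43 PM"),
    (3, "home", "5/2/09 5:43 AM"),
    (3, "view_product", "5/2/09 5:47 AM"),
]


def parse_ts(text):
    """Parse a time stamp written like '4/1/09 1:29 PM'."""
    return datetime.strptime(text, "%m/%d/%y %I:%M %p")


def sessions_per_month(rows):
    """Count distinct sessions by the (year, month) of their first click."""
    first_click = {}
    for session, _page, ts in rows:
        when = parse_ts(ts)
        if session not in first_click or when < first_click[session]:
            first_click[session] = when
    return Counter((t.year, t.month) for t in first_click.values())


def page_popularity(rows):
    """Count how many rows (page views) each web page has."""
    return Counter(page for _session, page, _ts in rows)


per_month = sessions_per_month(rows)
popular = page_popularity(rows)
print(per_month)
print(popular.most_common(3))
```

On 3M rows the same two questions would more likely be answered with GROUP BY queries in the database; the sketch only shows the shape of the aggregation.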

Data as a Path

Path View of Click Stream Data

We next view our data as a path to see how customers are navigating the website. The picture above shows the 100 most common paths in this data set. Some common behaviors quickly jump out:

  • Customers are equally likely to start navigating from the home page or from a specific view product page. This suggests that many of our customers find us through product links shared by friends or by bookmarking a product to view later.
  • A prominent path is multiple search results pages in a row. Assuming there is a search bar at the top of the search results page, this means customers are often unsatisfied with their results and search again. It may be time for an analytics project on on-site search.
  • It is common for customers to leave the website after viewing a product.
  • If customers check out, it is more likely to be near the start of their session. It is less common for customers to browse the site and then check out in the same session.

Now that we have an idea of what's going on, we could refine the paths we're looking for to gain further insights.
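One way to get from the table to a path view is to group clicks by session, order them, and count identical page sequences. The sketch below assumes a made-up `clicks` list in which each row carries a click order within its session (in the real data set the time stamp would play that role). The function names `session_paths` and `top_paths` are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up rows: (session id, web page, click order within the session).
clicks = [
    (1, "home", 1), (1, "search_results", 2),
    (1, "search_results", 3), (1, "view_product", 4),
    (2, "view_product", 1),
    (3, "home", 1), (3, "search_results", 2),
    (3, "search_results", 3), (3, "view_product", 4),
    (4, "view_product", 1), (4, "checkout", 2),
]


def session_paths(clicks):
    """Return {session id: tuple of pages in click order}."""
    by_session = defaultdict(list)
    for session, page, order in clicks:
        by_session[session].append((order, page))
    return {
        session: tuple(page for _order, page in sorted(visits))
        for session, visits in by_session.items()
    }


def top_paths(clicks, n=100):
    """Return the n most common full-session paths with their counts."""
    return Counter(session_paths(clicks).values()).most_common(n)


paths = session_paths(clicks)
common = top_paths(clicks, n=3)
# Which page each session starts on.
entry_pages = Counter(path[0] for path in paths.values())

for path, count in common:
    print(count, " > ".join(path))
```

Counting `entry_pages` is a quick check on the first observation above, and narrowing the path pattern (for example, only sessions that end in a checkout) is a matter of filtering `paths` before counting.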

Data as a Tree

Viewing the data as a tree gives us an unaggregated view of the common ways that customers move about the website. As in the path view, we see that customers start with either the home page or a view product page. With this view we can then see whether there are any differences in behavior later in the browsing session.

Trees are also great for understanding behavior that isn't linear. Online shopping is a good example, since it is common for customers to hold the Ctrl key and open multiple tabs from a search results page. Viewed linearly, it looks as if the customer viewed one item after another. Viewed hierarchically, we can see that those items are open at the same time, with customers even jumping back and forth between pages.
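A simple way to approximate a tree view is a prefix tree in which sessions that share their first pages share a branch. The sketch below uses made-up paths and hypothetical helpers `build_tree` and `render`. It does not model the multi-tab case described above, which would need to know which page each click was opened from, and the three columns of this data set do not record that.

```python
# Made-up session paths (pages in click order).
paths = [
    ("home", "search_results", "view_product"),
    ("home", "search_results", "search_results"),
    ("view_product", "checkout"),
    ("view_product",),
    ("home", "view_product"),
]


def build_tree(paths):
    """Build a prefix tree; each node counts the sessions passing through it."""
    root = {"count": 0, "children": {}}
    for path in paths:
        root["count"] += 1
        node = root
        for page in path:
            node = node["children"].setdefault(
                page, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


def render(node, depth=0, lines=None):
    """Return indented text lines, busiest branches first."""
    if lines is None:
        lines = []
    children = sorted(
        node["children"].items(), key=lambda item: -item[1]["count"]
    )
    for page, child in children:
        lines.append("  " * depth + f"{page} ({child['count']})")
        render(child, depth + 1, lines)
    return lines


tree = build_tree(paths)
print("\n".join(render(tree)))
```

Reading down a branch shows how behavior diverges after a shared start, which is the "differences later in the session" question from above.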

Data as a Graph

Graphs allow us to understand the relationships in our data space. Here we are looking at how customers move from one page to another, or how the various pages are related. From this graph we can see that:

  • Search results is the predominant web page, and customers often do repeated searches before viewing an item.
  • The view product page is the center of our website. If customers can't view products, they will not make purchases.
  • Some customers go from the home page to a specific product. It seems that advertising is working, but further analysis is needed to see whether these products are being purchased.

We might also look at the relationship between customers and items bought to see if there is a trend between the type of customer and the type of item.
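The page-to-page graph can be sketched as a weighted edge list: every pair of consecutive pages in a session adds one to the weight of that edge. The paths below are made up, and `transition_counts` and `node_weight` are hypothetical helper names. Using total edge weight as a measure of how central a page is is an assumption made for this example.

```python
from collections import Counter

# Made-up session paths (pages in click order).
paths = [
    ("home", "search_results", "search_results", "view_product"),
    ("home", "view_product", "checkout"),
    ("search_results", "search_results", "view_product"),
    ("view_product",),
]


def transition_counts(paths):
    """Count directed page-to-page moves: {(from_page, to_page): count}."""
    edges = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges


def node_weight(edges):
    """Sum the weights of all edges touching each page (in plus out)."""
    weight = Counter()
    for (src, dst), count in edges.items():
        weight[src] += count
        weight[dst] += count
    return weight


edges = transition_counts(paths)
weight = node_weight(edges)

for (src, dst), count in edges.most_common():
    print(f"{src} -> {dst}: {count}")
```

A self-loop such as `search_results -> search_results` is the repeated-search behavior from the first bullet, and the `home -> view_product` edge is the one to follow up on for the advertising question.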

Data as Text

Although this is a text-sparse data set, some insights are clear here that are not when we view the data through another genre. In path and graph we saw that search results was the most prominent page, but when we look at sheer volume, view product stands out. This means that many customers are coming directly to a product page and then doing nothing else. We might look at those specific product pages to determine a cause for this. Do they have less of a product description? Is the checkout button not working here? What might be causing this sparse behavior?
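To follow up on that observation, we could count sessions that consist of a single page view and compare them with overall page volume. The sketch below uses made-up paths keyed by session id and the hypothetical helpers `page_volume` and `single_page_sessions`. Finding the specific product pages involved would also need a product identifier, which is assumed here to be available somewhere in the page field.

```python
from collections import Counter

# Made-up session paths keyed by session id.
paths = {
    1: ("view_product",),
    2: ("home", "search_results", "view_product"),
    3: ("view_product",),
    4: ("view_product", "checkout"),
    5: ("home",),
}


def page_volume(paths):
    """Count every page view, regardless of where it falls in a session."""
    return Counter(page for path in paths.values() for page in path)


def single_page_sessions(paths):
    """Count sessions with exactly one page view, by that page."""
    return Counter(path[0] for path in paths.values() if len(path) == 1)


volume = page_volume(paths)
bounces = single_page_sessions(paths)

for page, views in volume.most_common():
    print(f"{page}: {views} views, {bounces[page]} single-page sessions")
```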


We started by knowing very little about a clickstream data set. By looking at the data with visualizations from several genres of analytics, we were able to get a better understanding of how customers are experiencing this specific website. We gained new insights from each type of analytic and by the end were less likely to make poor assumptions about what was going on. This gave us clear questions to investigate further and insights on common business questions like 'How can we improve online sales?'