Data Science - Where Aggregations Taint Insights

I intended this blog post for readers who are transitioning from a BI background to the Data Science world. I was googling this topic before I started writing, and it was great to see that more folks have started discussing it. This post is my spin on it, with more examples.

I love BI - I've created tons of reports using MSTR, SSRS, and Business Objects for the companies I worked for before (including Aster Data). I was able to add value, with the business benefiting enormously from reports on trends, KPIs, and so on - until I found out, a few years ago, what I was missing.

In hindsight - Reporting is *VERY DIFFERENT* from Insights - Period. Over the last few years, with the help of Data Science, we've transitioned from the What to the Why - the How still remains in play. It's a big paradigm shift.

One doesn't have to find exotic examples to prove this. Let's look at web clickstream data. With technologies like Teradata Aster, you can sessionize billions of rows of Apache logs really quickly. If you are not familiar with the concept, see the video on what Sessionize does.

For the impatient: Sessionize breaks a web clickstream into visits with IDs. A visit ID gets assigned to a run of sequential page views that happen in close proximity to each other, without a big pause between consecutive views. If the next page view occurs only after a long timeout since the last one, it gets assigned a new visit or session ID.
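
If you want to see the mechanics, here is a minimal sketch of the same idea using standard SQL window functions. The clicks table, its userid and clicktime columns, and the 30-minute timeout are all illustrative assumptions; the Aster Sessionize function does this for you in one call, so this is only to show the logic:

    -- Illustrative only: flag page views that start a new session (first view for the
    -- user, or a gap longer than the timeout), then take a running sum of those flags.
    -- Timestamp arithmetic syntax varies by database.
    WITH flagged AS (
      SELECT
        userid,
        clicktime,
        CASE
          WHEN LAG(clicktime) OVER (PARTITION BY userid ORDER BY clicktime) IS NULL
            OR clicktime - LAG(clicktime) OVER (PARTITION BY userid ORDER BY clicktime)
               > INTERVAL '30' MINUTE
          THEN 1 ELSE 0
        END AS new_session
      FROM clicks
    )
    SELECT
      userid,
      clicktime,
      SUM(new_session) OVER (
        PARTITION BY userid
        ORDER BY clicktime
        ROWS UNBOUNDED PRECEDING
      ) AS session_id
    FROM flagged;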

Sessionized data is super useful for understanding the user 'intent' behind a burst of activity that occurs in close proximity on the website. You can use sessionized data to look at things like session or visit duration, page dwell time, and the sequence of page views to determine intent, and even to score sessions for intent. You can create all kinds of BI reports around sessionized data using standard aggregations: averages, counts, count distincts, cubes over multiple dimensions (aka segmentation), and so on. Slap the measures onto a Tableau or MicroStrategy dashboard, right?
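
For concreteness, here is a minimal sketch of those aggregations in standard SQL, assuming the sessionized page views have been materialized into a hypothetical sessionized_clicks table with userid, session_id, and clicktime columns:

    -- Roll page views up to one row per session, then aggregate across sessions.
    -- Interval arithmetic (and casting durations to minutes) varies by database.
    WITH sessions AS (
      SELECT
        userid,
        session_id,
        MIN(clicktime) AS session_start,
        MAX(clicktime) AS session_end,
        COUNT(*)       AS page_views
      FROM sessionized_clicks
      GROUP BY userid, session_id
    )
    SELECT
      COUNT(*)                         AS total_sessions,
      COUNT(DISTINCT userid)           AS distinct_visitors,
      AVG(page_views)                  AS avg_pages_per_session,
      AVG(session_end - session_start) AS avg_session_duration  -- the KPI questioned below
    FROM sessions;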

Average Session or Visit Duration:

Let's look at the Average Session Duration metric. Say we measure it as 2.5 minutes over a whole week of usage, using the AVERAGE() function. If we report this as a KPI on a slide, the first automatic inference the business will make is that 2.5 minutes is the most "central" value and that most session durations are somehow spread evenly around this number (more or less).

Here's the actual distribution of session duration from some sample web logs - basically, a plot of session duration (minutes) vs. # of sessions for each minute bucket.
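
The bucketing behind such a plot is trivial to compute; a minimal sketch, assuming a hypothetical session_durations table with a numeric duration_min column (session duration in minutes):

    -- Count sessions per whole-minute bucket of duration.
    SELECT
      FLOOR(duration_min) AS duration_bucket_min,
      COUNT(*)            AS num_sessions
    FROM session_durations
    GROUP BY FLOOR(duration_min)
    ORDER BY duration_bucket_min;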


You can see that the majority of sessions were shorter than 3 minutes, with only a long tail of sessions longer than that! In fact, the shorter sessions (under 3 minutes) were probably not very useful compared to the longer ones. Three minutes is too little time to do much on a website, yes? Most users were probably logging in, checking their account balance, status updates, or news feeds, and then logging out. The longer sessions are the ones that actually indicate meaningful user engagement and are the most likely candidates for monetization.

In a typical case, any number north of 5 or 7 minutes would have been a useful figure to report! Any operational decision based solely on the AVERAGE KPI of 2.5 minutes - including a website redesign - would have produced bad results for the business.

Effect of Distribution:

As we've seen with the example above, it's very important to look at the underlying data through a distribution lens before interpreting any reported KPI. In the case of the session duration above, the distribution was not the typical inverted-U shape that occurs in nature with the weights or heights of individuals. Instead, the data was spread much like income in a population: a very few super rich, followed by a long tail of high, middle, and low incomes - a shape known as a Pareto distribution. In such cases, it's easy to see that AVERAGE(), the most popular measure of centrality, fails to deliver the right intuition.

Average # of Connections per user in a Social Graph:

Today, it's fairly easy to get social media data like Facebook or Twitter feeds. One of the most interesting measures in a social graph is the # of connections per user, reported to the business/operational folks. For example, if you have the social graph in the form of a table, Marketing would like an aggregate over the # of connections per user. If we knew the average # of connections per user, it would be easy to project campaign costs and so on - yes?

When I first did this exercise, I was curious to see what the distribution looked like (having been burnt by the Average Session Duration episode). Guess what? The # of connections per user also formed a Pareto distribution! In a social graph, a few users have thousands of connections, followed by a long tail of users with fewer and fewer. So naively reporting an AVERAGE would have been misleading here too.
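
The check is the same as before; a minimal sketch, assuming a hypothetical edges table with one row per connection (user_id, friend_id):

    -- Connections per user, then the distribution of that count across users.
    -- Plotting num_connections vs. num_users shows the Pareto-style long tail.
    WITH connections AS (
      SELECT user_id, COUNT(*) AS num_connections
      FROM edges
      GROUP BY user_id
    )
    SELECT
      num_connections,
      COUNT(*) AS num_users
    FROM connections
    GROUP BY num_connections
    ORDER BY num_connections;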

Reporting Aggregates with Qualification:

As we saw with the two examples, 'Average Session Duration' and 'Average # of Connections per user', AVERAGE - the most popular measure in the BI world - misrepresented the underlying data and has the potential to drive bad decisions. So the question is: as a Data Scientist or Analyst, what is the responsible way to report and share insights?

  • Always check the distribution of any aggregate you report. It's very easy to build histograms, as shown in the Average Session Duration case above. Unless the data follows an inverted-U-shaped curve, also known as the NORMAL distribution, averages always have to be qualified.
  • Report averages with a percentile qualification. It's more accurate to say that the Average Session Duration is 2.5 minutes for sessions within the 90th (or 50th, etc.) percentile. Tracking Average Session Duration month over month at a fixed percentile is more useful than doing it over all the data (a sketch of this and the next point follows the list).
  • Segment the data along multiple dimensions and report averages per segment. Instead of reporting averages over all the data, report them for specific browsers, areas of the website, male or female users, first-time or repeat visitors, and so on.
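
Here is a minimal sketch of the second and third points in standard SQL, assuming the same hypothetical session_durations table with duration_min, browser, and visitor_type columns (PERCENTILE_CONT is standard SQL, though support varies by database):

    -- Percentile qualification: report the median and 90th percentile next to the mean,
    -- so the average is never read in isolation.
    SELECT
      AVG(duration_min)                                         AS mean_duration_min,
      PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_min) AS median_duration_min,
      PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY duration_min) AS p90_duration_min
    FROM session_durations;

    -- Segmentation: the same average, broken out by dimensions instead of one global number.
    SELECT
      browser,
      visitor_type,
      COUNT(*)          AS num_sessions,
      AVG(duration_min) AS avg_duration_min
    FROM session_durations
    GROUP BY browser, visitor_type;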

Those big Outliers - Shouldn't I be removing them?

The word 'outlier' is a loaded term. If we do not look at the data through a distribution lens, everything that hurts the aggregate or the report becomes an outlier! One can keep removing the smallest values or the extremely large ones and eventually lose the insight. In other words, if you are spotting a lot of outliers with no end in sight, the data is probably not normally distributed. This is why percentiles are so important!

More Scenarios:

There are many more common scenarios where averages are misleading as a single reported number. Here are a few more examples that follow Pareto distributions:

  • You are reporting on workloads in a cluster and trying to understand Average Query Time. I'd check the histogram first. There are probably thousands of short queries and fewer and fewer long ones, so reporting a plain average is misleading. You probably need to tile it and say, for example, that the average query time within the 95th percentile is XYZ (see the sketch after this list).
  • Average # of Queries per User - there are always a few users who run an extremely large # of queries, followed by a long tail.
  • Anything else?
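
For the workload example, the percentile-qualified average might look like the sketch below, assuming a hypothetical query_log table with a duration_sec column:

    -- Average query time qualified at the 95th percentile: compute the threshold first,
    -- then average only the queries at or below it.
    WITH p AS (
      SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_sec) AS p95_sec
      FROM query_log
    )
    SELECT
      MAX(p.p95_sec)      AS p95_threshold_sec,
      AVG(q.duration_sec) AS avg_duration_sec_within_p95
    FROM query_log q
    CROSS JOIN p
    WHERE q.duration_sec <= p.p95_sec;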

Quickest way to compute Percentiles on billions of rows:

Check out the ApproximatePercentile() SQL/MR function in the foundation guide! It can take billions of rows and tile them in a distributed way really quickly. Traditional Percentile() functions bring all the data onto one node and chop it into tiles, whereas ApproximatePercentile() does it very efficiently by trading a small, acceptable error for performance.