Change Point Detection for lunchboxes

Teradata Employee

When analyzing a time series data set, we sometimes want to detect the points in time where a significant and abrupt change occurs.

Aster offers a ChangePointDetection function that does exactly that. The function looks back at the available data points and applies a binary segmentation search method. The algorithm executes these key steps:   

  1. Find the first change point in our time series.
  2. From that point, split the data into two parts.
  3. In each part find the change point with the minimum loss (as calculated by a cost function).
  4. Repeat until we have found all the change points.
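
The steps above can be sketched in a few lines of Python. This is only an illustration of the general binary segmentation idea, not Aster's implementation: the cost function (sum of squared deviations from the segment mean), the fixed penalty value, the minimum segment size and the order in which segments are explored are all assumptions made for this example.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Cost of one segment: sum of squared deviations from the segment mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values: List[float], start: int, end: int,
               min_size: int = 2) -> Tuple[float, Optional[int]]:
    """Return (gain, index) for the split of values[start:end] that lowers the cost most."""
    whole = sse(values[start:end])
    best_gain, best_idx = 0.0, None
    for i in range(start + min_size, end - min_size + 1):
        gain = whole - sse(values[start:i]) - sse(values[i:end])
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_gain, best_idx


def binary_segmentation(values: List[float], penalty: float,
                        max_changes: int = 10) -> List[int]:
    """Keep splitting segments until no split improves the cost by more than the penalty."""
    change_points: List[int] = []
    segments = [(0, len(values))]  # segments still to be examined
    while segments and len(change_points) < max_changes:
        start, end = segments.pop()
        gain, idx = best_split(values, start, end)
        if idx is not None and gain > penalty:
            change_points.append(idx)
            segments.append((start, idx))  # left part
            segments.append((idx, end))    # right part
    return sorted(change_points)


# Made-up order quantities with two obvious level shifts.
qty = [10, 12, 11, 10, 50, 52, 49, 51, 11, 10, 12, 11]
print(binary_segmentation(qty, penalty=100.0))  # [4, 8]
```

The returned values are zero-based positions where a new segment starts. A higher penalty produces fewer change points, which is the trade-off the function's penalty setting controls.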

Before we can learn more about this function we need a data set to explore. We can download the Online Retail Data Set from the UCI Machine Learning repository (link). 

Let's load the CSV data into a new Aster table, "retail_sales_cpd", and review an example.

Our data set includes 541,909 rows. We pick one sample customer and product:

In the output we see that a customer from the Netherlands tends to place very large orders for vintage spaceboy lunch boxes. The price is almost constant, except for one order.

The quantity varies wildly. We see significant up and down changes (red boxes) throughout the order history.

Of course with large data sets we do not have time to manually sift through the data and create visual plots. Let's review what the ChangePointDetection function can do for us. 

Function Syntax:

Besides the normal function parameters there are a few additional parameters that we need to study more carefully:

  • We will partition the input data by customer and product and sort using the invoice date.  
  • The ValueColumn is the key field of interest where we want to detect changes. For our dataset we can pick qty, price or qty * price. Note that we can only specify a single column.
  • Accumulate is where we specify the identifying columns that we used in the partition and order by clause (such as customer_id, product_id and invoice_date). These extra output columns will help us interpret the results.
  • SegmentationMethod allows us to choose normal_distribution or linear_regression. (default = normal_distribution) 
  • SearchMethod is always set to binary. This is the only option for the function. We do not have to explicitly specify this parameter for that reason.
  • MaxChangeNum specifies the maximum number of changepoints to detect (default = 10)
  • Penalty can be BIC, AIC or a specific static threshold. The BIC and AIC criteria are used to evaluate the differences between the chosen change points and the original data. The penalty is included in the cost function as a guard against overfitting. BIC is the default option.
  • OutputOption can be CHANGEPOINT, VERBOSE or SEGMENT. This option controls what the function outputs:
    • CHANGEPOINT: only the changepoints (cptid column). This is the default option.
    • VERBOSE: the changepoints and the calculated differences (between the estimations for the H1 and H0 hypotheses).
    • SEGMENT: the specific segments that have been detected.
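
To make the Penalty parameter more concrete, here is a hedged Python sketch of how a BIC, AIC or static threshold can decide whether a candidate split counts as a change point. It uses the textbook likelihood-ratio form for a normal distribution: H0 is "one segment", H1 is "two segments", and the split is accepted when the gain in log-likelihood exceeds the penalty. The assumption that H1 adds two parameters (one extra mean and one extra variance) and all function names are illustrative; the exact statistic Aster computes may differ.

```python
import math
from typing import List, Union


def normal_log_likelihood(values: List[float]) -> float:
    """Maximized Gaussian log-likelihood of one segment (fitted mean and variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, 1e-9)  # avoid log(0) on a perfectly flat segment
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def penalty_threshold(penalty: Union[str, float], n: int, extra_params: int = 2) -> float:
    """Translate the penalty setting into a threshold on the log-likelihood gain."""
    if penalty == "BIC":
        return extra_params * math.log(n) / 2
    if penalty == "AIC":
        return float(extra_params)
    return float(penalty)  # a static threshold chosen by the user


def is_change_point(values: List[float], split: int,
                    penalty: Union[str, float] = "BIC") -> bool:
    """Accept the split if H1 (two segments) beats H0 (one segment) by more than the penalty."""
    h0 = normal_log_likelihood(values)
    h1 = normal_log_likelihood(values[:split]) + normal_log_likelihood(values[split:])
    return (h1 - h0) > penalty_threshold(penalty, len(values))


step = [10, 12, 11, 10, 50, 52, 49, 51]   # made-up data with a clear level shift
noise = [10, 12, 11, 10, 11, 10, 12, 11]  # made-up data without a shift
print(is_change_point(step, 4, "BIC"))    # True
print(is_change_point(noise, 4, "AIC"))   # False
print(is_change_point(step, 4, 100.0))    # False: the static threshold is too strict
```

The difference `h1 - h0` is the kind of quantity the verbose output option refers to when it reports the differences between the H1 and H0 estimations.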

We invoke the ChangePointDetection function and use linear regression to perform the segmentation:

Note that while we can use the ACCUMULATE feature to output additional columns, I prefer to join with the source table to get a full picture.

Let's review our basic line chart again. If we circle higher qty change points in red and lower qty change points in green we get this result:

Obviously the change points do not always correspond with straightforward highs and lows. If they did, we would not need the function to do all the calculations. A simple SQL windowing approach could accomplish the same.

Change detection on retail data can highlight those customers who have unique requirements and shopping habits. Possibly this group of customers is at higher risk of churn or lower satisfaction, and it is a good idea to perform further analysis using other techniques.

To quickly find those customers of interest and products with a higher number of change points we can aggregate our results. 
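
As a rough illustration of that aggregation, the Python sketch below counts change points per customer and product and ranks the pairs. The rows and identifiers are made up for the example; in practice they would come from the function's output joined with the columns used in the partition clause.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical change point output rows: (customer_id, product_id, cptid).
rows = [
    ("C1", "P1", 1), ("C1", "P1", 2), ("C1", "P1", 3),
    ("C2", "P1", 1),
    ("C2", "P7", 1), ("C2", "P7", 2),
]


def rank_by_change_points(rows: List[Tuple[str, str, int]]):
    """Count change points per (customer, product) pair, highest count first."""
    counts = Counter((customer, product) for customer, product, _ in rows)
    return counts.most_common()


for (customer, product), n in rank_by_change_points(rows):
    print(customer, product, n)
```

The pairs at the top of the list are the candidates for the further analysis mentioned above.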

Since our example is using a retail data set one question comes to mind: does seasonality impact the results? Yes, change detection algorithms do have a harder time with time series that include seasonality. It is recommended to remove the seasonal component if your results are below expectations.
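
One simple way to remove a seasonal component, sketched here in Python, is to subtract the average value of each position in the cycle before running the change detection. This assumes the season length is known and stable (the period of 3 below is made up), and it is only one of several possible approaches.

```python
from typing import List


def remove_seasonality(values: List[float], period: int) -> List[float]:
    """Subtract the average of each position in the seasonal cycle from the series."""
    if period < 1 or len(values) < period:
        raise ValueError("need at least one full period of data")
    phase_means = []
    for phase in range(period):
        phase_values = values[phase::period]  # every value at this position in the cycle
        phase_means.append(sum(phase_values) / len(phase_values))
    return [v - phase_means[i % period] for i, v in enumerate(values)]


# Made-up series: a repeating 10/20/30 pattern with a level shift of +10 halfway through.
series = [10, 20, 30, 10, 20, 30, 20, 30, 40, 20, 30, 40]
print(remove_seasonality(series, period=3))
# [-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
```

After the adjustment the repeating pattern is gone and only the level shift remains, which is the kind of change the detection is meant to find.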

So what is the value then?

In our example we reviewed retail sales using the quantity sold. We can apply the same technique to averages, counts and standard deviations. This opens the door to various use cases such as fraud, intrusion and anomaly detection, where a tangible corrective action is possible.

Another example could be a rise in call center complaints. A change detection analysis can pinpoint the time where one or more events triggered the increase in call volume.  

In manufacturing, the strength of a part can be affected by a change in the input materials.


Change point detection can go back in time, go through the historical sensor data and highlight the time stamps where changes occurred. Those time stamps can potentially be linked to a supplier switch, a different batch of input materials or a change in the operating environment.


Online Retail Data Set (link)

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Change Point Detection: a powerful new tool for detecting changes (link)

Change Point Detection with seasonal time series (link)