Data Science - IOT Pattern Discovery with SAX


The Discovery Challenge with IOT data:

As Internet of Things (IOT) data explodes, here are a few challenges:

  • Storage
  • Parsing from semi-structured stream
  • Classifying or Clustering Time Series Data
  • Finding Patterns or Motifs that indicate an imminent part or system failure

The storage and parsing challenges of big IOT data can be mitigated with a powerful MPP system that has enough nodes and ELT parsers (for JSON, for example). However, challenges remain in finding signals, patterns, or motifs in the data: detecting the 'last mile' before a component failure, spotting anomalies, and clustering or classifying series. This blog post is about an algorithm called SAX, short for Symbolic Aggregate Approximation. SAX is a dimension reduction, or discretization, technique for time series data.

SAX was invented by Eamonn Keogh and Jessica Lin in 2002, using funding from NSF Career Award 0237918. 

SAX basics:


Time series data poses a couple of problems. First, it is large, especially when it comes from sensors, which typically generate numeric readings at least once a second. If you are reading sensor data from an aircraft engine or a wind turbine, you can expect 100s of TBs of data each day across 100s to 1000s of sensors! With IOT, you can imagine the scale when it comes to millions of users, even with just 10s of sensors per user.

The problem is generally to find correlations or patterns across these multiple time series that can be indicators of a component or system failure. Maybe it's vibration, oil pressure, temperature, or a couple of other variables combined. The computational demands of processing 1000s of streams of data can be unrealistic even with a big investment in infrastructure. Random sampling or arbitrary aggregation of the time series does not help, as important signals in short time windows can get lost. There is also the problem of 'normalization': how do we create representations of data from multiple streams so that they are comparable?


SAX, or Symbolic Aggregate Approximation, reduces a long time series (a sequence of integers or decimals) to a short string of letters. If you have a day's worth of decimal data (86,400 seconds) from a home thermostat, then by choosing the alphabet size and other parameters, it can be reduced to a string that looks like:




The length of the string is a fraction of the length of the original time series (the X axis)! Compare the reduced string of 10+ or 100+ letters with the original stream, which had 86,400 data points, one for each second. More examples of SAX strings:







Each letter represents the average of the analog sensor reading (Y axis) across a time window (X axis). The data is also normalized (shifted to zero mean and scaled to unit standard deviation), so streams with different units and ranges become comparable. Normalization is a key feature of SAX. This opens up interesting possibilities for multi-genre predictive and prescriptive analytics: classification, clustering, machine learning, pattern search, and indexing time series data to 'look up' anomalous behavior. With SAX you can use mainstream text functions to build models from these letters, and even apply Markovian transition models by leveraging the transitions between letters.
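To make the averaging and normalization concrete, here is a minimal Python sketch of the SAX idea. It is not the Aster SAX function. It assumes the textbook recipe: z-normalize the series, average it over equal-width windows, then cut the Y axis at breakpoints that split a standard normal curve into equally likely tiles. The function name `sax` and its parameters `n_segments` and `alphabet_size` are made up for this illustration, and the sketch assumes the series length is a multiple of the number of windows.

```python
from statistics import NormalDist, fmean, pstdev


def sax(series, n_segments, alphabet_size):
    """Return a SAX string for `series` (illustrative sketch only)."""
    if len(series) % n_segments != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of n_segments")

    # 1. Z-normalize so that streams with different scales become comparable.
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        normalized = [0.0] * len(series)  # a flat series has no shape to encode
    else:
        normalized = [(x - mean) / std for x in series]

    # 2. Average each equal-width time window (compaction along the X axis).
    width = len(normalized) // n_segments
    averages = [fmean(normalized[i:i + width])
                for i in range(0, len(normalized), width)]

    # 3. Breakpoints that split a standard normal curve into
    #    `alphabet_size` equally likely tiles (compaction along the Y axis).
    normal = NormalDist()
    breakpoints = [normal.inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]

    # 4. Map each window average to a letter; 'A' is the lowest tile.
    letters = []
    for value in averages:
        tile = sum(value > b for b in breakpoints)
        letters.append(chr(ord("A") + tile))
    return "".join(letters)


# Toy readings invented for this example: low, middle, high, middle.
readings = [1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 5, 5, 5, 5]
print(sax(readings, n_segments=4, alphabet_size=3))  # ABCB
```

Sixteen readings collapse to four letters here. Raising `n_segments` keeps more detail along the X axis, and raising `alphabet_size` keeps more detail along the Y axis.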


Here is an example from the original white paper by Eamonn Keogh & Jessica Lin (2002). This is the original time series:




SAX emits these letters as representative of this stream: ACDBBDCA. It is very easy to develop intuitions about the data stream from them: A is a value in the lowest tile of the Y axis and D is a value in the uppermost tile. The letters are 'aggregates' derived over 8 intervals between 0 and 250 on the X axis.


Information loss during 'SAXification':

SAX is a dimension reduction technique. It compacts time series data in both the X and Y dimensions. There will be information loss, as you cannot reconstruct the original stream from the letters. However, the interesting patterns do not get lost, provided suitable arguments are used. These arguments are usually found through an iterative discovery process.

SAX string generation with Simplicity & Scale:

Teradata Aster has a SAX SQL/MR function that can generate SAX strings from raw time series data stored in a table. The function takes an Alphabet Size (A, B, C ... Z) and an Interval, and generates one string per id or partition! For example, if you have 1 year's worth of time series data for different types of sensor_ids for 1 M users, you can get a SAX string per user per sensor in a single pass in a few minutes.


Multi Genre Analytics with SAX output:

Once SAX strings are generated, they can be manipulated fairly easily and become an interesting input to different 'text' algorithms. We will discuss some use cases and possibilities next:

  • Part Failure Prediction
  • Similarity Clustering with Account Balances
  • Searching for Anomalies/Defects/Interesting Events

Part Failure Prediction:

Part failure prediction can be modeled on SAX data from multiple sensors by looking at the 'last mile' letters of each stream that ended in a failure. See below:


In the four SAX streams above, the last letter is roughly representative of a relative value within the stream that occurs just before a failure. Once we build ground truth on the last mile letters for all the streams, it can be fed to a Naive Bayes, Support Vector Machine, or Random Forest classifier to build a predictive model. The interval size and alphabet size control how much detail the model sees, so tuning them can improve its accuracy considerably. One can also apply a Hidden Markov Model (HMM) to build latent inference models on the streams. All the algorithms mentioned here are available as SQL/MR functions in the Teradata Aster Discovery Platform.
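As a rough illustration of the last mile idea, the Python sketch below takes the final few letters of each SAX string as features and trains a tiny categorical Naive Bayes classifier on them. It is a stand-in written for this post's example, not the Aster Naive Bayes function. The class `LastMileNaiveBayes`, the helper `last_mile`, the window length `k`, and the labeled toy strings are all hypothetical, and the sketch assumes every string has at least `k` letters.

```python
from collections import Counter, defaultdict
from math import log


def last_mile(sax_string, k=3):
    """Return the final k letters, the window just before the event."""
    return sax_string[-k:]


class LastMileNaiveBayes:
    """Categorical Naive Bayes over the positions of a last-mile window."""

    def __init__(self, alphabet, k=3):
        self.alphabet = alphabet
        self.k = k
        self.class_counts = Counter()
        # counts[label][position][letter] -> number of training streams
        self.counts = defaultdict(lambda: [Counter() for _ in range(k)])

    def fit(self, sax_strings, labels):
        for sax_string, label in zip(sax_strings, labels):
            self.class_counts[label] += 1
            for pos, letter in enumerate(last_mile(sax_string, self.k)):
                self.counts[label][pos][letter] += 1
        return self

    def predict(self, sax_string):
        total = sum(self.class_counts.values())
        window = last_mile(sax_string, self.k)
        scores = {}
        for label, n in self.class_counts.items():
            score = log(n / total)  # class prior
            for pos, letter in enumerate(window):
                seen = self.counts[label][pos][letter]
                # Laplace smoothing keeps unseen letters from zeroing the score.
                score += log((seen + 1) / (n + len(self.alphabet)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy data: streams that ended in failure drift into E/F.
streams = ["BCCDDF", "ABCDEF", "CCBDFF", "BCCBCB", "CBBCCC", "BCBCBB"]
labels = ["fail", "fail", "fail", "ok", "ok", "ok"]

model = LastMileNaiveBayes(alphabet="ABCDEF", k=3).fit(streams, labels)
print(model.predict("ABBDEF"))  # fail
print(model.predict("CBCBCB"))  # ok
```

Treating each position in the window as its own feature keeps the order of the letters, which matters when a failure is preceded by a climb rather than by a single extreme reading.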

Similarity Clustering with Account Balances:

There are a lot of scenarios where one would like to cluster time series that are similar to each other. SAX strings are good candidates as inputs to text functions such as NGRAM/QGRAM, TF/IDF, and VECTOR DISTANCE (which computes cosine similarity). SAX streams can be treated as strings and split into 2 or 3 letters at a time (QGRAM) to be fed to a vector space model. A cosine score of 1 means two streams have the same q-gram profile, a score of 0 means they share no q-grams, and everything in between provides a gray scale of how similar the two streams are.
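The q-gram and cosine steps can be sketched in a few lines of Python. This is only an illustration of the vector space idea, not the Aster NGRAM or VECTOR DISTANCE functions, and it uses raw q-gram counts rather than TF/IDF weights to stay short. The function names `qgrams` and `cosine_similarity` and the sample strings are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def qgrams(sax_string, q=2):
    """Count the overlapping q-letter substrings of a SAX string."""
    return Counter(sax_string[i:i + q]
                   for i in range(len(sax_string) - q + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two q-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical SAX strings for three streams.
stream_1 = qgrams("AABBCCDD")
stream_2 = qgrams("AABBCCDC")
stream_3 = qgrams("FFEEFFEE")

print(cosine_similarity(stream_1, stream_2))  # high: most q-grams are shared
print(cosine_similarity(stream_1, stream_3))  # 0.0: no q-grams in common
```

Pairwise scores like these can then be handed to any clustering method that accepts a similarity or distance matrix.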

One of the popular use cases is to cluster users whose account balances fluctuate over time in a similar way.

Searching for Anomalies/Defects/Interesting Events:

Searching SAX strings for anomalies is quite common; a popular use case is finding defects or interesting events in time series data. Let's assume we have an ALPHABET SIZE of 6 => A, B, C, D, E, F for a temperature SAX stream. We know that A and F are the extreme low and high values in the stream. If engineering finds that a run of consecutive F values over a length of time can lead to instability in the system, we can search past streams very easily for consecutive high values using a simple SELECT query:

SELECT * FROM vibration_sax_output WHERE sax_value like '%FFFFF%';

One can also use regular expressions on SAX output to find interesting anomalies ...
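For instance, a regular expression can report where each run of extreme values starts and how long it lasts, which a LIKE pattern cannot do. The sketch below uses Python's standard `re` module on a SAX string that has already been pulled out of the database. The function name `find_runs` and the sample string are hypothetical.

```python
import re


def find_runs(sax_string, letter, min_length):
    """Return (start, end, run) for each run of `letter` of at least min_length."""
    # Builds a pattern such as "F{5,}": five or more consecutive F's.
    pattern = re.compile(f"{re.escape(letter)}{{{min_length},}}")
    return [(m.start(), m.end(), m.group())
            for m in pattern.finditer(sax_string)]


# Hypothetical SAX output with one long run of the highest letter.
sax_value = "ABFFFFFFCDFFE"
print(find_runs(sax_value, "F", 5))  # [(2, 8, 'FFFFFF')]
```

The short run of two F's near the end is ignored because it is below the minimum length. Multiplying the start position by the interval size maps the match back to a time offset in the original series.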


