Data Science - Modeling Uncertainty with Multi Genre ...

Teradata Employee

I really like uncertainty. Measuring uncertainty is the cornerstone of predictive analytics and machine learning. Most theories in statistics and probability are about how to put a number on something that could 'potentially' occur in the future, given the different variables, past performance, etc. Whether it's who will win the election primaries, propensity to buy a product, customer churn, or how the stock market will do, it's all about modeling uncertainty and quantifying it somehow, and there is certainly more than one way to do it!

This blog post explores the basics without getting into formulas and such.

We toss a coin 10 times and get 7 heads and 3 tails. How should we calculate the odds for the 11th toss?

Swing the cat (Random): Just make a guess and check the result.

Frequentist (One truth): 7 out of 10 times we got heads, so how about going with a 70% chance for Heads and a 30% chance for Tails on the 11th toss? History has shown that heads come up more often than tails, hasn't it?

Bayesian (Changing belief): From the 1st toss all the way to the 10th, the running proportion of heads changes with every toss. Sometimes it's 0.5, sometimes 0.25 or 0.75, etc., but it's likely to end up between 0.5 and 1 given that heads dominated. So let's use the prior experience and create a formula for the future. Assuming the last toss was a Tail, we can now come up with the Bayesian formula (out of scope for this blog) that uses the prior knowledge to infer the likelihood of a Head or Tail for the next toss.
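To make the contrast a little more concrete, here is a minimal sketch of the two estimates for the 7-heads-in-10 example. The Bayesian part assumes a uniform Beta(1, 1) prior and reports the posterior mean; that prior and the function names are illustrative choices, not something this post prescribes.

```python
from fractions import Fraction

def frequentist_estimate(heads, tosses):
    """Point estimate: the observed share of heads."""
    return Fraction(heads, tosses)

def bayesian_estimate(heads, tosses, prior_heads=1, prior_tails=1):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    Beta(1, 1) is a uniform prior, an assumption made here for illustration.
    """
    return Fraction(heads + prior_heads, tosses + prior_heads + prior_tails)

freq = frequentist_estimate(7, 10)
bayes = bayesian_estimate(7, 10)
print(f"Frequentist P(heads): {float(freq):.3f}")  # 0.700
print(f"Bayesian P(heads):    {float(bayes):.3f}")  # 0.667
```

The prior pulls the Bayesian estimate toward 0.5. With many more tosses the two numbers would move closer together.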

Transition Models ...

The above examples of the Frequentist and Bayesian approaches were extremely naive, just to illustrate the point. However, to build good predictive models, repeated trials are essential. If the coin or the conditions of the toss are biased, the models can help score better during prediction! Here are a couple more that work with multiple trials or passes.

Markov Chains (Transitions): If we look at two consecutive tosses at a time, from the first toss to the last, we get pairs like H+T, T+T, T+H, T+T, etc. Let's say 50% of the time we get Tails followed by Heads, 25% of the time Heads followed by Heads, 12% Heads followed by Tails, and 13% Tails followed by Tails. Since the last toss was a Tail, one can use the transition ratios in formulas to calculate the odds for the 11th toss. Markov Chain predictions work like a charm if a toss is somewhat dependent on the previous toss ...
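As a rough sketch of how those transition ratios could be used, the snippet below keeps only the pairs that start with a Tail and rescales them so they sum to 1. The percentages are the ones quoted above; the function name and the dictionary layout are made up for illustration.

```python
# Pair frequencies from the text: (previous toss, next toss) -> share of all pairs.
pair_share = {
    ("T", "H"): 0.50,
    ("H", "H"): 0.25,
    ("H", "T"): 0.12,
    ("T", "T"): 0.13,
}

def transition_probs(pair_share, previous):
    """Rescale the pair shares that start with `previous` into P(next | previous)."""
    row = {nxt: share for (prev, nxt), share in pair_share.items() if prev == previous}
    total = sum(row.values())
    return {nxt: share / total for nxt, share in row.items()}

# The last toss was a Tail, so condition on "T".
after_tail = transition_probs(pair_share, "T")
print({k: round(v, 3) for k, v in after_tail.items()})  # {'H': 0.794, 'T': 0.206}
```

Under these numbers, a Tail is followed by a Head roughly 79% of the time (0.50 / 0.63), which is a noticeably stronger call than the plain 70% frequency.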

Hidden Markov (Genie with Transitions): There is a Genie whose mood changes every toss. It's happy 70% of the time and sad the remaining 30%. When it's happy, it picks Heads 70% of the time, and when it's sad, it picks Tails 70% of the time. Use this model to predict the 11th toss. A Hidden Markov Model (HMM) will work nicely if there are more examples to train on than a single 10-step sequence ;). See my previous blog post on Hidden Markov ...

Technically, Markovian ideas use a bit of both the Bayesian (priors) and frequentist approaches, and I think of them as a hybrid.
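Here is a small sketch of the Genie model using forward filtering, with the mood and Heads/Tails percentages taken from the description above. The text does not say how the mood moves from one toss to the next, so the transition table below assumes the mood is simply redrawn on every toss (70% happy, 30% sad). That table, like all the names in the code, is an assumption for illustration only.

```python
STATES = ("happy", "sad")
START = {"happy": 0.7, "sad": 0.3}        # from the text: happy 70% of the time
EMIT = {
    "happy": {"H": 0.7, "T": 0.3},        # a happy Genie picks Heads 70% of the time
    "sad": {"H": 0.3, "T": 0.7},          # a sad Genie picks Tails 70% of the time
}
# ASSUMPTION: the mood is redrawn independently before every toss.
TRANS = {
    "happy": {"happy": 0.7, "sad": 0.3},
    "sad": {"happy": 0.7, "sad": 0.3},
}

def predict_next_head(tosses, trans=TRANS):
    """Forward filtering: track P(mood | tosses so far), then predict the next toss."""
    belief = dict(START)  # belief about the mood before the first toss
    for toss in tosses:
        # Update: weight each mood by how well it explains the observed toss.
        weighted = {s: belief[s] * EMIT[s][toss] for s in STATES}
        total = sum(weighted.values())
        posterior = {s: w / total for s, w in weighted.items()}
        # Predict: push the posterior through the mood transition.
        belief = {s2: sum(posterior[s1] * trans[s1][s2] for s1 in STATES)
                  for s2 in STATES}
    return sum(belief[s] * EMIT[s]["H"] for s in STATES)

# Only the last toss (a Tail) is known from the text, so that is all we feed in.
print(round(predict_next_head(["T"]), 2))  # 0.58
```

With an independently redrawn mood, the history does not matter and the answer is always 0.7 × 0.7 + 0.3 × 0.3 = 0.58. Passing a different `trans` table in which moods tend to persist would make recent tosses count, which is where an HMM starts to earn its keep.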

Classification Techniques

We can also treat the 10 positions as separate variables and train a model on multiple trials, with the 11th toss as the dependent variable, and do machine learning!

Naive Bayes, Support Vector Machines, Logistic Regression, Random Forest, etc. can all be used to predict the 11th toss. If the conditions favor certain outcomes, these models would indeed capture that from the numerous trials to help with prediction.
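To illustrate the idea, here is a bare-bones Naive Bayes sketch in plain Python that uses the first 10 tosses of each trial as features and the 11th as the label. The four training trials are made-up toy data (each happens to end with the 10th toss repeated), and the function names are hypothetical; a real project would more likely use a library implementation and far more trials.

```python
import math
from collections import defaultdict

def train_naive_bayes(trials):
    """Count labels and per-position symbols.

    Each trial is a string of 11 tosses: 10 features, then the label.
    """
    label_counts = defaultdict(int)
    # feature_counts[label][position][symbol] -> count
    feature_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for trial in trials:
        features, label = trial[:10], trial[10]
        label_counts[label] += 1
        for pos, symbol in enumerate(features):
            feature_counts[label][pos][symbol] += 1
    return label_counts, feature_counts

def predict(model, features):
    """Return the label ('H' or 'T') with the higher log-probability score."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label in "HT":
        # Log prior and log likelihoods, both with add-one (Laplace) smoothing.
        score = math.log((label_counts[label] + 1) / (total + 2))
        for pos, symbol in enumerate(features):
            seen = feature_counts[label][pos][symbol]
            score += math.log((seen + 1) / (label_counts[label] + 2))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical): 10 tosses followed by the 11th toss.
trials = [
    "HHTHTHHTHH" + "H",
    "THHTHTHHTT" + "T",
    "HTTHHHTHTH" + "H",
    "TTHHTHHTHT" + "T",
]
model = train_naive_bayes(trials)
# A new sequence with 7 heads and 3 tails whose last toss is a Tail.
print(predict(model, "HHHTHHTHHT"))  # T
```

Four trials are far too few to trust; the point is only the shape of the problem: 10 input columns, 1 target column, and any classifier you like in between.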

Try and fail fast with Multi-Genre 

As with any other problem, in a multi-genre analytics world (with tools like Teradata Aster) you can invent your hypothesis, test it in many ways, and fail fast, without limiting yourself to a handful of classical techniques. Beyond trying individual techniques, you can combine algorithms in interesting ways with your best guess of reality and see if it is indeed true! If your hypothesis mirrors reality correctly, the prediction results will look great indeed ...