Data Science - Correlation, Causation & Implication Rules ...

Teradata Employee

I really love this topic because it has so much business relevance and an ROI element to it. It's a fun one too, to think about use cases around these concepts. Every line of business loves prediction. If X happens, we want to know what follows, or whether some event Y is likely to happen! If a customer browsed a bunch of web pages and added items to a cart, we want to know whether they will check out or not. If we can predict that reliably, say 90% of the time, data products can be built around abandoned carts, like sending coupons and such. Similarly, if we know that 90% of the time products A and B are bought, product C is also in the basket, that's power to the retailer.

This is where the ideas of correlation, causation and implication rules play a role. The three are really close in definition and can confuse someone who is new to data science. However, the subtleties are where the returns are. This blog post is an attempt to understand the three ideas.

What is Correlation?

Correlation is a simple statistical measure. We look at how two variables change together. For example, "Umbrella" and "Rain": if someone grew up in a place where it never rained and saw rain for the first time, this person would observe that whenever it rains, people use umbrellas. Also, when it doesn't rain, folks don't carry an umbrella. By definition, 'rain' and 'umbrella' are said to be correlated!

So what's missing here? Well, this person knows that rain and umbrellas are connected, but wouldn't know whether the umbrellas caused the rain or the rain caused the umbrellas (this person doesn't pay attention in life in this example). So Correlation doesn't say which one was responsible for the other! This is what Causation solves.
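To make the "no direction" point concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The ten days of rain and umbrella observations are invented for illustration, and the helper name pearson is my own. The thing to notice is that swapping the two variables gives exactly the same number, so the statistic cannot tell us which one drives the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical observations for ten days: 1 = yes, 0 = no.
rain     = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
umbrella = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

r = pearson(rain, umbrella)
r_swapped = pearson(umbrella, rain)

print(f"correlation(rain, umbrella) = {r:.2f}")          # 0.82
print(f"correlation(umbrella, rain) = {r_swapped:.2f}")  # 0.82, same value
```

A value close to 1 says the two move together; it says nothing about who moved first.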

Causation:

Rain causes umbrellas to come out - that's causation. As you can infer, Causation has a "before" and "after" aspect to it. If the variables have past, present and future elements to play with, then it's a step towards causation. If this person were to observe that the rain came first and then the umbrellas came out, and also that when it stopped raining the umbrellas were put away, it would be a step towards closing in on Causation. So what are we missing?

Lurking Variables:

This is the one that blindsides folks as they jump to conclusions about causation. Here's a great example that I read recently. Some folks who don't remove their shoes before bed always end up with a headache when they wake up in the morning. A high correlation between shoes and headaches was observed, but causation eluded this example. Why? Well, in this case, we were not told that these folks drank too much each night and didn't bother to remove their shoes. Drinking actually caused the headache. The observations only included the shoes and headache variables. The unobserved variable "Drinking" here is known as the "Lurking Variable".

Even with perfect Correlation in observational data, a data scientist could easily come to wrong conclusions/explanations by not being aware of the lurking variable(s).
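The shoes and headache story can be sketched with a small table of counts. All of the numbers below are made up, chosen so that drinking drives both sleeping with shoes on and waking up with a headache; the function name headache_rate is hypothetical too. Looking only at shoes and headaches, the link seems strong. Once we split the nights by whether the person drank, shoes make no difference at all.

```python
# Hypothetical night counts, keyed by (drank, shoes_on, headache).
# The numbers are invented so that drinking drives both other variables.
nights = {
    (1, 1, 1): 81, (1, 1, 0): 9, (1, 0, 1): 9, (1, 0, 0): 1,
    (0, 1, 1): 1,  (0, 1, 0): 9, (0, 0, 1): 9, (0, 0, 0): 81,
}

def headache_rate(shoes_on, drank=None):
    """P(headache | shoes_on), optionally restricted to one drinking group."""
    total = with_headache = 0
    for (d, s, h), count in nights.items():
        if s != shoes_on or (drank is not None and d != drank):
            continue
        total += count
        with_headache += count * h
    return with_headache / total

# Drinking is hidden: shoes look like a strong predictor of headaches.
print(f"all nights, shoes on : {headache_rate(1):.2f}")   # 0.82
print(f"all nights, shoes off: {headache_rate(0):.2f}")   # 0.18

# Drinking is observed: within each group, shoes change nothing.
for drank in (1, 0):
    label = "drank" if drank else "sober"
    print(f"{label}, shoes on : {headache_rate(1, drank):.2f}")
    print(f"{label}, shoes off: {headache_rate(0, drank):.2f}")
```

In this toy data the headache rate is 0.90 on drinking nights and 0.10 on sober nights regardless of shoes, which is exactly what a lurking variable looks like once it is finally observed.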

Implication Rules:

Implication Rules are propositions that say that if A is present, then B is also present. An implication rule is denoted by A->B.

As you can see, we are not saying A causes B; we are just saying that B occurs when A occurs. A strong rule also suggests that B is less likely to occur if A does not occur. It makes a Correlation connection, but still doesn't explain Causation. Implication rules are very popular in the output of association mining, where we typically look at the market basket. If people buy milk and bread, then they are more likely to buy XYZ. Note that the rule by itself does not establish that people will NOT buy XYZ when there is no milk and bread in the basket, only that they are less likely to. It is tricky to prove that causation exists (who knows what we don't know?), yet the analysis is hugely valuable for making a short-term business decision.
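For a feel of how a rule like {milk, bread} -> butter is usually scored, here is a small sketch using the three standard association-rule measures: support (how often A and B appear together), confidence (how often B appears given A) and lift (confidence divided by how often B appears overall). The ten baskets are invented, butter stands in for XYZ, and the helper names support and rule_metrics are my own; this is plain Python for illustration, not how any particular product computes it.

```python
# Ten hypothetical market baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "jam"},
    {"eggs", "butter"},
    {"eggs", "jam"},
    {"jam"},
    {"milk", "bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    supp_both = support(antecedent | consequent)
    confidence = supp_both / support(antecedent)
    lift = confidence / support(consequent)
    return supp_both, confidence, lift

A, B = {"milk", "bread"}, {"butter"}
supp, conf, lift = rule_metrics(A, B)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
# support=0.40 confidence=0.80 lift=1.60

# How often does butter show up when milk and bread are NOT both present?
without_a = [basket for basket in baskets if not A <= basket]
rate_without_a = sum(B <= basket for basket in without_a) / len(without_a)
print(f"P(butter | no milk+bread) = {rate_without_a:.2f}")  # 0.20
```

A lift above 1 means butter is more likely with milk and bread than without, yet butter still turns up in some baskets that have neither, so the rule describes a tendency and not a guarantee.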

Finding Causation Insights:

This is something a data scientist cannot do without getting a business person involved. You can find all kinds of correlations and implication rules, but the fact is that the lurking variables will be outside the observable data, hence causation is generally found in a consultative manner with the business.

Can ML Prediction algorithms work without knowing Causation?

Yes. Correlations are sufficient for building machine learning models from data, and we can predict with them. It will work without fully understanding the causation. However, that will be true only as long as the lurking variables don't change! Models will break if lurking variables change without notice, and ML models have to be retrained again and again to compensate for that.
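Here is a toy sketch of that failure mode, reusing the shoes and headache idea. The "model" is just a lookup that learns the majority outcome for each value of shoes_on, and both sets of counts are invented. In the first data set drinkers mostly sleep with their shoes on; in the second, I assume the habit has changed and drinkers now mostly take their shoes off, while drinking still causes the headaches. The names fit and accuracy are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical counts keyed by (shoes_on, headache). Drinking is not recorded.
train = {(1, 1): 82, (1, 0): 18, (0, 1): 18, (0, 0): 82}

# Assumed later period: drinkers now usually remove their shoes, so shoes
# no longer track drinking, but drinking still causes the headaches.
shifted = {(1, 1): 10, (1, 0): 10, (0, 1): 90, (0, 0): 90}

def fit(counts):
    """Learn the majority headache outcome for each value of shoes_on."""
    votes = defaultdict(lambda: [0, 0])
    for (shoes_on, headache), n in counts.items():
        votes[shoes_on][headache] += n
    return {s: int(v[1] > v[0]) for s, v in votes.items()}

def accuracy(model, counts):
    """Share of observations where the model's prediction matches the outcome."""
    correct = sum(n for (s, h), n in counts.items() if model[s] == h)
    return correct / sum(counts.values())

model = fit(train)  # learns: shoes on -> headache, shoes off -> no headache
acc_before = accuracy(model, train)
acc_after = accuracy(model, shifted)

print(f"accuracy while the habit holds  : {acc_before:.2f}")  # 0.82
print(f"accuracy after the habit changes: {acc_after:.2f}")   # 0.50
```

The model never knew about drinking, so nothing in its inputs warned it that the relationship it relied on had disappeared.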

Conclusion:

With tools like Teradata Aster Multi-Genre Advanced Analytics (TM), data scientists can run correlations on hundreds of columns of data in one pass. They can also find key implication rules by crunching through billions of rows of retail market basket data. One can do dimension reduction and machine learning, build models and run predictions, and do pathing to see the dominant event sequences that lead to a target state. Visualization is done using a tool like App Center. From there, it is only a couple more steps to learn about Causation once the Correlation and Implication insights are communicated to the business folks.