Data Science - How Data is the Boss of Analytic Approaches ...


It may sound incredible, even uncanny, if I tell you that the data you have chooses the analytic approach and algorithms it wants to work with! It's like saying a perfect sunset photograph chooses the camera, or that a cat decides its owner!

As one works on more and more data science projects, it becomes increasingly apparent (not just to me) that the data chooses which quantitative approach works best with it! Feel free to disagree and/or comment on my blog post.

More power to the analysts and data scientists who 'listen' to the data and then choose the algorithms. Most analysts, however, start out with their 'best practices' or assume that 'this algorithm has worked successfully for me before'. That may not be natural to the current data, and this 'presumed' approach can lead to major frustration.

Data Driven Approach

The best data science results are achieved by listening to the data. This is something that every data scientist must learn how to do. Careful curation of the data with histograms and distributions, questioning the data, and understanding the premise and the environment will lead to good decisions on matching the right algorithm to the right data.
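As a minimal sketch of that kind of listening, here is one way to eyeball a distribution in Python with pandas and matplotlib; the file name and the 'seconds_on_site' column are hypothetical, stand-ins for whatever measure you are curating.

```python
# Minimal sketch of "listening" to the data before picking an algorithm.
# 'web_sessions.csv' and the 'seconds_on_site' column are illustrative
# assumptions, not real artifacts from this post.
import pandas as pd
import matplotlib.pyplot as plt

sessions = pd.read_csv("web_sessions.csv")        # hypothetical input data
print(sessions["seconds_on_site"].describe())     # quick summary statistics
sessions["seconds_on_site"].hist(bins=50)         # eyeball the shape of the distribution
plt.xlabel("seconds on site")
plt.ylabel("number of sessions")
plt.show()
```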

The other approach is to do 'overload analysis': try all the predictive algorithms that exist on the data and see which one sticks - if indeed you have the luxury of owning the best platform out there! (Hint: Aster Platform!)
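A rough sketch of what overload analysis can look like, assuming a generic scikit-learn setup (the synthetic X, y here are only placeholders for your own feature matrix and labels):

```python
# 'Overload analysis': score several off-the-shelf classifiers on the same
# data and see which one sticks. The synthetic data is a placeholder.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": GaussianNB(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```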

Reality Check

Data reflects reality, and the algorithms are designed to work with real data. If data reflects physical reality (# of customers arriving, income distribution, # of products they buy, cart abandonment, etc.), then the algorithms will work effectively. That is why it is extremely difficult to work with synthetic/manufactured data; in fact, it's an art to create synthetic data that models reality. Personally, I like to stick to real data for the best results.

Here are some examples from nature:

  • All graphs created from reality - the # of connections for a user (Facebook, Twitter, cell phone contacts, etc.), the # of edges to a vertex - follow a power law distribution: a few people have thousands of connections, and ever more people have fewer and fewer connections. No graph in nature has an equal distribution of connections!
  • The same power law distribution shows up in how many seconds people spend on a website. Just look at the # of seconds people spend on your website, from your web logs: you will always find that a huge # of people spend very little time, with a decreasing long tail of people who spend more and more time. The power law also applies to the income of the population. Here are some slides on the degree distribution exhibited by real-world graph networks, where you can see the power law at work.
  • Another example: if you measure the height of people in your community, a high school, or your company, it always follows an inverted-U-shaped curve (a Gaussian or normal distribution). People's weight follows the same kind of distribution.
  • If you look at the distribution of how people arrive at a meeting or a concert, or how parents drop off their kids at school, it follows a Poisson distribution (which is neither a power law nor a normal distribution). A majority of people arrive around the event start time, but fewer arrive 5, 10, or 15 minutes before or after it. In fact, if you understand this distribution, you can predict how many people will be at the concert or the movie by looking at how many arrive 30, 20, and 10 minutes early! Distributions thus allow for reliable forecasting. (A small simulated sketch of these three shapes follows this list.)
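To make the three shapes concrete, here is an illustrative sketch that simulates samples with those distributions and plots them side by side; in practice the data would come from your own graphs, logs, or measurements, not from a random number generator.

```python
# Illustrative only: simulated samples showing the three shapes above
# (power-law-like, normal, Poisson). Parameters are arbitrary choices.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

degrees = (rng.pareto(a=2.0, size=10_000) + 1).astype(int)   # heavy-tailed "connections per user"
heights = rng.normal(loc=170, scale=10, size=10_000)         # heights in cm, roughly normal
arrivals = rng.poisson(lam=12, size=10_000)                  # arrivals per time slot, Poisson

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(degrees, bins=100)
axes[0].set_yscale("log")              # long tail is easier to see on a log scale
axes[0].set_title("power law")
axes[1].hist(heights, bins=50)
axes[1].set_title("normal")
axes[2].hist(arrivals, bins=30)
axes[2].set_title("Poisson")
plt.tight_layout()
plt.show()
```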

Algorithms are designed for known distributions (like Poisson, normal, etc.). The math and problem solving built into the algorithms rest on assumptions about how reality (aka the data distribution) behaves in different scenarios.

Dude - What is my data distribution?

It is very important to understand what the data distribution is. Basic statistical techniques (like a GROUP BY, COUNT(*), TOP N) can give an insight into how the data is spread out. You should be able to tell whether the data is bad or assembled incorrectly just by looking at the distribution. There are also analytical functions that can profile your data to find which distribution it follows and its key parameters.
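For those working outside the database, here is a small sketch of the same GROUP BY / COUNT(*) / TOP N style of profiling in pandas; 'orders.csv' and its columns are hypothetical.

```python
# Quick profiling in pandas, mirroring GROUP BY, COUNT(*), and TOP N.
# 'orders.csv' and 'customer_id' are illustrative assumptions.
import pandas as pd

orders = pd.read_csv("orders.csv")

# GROUP BY + COUNT(*): how are orders spread across customers?
per_customer = orders.groupby("customer_id").size()

# TOP N: the heaviest customers, plus the overall shape of the spread
print(per_customer.sort_values(ascending=False).head(10))
print(per_customer.describe())
```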

What other factors influence algorithm choices, other than data distribution?

Algorithms are designed not only around distributions found in reality, but also around data types. Some algorithms are designed for numeric data, others for categorical data, and yet others for mixed data. So the data types also help narrow down the algorithms that may be suitable for the data.
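A small sketch of letting the data types narrow the candidates; the DataFrame and the suggested algorithm families below are illustrative assumptions, not a definitive mapping.

```python
# Let the column types suggest which families of algorithms to shortlist.
# 'customers.csv' is a hypothetical input; the printed hints are examples.
import pandas as pd

df = pd.read_csv("customers.csv")

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(include=["object", "category"]).columns

if len(categorical_cols) == 0:
    print("All numeric: regression, k-means, SVMs are natural starting points")
elif len(numeric_cols) == 0:
    print("All categorical: naive Bayes, association rules, decision trees fit well")
else:
    print("Mixed types: tree ensembles, or encode categoricals before numeric methods")
```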

A simple trick to tackle any data distribution

There is nothing like having a lot of algorithms at the data's disposal! If algorithm A fails, try algorithm B, and so on. Standardizing how the algorithms are used makes it easier to keep trying things until something works. As I mentioned before, this falls under the category of Multi Genre Advanced Analytics (TM), where one can try and fail with different approaches and learn, with hindsight, which algorithm works best with the data.

Aren't there smart algorithms that can be used no matter what data you give them?

Yes and no! Algorithms are not only based on assumptions about data distributions, but are also designed to discover broad generalizations. Today there is no such thing as a 'Super Algorithm' that can make predictions on any data given to it - that's the holy grail. Even the most sophisticated techniques, like deep learning, require data scientists to set up multiple layers of learning, configure them according to the data provided, and experiment and finalize the model through multiple iterations.
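As a minimal sketch of that configure-and-iterate loop, here is a small multi-layer perceptron whose layer layout is searched over; the synthetic data and the grid of layer sizes are assumptions chosen just to illustrate the iteration, not a recipe.

```python
# Even a flexible learner has to be configured to the data: iterate over
# a few network layouts and keep the one that validates best.
# The synthetic data and the layer-size grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(32,), (64, 32), (128, 64, 32)]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, f"cv accuracy = {search.best_score_:.3f}")
```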

Conclusion:

It's great to understand specific instances of why people buy product X, Y, or Z, or churn, but knowing the distributions and data types can actually help narrow down which analytic approaches might be suitable for the entire data set. Of course, you can always fall back to a multi-genre approach where you test all your assumptions about the data, fail fast, and learn. In the end, the data will have had its say!