It may sound incredible, even uncanny, if I tell you that the data you have chooses the analytic approach and algorithms it wants to work with! It's like saying that a perfect sunset chooses the camera, or that a cat chooses its owner!
As one works on an increasing number of data science projects, it becomes more and more apparent (and not just to me) that the data chooses the quantitative approach that works best with it! Feel free to disagree and/or comment on my blog post ...
More power to the analysts and data scientists who 'listen' to the data and then choose the algorithms. Most analysts, however, start out with their 'best practices' or assume that 'this algorithm has worked successfully for me before'. That choice may not be natural to the current data, and such a 'presumed' approach can lead to major frustration.
The best data science results are achieved by listening to the data. This is something that every data scientist must learn how to do. Carefully examining the data with histograms and distributions, questioning it, and understanding its premise and environment will lead to good decisions on matching the right algorithm to the right data.
The other approach is to do 'overload analysis': try all the predictive algorithms that exist on the data and see which one sticks, if indeed you have the luxury of owning the best platform out there! (Hint: Aster Platform!)
Data reflects reality, and algorithms are designed to work with real data. If the data reflects physical reality (# of customers arriving, income distribution, # of products they buy, cart abandonment etc.), then the algorithms will work effectively. That is why it is extremely difficult to work with synthetic/manufactured data. In fact, it's an art to create synthetic data that models reality. Personally, I like to stick to real data for the best results.
Here are some examples from nature:
Algorithms are designed for known distributions (like Poisson, normal etc.). The math and problem solving built into the algorithms rest on assumptions about what reality looks like (aka the data distribution) in different scenarios.
It is very important to understand what the data distribution is. Basic techniques (like a GROUP BY, COUNT(*), TOP N) can give insight into how the data is spread out. You should be able to tell whether the data is bad or assembled incorrectly by looking at the distribution. There are analytical functions that can profile your data to find which distribution it follows and what its key parameters are.
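To make the profiling idea concrete, here is a minimal sketch in plain Python rather than SQL. The function name `profile` and the sample values are hypothetical, made up for illustration. It counts values the way a GROUP BY with COUNT(*) would, keeps the TOP N, and compares the variance to the mean as a rough hint about whether count data looks Poisson-like.

```python
from collections import Counter
from statistics import mean, pvariance


def profile(values, top_n=3):
    """Summarize how a column of discrete values is spread out."""
    counts = Counter(values)  # same idea as GROUP BY value, COUNT(*)
    m = mean(values)
    var = pvariance(values)
    return {
        "top": counts.most_common(top_n),  # same idea as TOP N by frequency
        "mean": m,
        "variance": var,
        # For Poisson-distributed counts the variance is roughly equal to
        # the mean, so a ratio far from 1 hints that Poisson may not fit.
        "dispersion": var / m if m else None,
    }


# Hypothetical sample: number of products bought per customer visit
items_per_visit = [1, 2, 2, 3, 1, 0, 2, 4, 1, 2]
summary = profile(items_per_visit)
print(summary["top"])
print(summary["dispersion"])
```

A dispersion ratio is only a first look, not a formal goodness-of-fit test, but it is often enough to flag data that has been assembled incorrectly.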
Algorithms are designed not only around the distributions found in reality, but also around data types. Some algorithms are designed for numeric data, others for categorical data, and yet others for mixed data. So the data types also help narrow down the algorithms that may be suitable for the data.
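As a rough illustration of that narrowing step, the sketch below labels each column as numeric, categorical or mixed, and then labels the table as a whole. The helper names `column_kind` and `dataset_kind` and the sample table are assumptions made up for this example, not part of any particular platform.

```python
def column_kind(values):
    """Label one column as 'numeric', 'categorical', 'mixed' or 'empty'."""
    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    flags = [is_number(v) for v in values if v is not None]  # skip missing values
    if not flags:
        return "empty"
    if all(flags):
        return "numeric"
    if not any(flags):
        return "categorical"
    return "mixed"


def dataset_kind(columns):
    """Label a whole table (dict of column name -> values) by its column kinds."""
    kinds = {column_kind(vals) for vals in columns.values()} - {"empty"}
    if not kinds:
        return "empty"
    if kinds == {"numeric"}:
        return "numeric"
    if kinds == {"categorical"}:
        return "categorical"
    return "mixed"


# Hypothetical customer table
customers = {
    "income": [52000, 61000, None, 48000],
    "segment": ["gold", "silver", "gold", "bronze"],
}
print({name: column_kind(vals) for name, vals in customers.items()})
print(dataset_kind(customers))
```

A table that comes back as mixed, like this one, points toward algorithms that accept both kinds of input, or toward encoding the categorical columns first.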
There is nothing like having a lot of algorithms at the data's disposal! If algorithm A fails, try algorithm B, and so on. Standardized usage makes it easier to keep trying things until something works. As I mentioned before, this falls under the category of Multi Genre Advanced Analytics (TM), where one can try and fail with different approaches and learn, with hindsight, which algorithm works best with the data.
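The "if A fails, try B" loop can be sketched in a few lines once every algorithm is called the same way. This is only an illustration under stated assumptions: the three toy predictors, the `try_all` helper and the sample numbers are all hypothetical, and a real project would plug in real algorithms behind the same interface.

```python
from statistics import mean, median


def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m  # always predict the training mean


def fit_median(xs, ys):
    m = median(ys)
    return lambda x: m  # always predict the training median


def fit_line(xs, ys):
    # Ordinary least squares with a single feature
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return lambda x: my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def mean_abs_error(model, xs, ys):
    return mean([abs(model(x) - y) for x, y in zip(xs, ys)])


def try_all(candidates, train, holdout):
    """Fit every candidate on train, score it on holdout, return the best."""
    scores = {name: mean_abs_error(fit(*train), *holdout)
              for name, fit in candidates.items()}
    best = min(scores, key=scores.get)  # lowest error wins
    return best, scores


# Hypothetical data with a roughly linear trend
train = ([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 8.1, 9.8, 12.1])
holdout = ([7, 8], [14.0, 16.1])
candidates = {"mean": fit_mean, "median": fit_median, "line": fit_line}
best, scores = try_all(candidates, train, holdout)
print(best, scores)
```

Because every candidate shares the same fit-then-predict shape, adding another algorithm is one more entry in the dictionary, which is what makes failing fast cheap.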
Yes and no! Algorithms are not only based on assumptions about data distributions, but are also designed to discover broad generalizations. Today, there is no such thing as a 'Super Algorithm' that can make predictions on any data given to it - that's the holy grail category. Even the most sophisticated techniques, like Deep Learning, require data scientists to set up multiple layers of learning, configure them according to the data provided, experiment, and finalize the model through multiple iterations.
It's great to understand specific instances of why people buy product X, Y or Z, or why they churn, but knowing the distributions and data types can actually help narrow down which analytic approaches might be suitable for the entire data. Of course, you can always fall back on a multi-genre approach where you test all your assumptions about the data, fail fast and learn. In the end, the data will have had its say!