Data Science is a pretty overloaded term. We have many platforms, languages, and algorithms, and as practitioners we all have our favorite tools. In this blog post, I will explore the two main goals behind every data science exercise. Sometimes they are intertwined, but it is super important to call them out separately for clarity.
The two popular goals are:
Discovery & Finding Insights
Modeling & Prediction (dev or production/operationalization)
with analytics powering both. Every project we do is one or the other, or a mix of the two. In some cases we do more discovery and almost no modeling/prediction. Sometimes modeling/prediction is done with elements that have already been discovered, and sometimes we go back and forth! Modeling/prediction can happen in both Dev and Production.
Isn't Discovery part of the Modeling/Prediction exercise?
Yes and no. This is the confusing part for many folks. Pure discovery, which deserves its own category, is stumbling into something unexpected, whether or not you are looking for a particular outcome. Great examples of discovery are penicillin, X-rays, finding America :), etc. The keen eye of the scientist or the explorer stumbles into something that is inexplicable outside a belief cloud.
Hmmm, that's odd. It shouldn't be there ...
Most CSI-style shows capture the essence of discovery in an investigation and hence have a captive audience! Discovery & Finding is what makes life so interesting. In the Data Science world, it is usually about finding the early insights into a business problem.
Tools for Discovery & Finding Insights
We need tools for discovery to parse new things. Without an optical or a radio telescope, you cannot find planets. Even with the telescope, only a prepared mind would spot something that most people miss. However, a telescope is still necessary :)
Modeling and Prediction
After a scientist discovers one or more interesting outcomes, we want to take it to the next level, which is to find more in an automated way! Modeling and Prediction allow us to do that. We know what we are looking for, so let's create a system that learns from it and finds more of it. Create the conditions under which we can repeatedly find things, generalize, and find more. Operationalize by automation ...
Some examples of Insights, augmented by Modeling and Prediction:
We use an insight-finding tool to conclude that people who arrive with a particular browsing pattern seem to click on the 'buy' button. After the first 'aha' moment, we cleanse the data, weed out the noise, and build path models that 'learn' from historical data using quantitative methods. For an unknown path, we can now predict with some confidence whether the visitor is likely to click on the 'buy' button or not! Solving the last mile problem can improve conversions significantly. If we do not have the discovery tool, we may not even know where to look!
We use an insight-finding tool to find the first insight into customer call center comments. We find a cluster full of "similar" comments about a topic that always seems to refer to some interesting phrases and words. Once we have the starting point, we can create quantitative techniques to capture more clusters, or more comments that contain spelling mistakes and human errors. That part is easier because we know we want to do more of X.
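To make the browsing-pattern example a little more concrete, here is a minimal Python sketch of a "path model". Treat it as an illustration under stated assumptions, not as the method any particular tool uses: the page names and sessions are made up, the helper names (transitions, fit_path_model, buy_score) are hypothetical, and the scoring rule (an average of smoothed buy rates per page-to-page transition) is a simple stand-in for whatever quantitative technique a real project would choose.

```python
from collections import defaultdict


def transitions(path):
    """Return the consecutive page pairs in a browsing path."""
    return list(zip(path, path[1:]))


def fit_path_model(sessions):
    """Count, per page transition, how many sessions contained it
    and how many of those sessions ended with a 'buy' click."""
    seen = defaultdict(int)
    bought = defaultdict(int)
    for path, did_buy in sessions:
        # set() so a transition repeated within one session counts once
        for pair in set(transitions(path)):
            seen[pair] += 1
            if did_buy:
                bought[pair] += 1
    return seen, bought


def buy_score(model, path):
    """Average the smoothed buy rate over the transitions of a new path.

    Add-one smoothing means a never-seen transition scores 0.5,
    i.e. "no evidence either way". A path with no transitions
    also returns 0.5.
    """
    seen, bought = model
    pairs = transitions(path)
    if not pairs:
        return 0.5
    rates = [(bought[p] + 1) / (seen[p] + 2) for p in pairs]
    return sum(rates) / len(rates)


# Made-up historical sessions: (pages visited in order, clicked 'buy'?)
history = [
    (["home", "search", "product", "reviews"], True),
    (["home", "search", "product", "reviews"], True),
    (["home", "product", "reviews"], True),
    (["home", "search", "home"], False),
    (["home", "blog", "home"], False),
    (["home", "search", "blog"], False),
]

model = fit_path_model(history)
likely = buy_score(model, ["search", "product", "reviews"])
unlikely = buy_score(model, ["home", "blog"])
print(f"likely buyer score:   {likely:.3f}")
print(f"unlikely buyer score: {unlikely:.3f}")
```

The discovery step is what tells us that the route through product pages and reviews is worth modeling at all. The model then only has to score new paths against that pattern. A real project would need far more data, a held-out evaluation, and probably a richer model.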
Finding Insights without modeling/prediction:
It is a science project. You wowed the audience with what you found. Not quite useful after the announcement.
Modeling/prediction without a deep Insight element:
The business context is diluted. Great, you can predict things, but you cannot quite explain to the stakeholders why that is interesting in the first place.
How much time is typically spent on each?
Discovery in a pure sense is often finding something by chance. Discovery in the day-to-day data science world is really about finding the key drivers. Power to the data scientists who find the insights into a business problem and get to the first AHA moment. Typically it takes a few weeks with the right platform & talent. Try this link first.
Modeling and Prediction from Dev to Production is usually the long tail, because of algorithm fitting and the other things that go with that domain. It can take anywhere from a couple of weeks to a few months to automate the model and move to operationalization, depending on data quality and similar factors. Many options are available here as well!