Data Science - Find the Statistical Model or Follow the Algorithms?

Learn Data Science
Teradata Employee

There are two broad approaches to a data science problem, and a lot of blog posts and white papers discuss them. This is purely my perspective as a data practitioner.

Data science problems can be solved either by Statistical Modeling or by an Algorithmic approach. Both have common mathematical underpinnings, and the so-called algorithmic approach uses a form of non-linear statistics, so they are really like cousins ...

All models are wrong, but some are useful - George Box

Find the Statistical Model:

Statistical models, when done correctly, can explain all of the data without looking at every single row in your big data environment. We can do hypothesis tests and fit the data to a model ... One needs, however, a sampling strategy that yields the model and parameters with the least error. With a statistical approach, you don't try to build models on all the data. The rationale is that if you sample enough times, the parameters should be fairly stable (law of large numbers etc.). So terabytes of big data are really not that different, for a statistical approach, from a tiny CSV file! The model could be explained with 1/100th of the data *IF* it can be fit!
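The "sample enough times and the parameters stay stable" idea can be demonstrated with a tiny sketch. This is an illustrative NumPy example on synthetic data (all numbers here are made up for the demo, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend this is "big data": a million rows drawn from a fixed process.
population = rng.normal(loc=50.0, scale=5.0, size=1_000_000)

# Estimate the mean from repeated small samples (1/100th of the data each).
estimates = [rng.choice(population, size=10_000).mean() for _ in range(20)]

# Law of large numbers: the sample estimates cluster tightly around the truth.
print(min(estimates), max(estimates))  # all very close to 50
```

Twenty samples, each only 1% of the data, all land within a fraction of a unit of the true mean - which is why a statistician rarely needs to touch every row.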

"Data has to belong to one of the many models that exist; in other words, fit the data to one of the models"

Follow the Algorithms (Recent):

Statistical modeling requires finesse. The premise of the algorithmic approach is that data has a mind of its own and cannot always be explained away by a statistical model. What if there is something inexplicable in the data? How about trying different algorithms, learning from the data, and building models that could have generated it?

"Instead of fitting the data to a statistical model, how about figuring out the bias, variance, separation, etc. from the data using algorithms"

Non-Linear/Non-Parametric/High-Dimensional Data: Data with non-linear or piece-wise characteristics, or sparse data like text, requires a lot more work to get a statistical model right. Instead, how about using techniques like machine learning to scan through the entire data without any assumptions or prior hypothesis? Let the algorithm create a model "for the data" instead of "for a model"!

Background on why we have many approaches today:

Statistical models have existed for 200+ years. It's an art and a strategy to get a perfect sample of the entire data and do the modeling on the sample, without looking at all the data. Stratified sampling techniques, estimates/errors, approximations, and measures were developed because it was computationally intractable, for a long time, to solve problems with "all data". Emphasis was placed on data collection strategies instead. Here, roughly, are the steps to build a statistical model:

  • Build a hypothesis
  • Find factors, control and confounder variables
  • Decide Sampling
  • Variable selection
  • Clean data and create additional variables
  • Run an initial model and refine it
  • Check and resolve data issues - outliers etc.,
  • Find errors and start again until error drops to acceptable levels. Worst case, start with a new hypothesis.
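The steps above can be sketched end to end in a few lines. This is a minimal illustration using only NumPy on synthetic data: the hypothesis, the sample fraction, and the noise level are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothesis: y depends linearly on x (plus noise). True process: y = 2x + 1.
x = rng.uniform(0, 10, size=5_000)                      # "all data"
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=5_000)

# Decide sampling: fit on a 10% simple random sample, not the full data.
idx = rng.choice(len(x), size=500, replace=False)
X = np.column_stack([np.ones(500), x[idx]])             # intercept + predictor

# Run the model (ordinary least squares) and check the error.
beta, residuals, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
intercept, slope = beta
rmse = np.sqrt(residuals[0] / 500)
print(intercept, slope, rmse)  # estimates close to 1 and 2
```

If the error were unacceptable, the loop in the last two bullets kicks in: clean the data, handle outliers, refit, or start over with a new hypothesis.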

However, in the last 6-7 years, we've had the luxury of hitting all the data at scale if we want to (there are some exceptions). Not only that, we can run different algorithms that 'learn' from the data without a prior hypothesis. The new types of data we are seeing in the big data era have all kinds of complexity: text, time series, graph, images, audio, large sequences (gene data), much of it highly sparse and collected without an upfront strategy. Here, roughly, are the steps with big data machine learning:

  • Do some basic data cleansing
  • Split the data into training and test sets (training can be 80% of the data)
  • Decide predictor and dependent variables (Features)
  • Optionally run PCA or other dimension reduction techniques to reduce the dimensions
  • Run the ML algorithm on the training data with some hyper-parameters to build a model
  • Use the model to predict on test set and measure accuracy
  • Iterate with data, features and hyper-parameters until desired accuracy is obtained.
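Those steps map almost one-to-one onto a few lines of scikit-learn. A hedged sketch on synthetic data (the dataset, the 80/20 split, the 10 PCA components, and the forest hyper-parameters are all arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for "all data": 20 features, 8 of them informative.
X, y = make_classification(n_samples=2_000, n_features=20,
                           n_informative=8, random_state=0)

# Split the data into training and test (80% training).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Optional dimension reduction, fit on training data only.
pca = PCA(n_components=10).fit(X_tr)
X_tr_r, X_te_r = pca.transform(X_tr), pca.transform(X_te)

# Run the ML algorithm with some hyper-parameters to build a model.
model = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)
model.fit(X_tr_r, y_tr)

# Use the model to predict on the test set and measure accuracy.
acc = accuracy_score(y_te, model.predict(X_te_r))
print(f"test accuracy: {acc:.3f}")
```

The last bullet is the outer loop: change the features, the number of components, or the hyper-parameters and rerun until the accuracy is acceptable.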

Under-fitting vs Over-fitting:

The goal of an analytical model is to make decisions on future data with good accuracy. Both statistical models and algorithmic approaches can under-fit or over-fit if not carefully designed. Regularization (penalizing model complexity) is a common theme for both, and each approach has built-in techniques to get the best accuracy!
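A concrete way to see regularization at work: ridge regression adds a penalty term to plain least squares, shrinking the coefficients and taming over-fitting. A minimal NumPy sketch (the dataset sizes and the penalty value 10.0 are arbitrary demo choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small noisy dataset with many features: a classic over-fitting setup.
n, p = 30, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.0]          # only 3 features actually matter
y = X @ true_beta + rng.normal(0, 1.0, size=n)

def ridge(X, y, lam):
    # Closed-form ridge: (X'X + lam*I)^-1 X'y; lam=0 is plain least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, lam=0.0)    # no penalty: large, noisy coefficients
beta_reg = ridge(X, y, lam=10.0)   # penalized: smaller, more stable ones

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_reg))
```

The penalized coefficient vector is always smaller in norm; the same complexity-penalty idea shows up in the algorithmic world as tree depth limits, dropout, early stopping, and so on.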

Model Tuning vs Hyper parameter optimization:

To get a statistical model right, you need the right 'model parameters' to fit the data to one of the models. To get an algorithmic approach right, you need to tweak the hyper-parameters. Model tuning and hyper-parameter optimization each come with their own challenges. Both have techniques to tune so the predictive model is accurate on future input.
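Hyper-parameter optimization, at its simplest, is a grid search against a held-out validation set. A sketch in NumPy, reusing the closed-form ridge fit from above (the candidate grid and split sizes are assumptions for the demo, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data, split into train and validation sets.
X = rng.normal(size=(200, 10))
beta = rng.normal(size=10)
y = X @ beta + rng.normal(0, 0.5, size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Sweep the hyper-parameter; keep the value with the lowest validation error.
best_lam, best_err = None, np.inf
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    b = ridge_fit(X_tr, y_tr, lam)
    err = np.mean((X_va @ b - y_va) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam, best_err)
```

Real ML libraries wrap exactly this loop (plus cross validation and smarter search strategies) behind utilities like scikit-learn's GridSearchCV.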

When is Statistical sampling/Modeling better than Algorithmic approach ?

Sampling is an art and a strategy. Getting the perfect sample of your data is the holy grail, and it is certainly possible in a number of situations. If you know the nature of the data ahead of time (say, it exhibits a Normal, Log-Normal, or Poisson distribution), it's easier to find a model with parameters from a sample. Again, it's an art that works best for domain and statistics experts. Building a model from a sample is highly performant, since we are only looking at a fraction of the data; a data scientist can build the model on a desktop! Statistical models also provide relatively more explainability and are easier to communicate to the business ...
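For instance, if you believe the data is Log-Normal, two parameters estimated from a modest sample pin down the whole model. A NumPy sketch on synthetic data (the population size, sample size, and true parameters are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)

# "All data": a million values assumed to follow a Log-Normal distribution.
data = rng.lognormal(mean=1.0, sigma=0.25, size=1_000_000)

# A desktop-sized sample is enough to fit the model.
sample = rng.choice(data, size=5_000)

# Log-Normal fit: take logs, then estimate the underlying normal parameters.
mu_hat, sigma_hat = np.log(sample).mean(), np.log(sample).std()
print(mu_hat, sigma_hat)  # close to the population values 1.0 and 0.25
```

Half a percent of the rows recover the distribution's parameters almost exactly, which is the whole appeal of the sampling-first approach.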


When is Algorithmic approach better than building Statistical modeling ?

If the data is non-linear or piece-wise, has a huge number of dimensions, or is sparse, it may be better for models to be "learnt" from all the data.

Examples: text, graph, time series (especially with non-linear numeric values), genetic sequences, etc.

You can always do some sampling initially to see if you get consistent model parameters. But if the data is complex, we may want to resort to something like machine learning rather than fitting the data to some hypothesis. Building machine learning models on all the data is faster and easier with big data MPP platforms, so there is (most of the time) no computational issue. Machine learning models generally trade some explainability for more accuracy, a tradeoff that is well known, especially as we start using techniques like deep learning.


Hybrid Multi-Genre Approach (Post Modern):

Now, wouldn't that be nice? :)

Data scientists can try both approaches. It only takes a few minutes on some platforms. Kick off a statistical modeling exercise like logistic regression AND large-scale machine learning like random forest at the same time. The data scientist decides the winner by comparing both approaches. If there is an easy statistical fit for a model in the first few hours, use it; if not, go with the MPP machine learning approach and create a prediction model. Check out the Teradata Aster MPP platform, where you can do both - statistical modeling and machine learning with terabytes of data in minutes, if not hours. You can play with "all" text, graph, time series, images, etc., not just big relational tables with a bunch of numbers.
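The "race both genres" idea is easy to sketch on a single machine with scikit-learn (the dataset, split, and hyper-parameters below are arbitrary demo choices; on an MPP platform the same comparison would run over the full data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# A synthetic classification problem standing in for real business data.
X, y = make_classification(n_samples=3_000, n_features=15,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Kick off both "genres" on the same split and compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in candidates.items()}

winner = max(scores, key=scores.get)
print(scores, "->", winner)
```

Whichever model scores better on held-out data wins; if the simpler, more explainable model is close, that closeness is itself useful information for the business.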

The sample size for statistical modeling can range from 1,000 samples to millions of sample rows in an MPP big data environment! There are sampling functions available in the Teradata Aster platform as well to choose the size and technique.

Proof of Success with Cross Validation:

Both statistical and algorithmic approaches can use large-scale cross validation as well: split "all" your historical data into training and test sets, build your statistical AND algorithmic models on the training set, and use the test set for validation. Tune both and let the best approach win!
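Cross validation makes the contest fairer than a single split: every row gets a turn in the test fold. A scikit-learn sketch on synthetic data (models, fold count, and dataset are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=12,
                           n_informative=5, random_state=1)

# 5-fold CV: each model trains on 4 folds and is scored on the held-out 5th.
models = {
    "statistical (logistic)": LogisticRegression(max_iter=1_000),
    "algorithmic (forest)": RandomForestClassifier(n_estimators=100,
                                                   random_state=1),
}
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
print(results)
```

Averaging across folds smooths out the luck of any one split, so the winning approach is chosen on steadier evidence.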
