Data Scientist - Common Gotchas with Machine Learning/AI Models

The best minds from Teradata, our partners, and customers blog about relevant topics and features.
Teradata Employee

Let's start with the basics.


Models are only as good as your data ...


Models are very sensitive to data - the staple food of any information-driven organization. After all, the learning and training happen by looking at data. If your data has quality issues, or you don't have enough of it to cover all the relevant dimensions, your models will reflect that. Models are more sensitive to data than to the algorithms themselves, so we will start there:


1. Data & Feature Quality will affect both Model accuracy and consistency

The quality of the data we feed to models on a regular basis - from an EDW or data lake through ETL processes, etc. - determines how accurate and consistent the models are, no matter which algorithms we choose for training and prediction. For models to be useful, it's usually better to have slightly lower but consistent accuracy than the other way around. Having 95% prediction accuracy one day and dropping to 30% the next is probably not what you want in an operational environment.


What can affect data quality?

Gaps in data (such as missing data for a few days here and there) are the most common issue. Data can arrive late from upstream servers or devices, and software issues or buggy code can send you bad data without any ETL process catching it. A new transformation process, a code change, or a parameter change can create bad data - especially with numeric data, where it's hard to detect. Even a simple misunderstanding of taxonomy at the collection point can give your data a slightly different meaning than you think it has.


What about Feature Quality?

The whole idea of using data for ML models rests on the expectation that features can be extracted from the data to model the real world it came from. You may have a lot of big data, but if the feature quality is poor, the models will reflect that. For example, if you are building a multivariate model from multiple streams of data, it's important that each stream contributes unique information - that is, no feature collinearity. It's also critical that external factors affecting those features are somehow factored into the equation.
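As a quick sanity check for collinearity, you can look at pairwise correlations between candidate features before modeling. Here is a minimal sketch with NumPy; the features (`temp_c`, `temp_f`, `humidity`) and the 0.95 threshold are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two nearly collinear features plus one independent feature.
temp_c = rng.normal(20, 5, n)
temp_f = temp_c * 9 / 5 + 32 + rng.normal(0, 0.1, n)  # almost a copy of temp_c
humidity = rng.normal(50, 10, n)

X = np.column_stack([temp_c, temp_f, humidity])
corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds a threshold.
threshold = 0.95
pairs = [(i, j) for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) > threshold]
print(pairs)  # temp_c and temp_f (columns 0 and 1) are flagged
```

In practice you would drop or combine one feature from each flagged pair, or use a variance-inflation-factor style check for multi-way collinearity.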


2. Simple sophistication vs. Complex approaches


You don't always need to use the most sophisticated algorithms or the most complex data and features you've collected

If you have a lot of data like Google, Facebook, or Amazon does, it makes sense to use deep learning, etc., where the data comes with "low bias and high variance." Of course, for that accuracy you trade off computation cost, explainability, etc.


If you have 100K rows of data, it probably has low variance and high bias. You can use mainstream algorithms like logistic regression, random forests, support vector machines, or naive Bayes and get pretty good results. The results are also relatively easy to explain, and training runs a lot faster. Using deep learning here will require you to watch for overfitting and deal with a lot of extra trouble, and in the end you will probably get accuracy close to that of the mainstream algorithms.
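To illustrate the point, a simple model on modestly sized data already does well. A sketch with scikit-learn; the synthetic dataset, its size, and the linear signal are all made up for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Synthetic binary classification: label depends on a linear signal plus noise.
X = rng.normal(size=(n, 4))
y = (X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A plain logistic regression, no tuning, trains in milliseconds.
model = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The coefficients of the fitted model are also directly interpretable, which is part of the explainability advantage mentioned above.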


How about 10M-100M rows? Now you have options, and you can decide based on several factors. Logistic regression, random forests, SVMs, and naive Bayes would work just fine. Deep learning would probably do well too, though it would require some hyperparameter tuning, regularization, etc., to get right ...


3. Are you Overfitting? - check


This is a theme every data scientist knows. Don't train a model on your training set and then predict on that same set to get your accuracy numbers. Your first model will almost always overfit. It takes some iteration - cross-validation, regularization, etc. - to loosen the model so it generalizes well. Maybe try switching algorithms?
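Cross-validation is the standard way to catch this: score on held-out folds rather than on the data the model was trained on. A minimal sketch with scikit-learn, using deliberately unlearnable noise labels to make the gap obvious (the data is synthetic):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, 500)  # pure-noise labels: there is nothing to learn

tree = DecisionTreeClassifier(random_state=1)

# Training accuracy looks perfect -- a classic overfit on noise.
train_acc = tree.fit(X, y).score(X, y)

# Cross-validated accuracy shows the model generalizes no better than chance.
cv_acc = cross_val_score(tree, X, y, cv=5).mean()
print(f"train: {train_acc:.2f}  cross-val: {cv_acc:.2f}")
```

Whenever the training score is far above the cross-validated score, the model is memorizing rather than generalizing.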


4. Are the Classes balanced? - check


If you are doing classification and your data is skewed heavily toward a few classes, you have a class imbalance. Your model's accuracy will suffer at prediction time, leading to too many false positives and negatives. One option is to apply data augmentation or resampling, or to try different algorithms with different regularization methods, etc.
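A common first remedy is to resample so the classes are balanced before training. A sketch of upsampling the minority class with NumPy; the 95/5 split and array sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Imbalanced labels: 95% class 0, 5% class 1.
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 3))

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Upsample the minority class (with replacement) to match the majority count.
upsampled = rng.choice(minority, size=len(majority), replace=True)
idx = np.concatenate([majority, upsampled])
X_bal, y_bal = X[idx], y[idx]

print(np.bincount(y_bal))  # both classes now have 950 rows
```

Downsampling the majority class, or using an algorithm's built-in class weighting, are equally common alternatives; which works best depends on how much data you can afford to duplicate or discard.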


5. Are you Cursed with Dimensionality? - check


Modeling on a few columns or features is one thing. If you have 100+ features (columns, attributes), you will probably run into a phenomenon called 'the Curse of Dimensionality'. The data becomes sparse, and model accuracy can vary quite a bit under different circumstances. Techniques such as dimension reduction may be needed to avoid the issue.
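Principal component analysis (PCA) is the most common dimension-reduction technique. A sketch of PCA via SVD in plain NumPy; the sizes (100 features, 5 latent directions) and the 99% variance cutoff are chosen arbitrarily for the example:

```python
import numpy as np

rng = np.random.default_rng(3)

# 200 rows with 100 features, but only ~5 underlying directions of variation.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + rng.normal(0, 0.01, size=(200, 100))

# PCA: center the data, take the SVD, and keep just enough components
# to explain 99% of the variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(explained, 0.99)) + 1

X_reduced = Xc @ Vt[:k].T
print(f"reduced 100 features to {k} components")
```

When the data really does live near a low-dimensional subspace, as here, the reduced representation keeps almost all of the signal while giving the model far fewer, denser features to work with.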


6. How is the "inference" performance of the model? Does it match operational requirements?


It's great to build models with low false positives and negatives every single time we predict, but if scoring falls into some slow iterative mode in real time, that's not good. If you are scoring thousands of customers each second on your 'inference farm', it's wise to pick an algorithm that returns answers quickly for a given input. For most scenarios, the speed of decision-making is what matters in the end.
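It's worth benchmarking scoring latency directly rather than assuming it. A sketch comparing per-row scoring to vectorized batch scoring for a simple linear model; the weights, batch size, and feature count are all illustrative:

```python
import time

import numpy as np

rng = np.random.default_rng(5)
w = rng.normal(size=50)                # weights of an already-trained linear model
batch = rng.normal(size=(10_000, 50))  # 10k customers to score

# Per-row scoring: one dot product per customer, in a Python loop.
t0 = time.perf_counter()
slow = [float(row @ w) for row in batch]
per_row = time.perf_counter() - t0

# Vectorized scoring: one matrix-vector product for the whole batch.
t0 = time.perf_counter()
fast = batch @ w
batched = time.perf_counter() - t0

print(f"per-row: {per_row * 1e3:.1f} ms  batched: {batched * 1e3:.1f} ms")
```

The two approaches produce identical scores, but the batched version is typically orders of magnitude faster - the kind of difference that decides whether an inference farm meets its latency budget.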


7. Correlation or Curve Fitting does not imply Causation


Machine learning tries to capture the "correlational reality" found in data and predict a situation based on the model. It cannot know which variables cause which; it can only detect that the presence of certain variables coincides with the presence or absence of others. So if the underlying conditions that drive the observed variables change, the model's assumptions can go wrong, with a drop in accuracy.


Read my previous blog post on 'Correlation, Causation and Implication Rules' for more insights into this topic.


8. One size fits all vs. multiple models


Should you build one model for all data, or a model per customer or product? This question is overloaded and involves tradeoffs in accuracy, performance, data imbalance (some customers have more data than others), etc. It should be a very informed decision!


What else?


The gotchas above are just the tip of the iceberg. Beyond them, there is the whole DevOps integration pipeline, which requires a lot of production-level sophistication - continuous integration/deployment, etc. - to deploy and manage model lifecycles (to be covered in another blog post in the future!).


Thanks for reading!