Data Science - Machine Learning vs Rules Based Systems

Teradata Employee

A recent video deck on this topic from #TDPARTNERS17

Video Link : 1101

For the last 25 years, we've built operational models on traditional data warehouses and databases, and a lot of their logic comes from rules embedded in the code, of the type "if X then do Y, else if P then do Q". Rules are extremely easy to understand. They were developed by domain experts and consultants who translated their experience and best practices into code to automate decisions.

When a system gets operationalized, one starts with 100 scenarios and writes 100 rules to handle them. As time goes by, we encounter more and more exceptions and keep adding rules to keep those exceptions under control. What's wrong with this approach? Think of the tax code or any regulatory framework: rules management gets unwieldy and cumbersome over time, especially when the data changes faster than one can keep up with the rules!

There comes a point where nobody really knows, or can measure, how well the rules work or how many exceptions there are. This is the situation today with a lot of legacy operational systems. Whether it's marketing attribution, compliance, fraud detection, cybersecurity or alerting built on real-time events, rules based systems are dinosaurs in the big data world: the volume, velocity, complexity and variety of data make it nearly impossible for them to do well. A growing number of false positives and false negatives can wreak havoc on your operational systems and leave you with no useful, actionable results ...

What are the alternatives to rules ?

First, I want to make it clear that I'm not against rules based systems for any religious reasons. A rules based system will work perfectly "IF" you know all the situations under which decisions have to be made. In the past it was easy to determine what the rules were, because data was structured from the get-go, constrained, and limited in scope. Today we not only have more data than before, but with unstructured data like text and semi-structured data like JSON and XML arriving from different sources, rule writing is just not practical anymore. So we need to adopt solutions that help us 'discover' the rules from the poly-structured data. That's called Machine Learning.

Machine learning algorithms can actually help build rules on the fly "IF" we can show the algorithms good vs. bad examples in the data, or data that is already classified into classes A, B, C, D etc. Not only that, as more and more data arrives, it is easy to retrain the classifiers quickly and frequently, so these algorithms keep making decisions with the latest and greatest.
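To make 'discovering the rules' a bit more concrete, here is a minimal sketch in plain Python (not Aster code) of a tiny Naive Bayes text classifier that is trained from labeled examples and then retrained when new labeled data shows up. The example messages, the labels "fraud" and "ok", and the function names train_naive_bayes and predict are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()            # number of examples per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the most likely label for `text`, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior of the class
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data: nobody wrote a rule, we only showed good vs. bad.
labeled = [
    ("urgent wire transfer to offshore account", "fraud"),
    ("verify your password to claim prize", "fraud"),
    ("monthly salary deposit received", "ok"),
    ("grocery store purchase approved", "ok"),
]
model = train_naive_bayes(labeled)

# A new pattern appears; instead of writing new rules, add examples and retrain.
new_examples = [
    ("gift card purchase requested by caller", "fraud"),
    ("gift card codes read over phone", "fraud"),
]
retrained = train_naive_bayes(labeled + new_examples)

print(predict(model, "gift card purchase"))      # decision before retraining
print(predict(retrained, "gift card purchase"))  # decision after retraining
```

The point of the sketch is that the code never changes when a new scenario shows up. Only the labeled data does.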

Evaluation vs Scoring - Deterministic vs Probabilistic:

In a rules based system where you say "If there is a word X followed by a word Y within 10 words, then trigger this", the approach is deterministic. Rules typically yield a 'Yes' or a 'No'. If you miss a scenario, you get a false negative or a false positive, leading to confusion.
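As an illustration, the "word X followed by word Y within 10 words" rule could be sketched like this in Python. The function name proximity_rule, the simple tokenization and the sample sentence are assumptions for the example, not taken from any particular rules engine.

```python
import re

def proximity_rule(text, first, second, window=10):
    """Return True if `second` appears within `window` words after `first`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    first, second = first.lower(), second.lower()
    for i, word in enumerate(words):
        # Look only at the `window` words that follow this occurrence of `first`.
        if word == first and second in words[i + 1 : i + 1 + window]:
            return True
    return False

message = "Please wire the funds to the offshore account today"
print(proximity_rule(message, "wire", "offshore"))            # default window of 10
print(proximity_rule(message, "wire", "offshore", window=3))  # much tighter window
```

The answer is always a hard yes or no. A message that says the same thing in a different word order, or with the two words 11 apart, is silently missed until someone writes another rule.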

Compare that to a Machine Learning system, where there is a prediction score for each scenario based on how the model was trained. If the prediction score exceeds a certain threshold, an alert gets triggered. Depending on the algorithm, Machine Learning models can be built and predictions scored in a myriad of ways, but it's possible to make the process fully data driven. If the model is not taught correctly, or if there is a 'learning problem', you get false positives and negatives as well.
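The scoring side can be sketched in a few lines of Python as well. The scores below are made-up numbers standing in for the output of some trained model, and the names alert and confusion_counts are hypothetical. The sketch only shows how moving the threshold trades false positives against false negatives.

```python
def alert(score, threshold=0.8):
    """Trigger an alert when the prediction score exceeds the threshold."""
    return score > threshold

def confusion_counts(scored_events, threshold):
    """Count (false positives, false negatives) at a given threshold."""
    fp = sum(1 for score, is_bad in scored_events if alert(score, threshold) and not is_bad)
    fn = sum(1 for score, is_bad in scored_events if not alert(score, threshold) and is_bad)
    return fp, fn

# Made-up (score, truly_bad) pairs; a real system would get scores from its model.
scored_events = [
    (0.95, True), (0.85, True), (0.70, False), (0.65, True),
    (0.40, False), (0.30, False), (0.20, True), (0.05, False),
]

for threshold in (0.25, 0.5, 0.75):
    fp, fn = confusion_counts(scored_events, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Unlike a hard-coded rule, the threshold is a single knob that can be tuned against measured results.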

Choosing the right algorithms or combining them, and preparing the data so that ML works accurately, is becoming less and less of a black art each day.

Why is a Machine Learning System always better in the long run ?

Very simply put: convenience, scalability and low operational cost, especially as we start dealing with unstructured, structured and semi-structured data inputs. A Machine Learning system removes the manual task of classifying and tweaking rules each time. Fixing rules manually over time is like fixing bugs as the code base grows: the problem gets harder, like adding to a house of cards.

With Machine Learning, one can also measure effectiveness and improve it just by changing algorithms or algorithm parameters, guided by science. It's far easier to iterate and get good prediction results than with a rules based system.

What about Real-time alerting with machine learning?

This area is still under development and is getting operationalized slowly. Scoring machine learning models in real time is why streaming technologies are built into the Teradata UDA. While modeling is done as a batch process, scoring is done with edge processes that trigger alerts in real time. Today some machine learning algorithms have real-time scoring abilities, but most of them are too CPU intensive to work effectively in real time. In the future this should be less of a problem as CPUs get more powerful. More in my next blog.

How do I know that the machine learning algorithm is doing what it is supposed to ?

Just like code, any machine learning process in production requires a lot of 'cross validation'. In other words, the models need to be tested with automated tests against real-life 'ground truth', and verified to make sure they consistently keep false positives and negatives at an acceptable level on known data sets. There are even plots like "ROC" curves that one can use to see how well the ML models were performing yesterday, the previous week, the previous month etc.
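For readers who have not seen a ROC curve computed, here is a minimal sketch in plain Python. It sweeps the alert threshold over a set of scores with known ground truth, collects the (false positive rate, true positive rate) points, and sums up the area under the curve (AUC). The scores and labels are invented for the example, the function names roc_points and auc are hypothetical, and the sketch assumes the data contains at least one positive and one negative case.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) points for a threshold sweep.

    `labels` holds 1 for a true positive case and 0 for a negative case.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold step by step, from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), reverse=True)
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct score so tied scores are handled together.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented model scores and the matching ground truth (1 = real incident).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(f"AUC = {auc(points):.3f}")
```

An AUC of 1.0 means the scores separate the two classes perfectly, and 0.5 is no better than guessing. Computing this number on each day's or week's labeled data gives one way to track whether a model is drifting.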

What algorithms should I choose to replace my rules ?

There are machine learning algorithms for numeric data, text, and mixed data. Some algorithms return TRUE or FALSE, and some can emit multiple classification labels around which you can trigger different alerts. This is an area that requires data science expertise. Knowledge of the algorithms, their limitations and strengths, and statistics can come in handy. See the link at the end of the blog post, which has links to 10 min videos on running advanced algorithms quickly ... You can also learn about these algorithms in data science courses online.

When should I upgrade my Rules based systems to Machine Learning ?

Well, if there is a limited number of rules in the system, one is getting pretty low false positives and negatives, AND the data is highly structured, a machine learning system could be overkill. However, if you are spending an enormous amount on operational cost to maintain and run a rules based system on big data, it's time to start looking at Machine Learning based approaches ...

Where do I learn about Machine Learning so I can play with it ?

See this Aster Community portal link, which gets you started with some training videos, the history of machine learning etc., plus how to run algorithms like Naive Bayes, Support Vector Machines etc. on the Aster Platform. A lot of the training videos are under 10 minutes ...

http://community.teradata.com/community/learn-aster/machine-learning