Data Science - Using Ensemble Methods in Machine Learning

Learn Aster
Teradata Employee

Great things are done by a series of small things brought together. - Vincent Van Gogh

Ensemble techniques come out as winners in almost any machine learning project. Once a data scientist starts using ensembles, the results tend to be so compelling that he or she will probably never go back to trying just one or two favorite standalone algorithms. Almost all of the leaderboard winners of KDD or Kaggle predictive analytics contests use ensemble methods. In this blog post I want to review the idea and provide some examples of how to leverage it easily in a platform like Teradata Aster!

What are Ensemble Learning and Predictions?

Ensembles are machine learning methods that are used to improve other machine learning algorithms. They allow a data scientist to combine multiple machine learning methods and produce predictions that are significantly better than endlessly tweaking a single algorithm - with a lot less time spent on it too!

A really great example is Deep Learning and Deep Belief Networks, which are ensembles of neural networks and/or similar layered approaches. Multiple layers of learning beat a single-layer neural network - case in point! In addition to having multiple layers of the same technique, ensembles also combine the results of multiple algorithms (one of them could be Deep Learning) to create excellent predictions.

Why do Ensembles Work?

A machine learning algorithm is designed to generalize from the data and discover rules. It is based on certain assumptions about how the data is laid out and on some underlying mathematical hypothesis. In reality, however, data is often so sparse that it can fit multiple hypotheses equally well. So rather than deciding whether algorithm A, B, or C best fits a particular situation, an ensemble method tries to either aggregate the predictions or sometimes 'correct' the combined results!

Different Types of Ensembles:

There are three different types of ensembles that are used in practice.

  • Bagging (stands for Bootstrap Aggregation)
  • Boosting
  • Stacking

You can find quite a bit of content on all of this in Wikipedia. But here's a short explanation of what they do:

Bagging: Draw N bootstrap samples of the data and train N models, one per sample. Use some voting (or averaging) scheme to combine their predictions on new data.
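To make the idea concrete, here is a minimal bagging sketch in scikit-learn (purely illustrative toy data and parameters, not the Aster syntax): several decision trees are trained on bootstrap samples and their votes are combined.

    # Minimal bagging sketch: N decision trees on bootstrap samples, combined by majority vote.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                               bootstrap=True, random_state=0)  # 10 trees, one per bootstrap sample
    bagger.fit(X_train, y_train)
    print("bagged accuracy:", bagger.score(X_test, y_test))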

Boosting: Re-weight the underlying data so that the 'weak' examples (the ones that are not predicted well so far) get more importance. This is done by making trial predictions on the training data, identifying the weak examples, and training the next model to focus on them.
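An equally small boosting sketch (again illustrative scikit-learn, not the Aster function): AdaBoost re-weights the training rows that previous weak learners got wrong, so each new learner focuses on the hard cases.

    # Minimal boosting sketch: AdaBoost over decision stumps, re-weighting hard examples.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # weak learner: a stump
                                 n_estimators=50, random_state=0)
    booster.fit(X_train, y_train)
    print("boosted accuracy:", booster.score(X_test, y_test))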

Stacking: Cascade multiple machine learning techniques, so that each algorithm learns the weaknesses of the previous one and makes corrections.
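And a minimal stacking sketch (illustrative scikit-learn, with base learners chosen just for the example): two different base models feed their predictions into a second-level model that learns where each of them is weak.

    # Minimal stacking sketch: base learners' predictions become features for a meta-learner.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    stack = StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
        final_estimator=LogisticRegression())  # the meta-learner corrects the base models
    stack.fit(X_train, y_train)
    print("stacked accuracy:", stack.score(X_test, y_test))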

Examples of Ensembles (which you can use on Teradata Aster)

Ensembles are so popular that you don't have to hand-craft them every time; the popular ones are already built for you. The best part is that they exploit the parallelism of the platform naturally, since ensemble techniques are all divide-and-conquer strategies.

The RANDOMFOREST parallelized ensemble implementation is a bagging technique built on the underlying Decision Tree algorithm, and it produces much better results than a single Decision Tree.

ADABOOST is a parallelized, MPP ensemble boosting technique built on the single Decision Tree algorithm.

MULTILAYER PERCEPTRON is a stacking-style technique with multiple layers of neural network units.
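If you want to experiment with the same three building blocks outside Aster, rough open-source analogues look like this (an illustrative scikit-learn sketch on toy data; the Aster functions themselves are invoked through its SQL interface and are not shown here):

    # Rough open-source analogues of the three ensemble styles above (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data

    models = [("random forest (bagging)", RandomForestClassifier(n_estimators=200)),
              ("adaboost (boosting)", AdaBoostClassifier(n_estimators=100)),
              ("multilayer perceptron (layered)", MLPClassifier(hidden_layer_sizes=(64, 32),
                                                                max_iter=500))]
    for name, model in models:
        print(name, cross_val_score(model, X, y, cv=3).mean())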

If you'd like to build your own ensemble, you can very easily nest or combine multiple algorithms. The Multi Genre (TM) approach in Teradata Aster allows you to easily combine algorithms available across multiple frameworks like Map Reduce, Graph, etc., or in other words build ensembles really quickly with the SQL dialect. You can also use R.

My recent experience with a project: stacking is often the easiest way to go. If you have one week on a prediction project, it's better spent cascading a Random Forest algorithm on top of a Naive Bayes output than tweaking the hyperparameters of either Random Forest or Naive Bayes separately. It's easier to get a 30% boost with this ensemble technique than to get a 10% boost from tweaking one algorithm, feature selection, etc. - the reason why Deep Learning is so successful :).
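As a rough sketch of that cascade (illustrative scikit-learn on toy data, not the actual project code or results): a Naive Bayes base learner whose cross-validated outputs feed a Random Forest second stage, compared against Naive Bayes alone.

    # Hypothetical sketch of the cascade described above: Random Forest stacked on Naive Bayes.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)  # toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    nb_alone = GaussianNB().fit(X_train, y_train)
    cascade = StackingClassifier(
        estimators=[("nb", GaussianNB())],
        final_estimator=RandomForestClassifier(n_estimators=200, random_state=0),
        passthrough=True)  # the forest sees the raw features plus the Naive Bayes output
    cascade.fit(X_train, y_train)

    print("naive bayes alone:", nb_alone.score(X_test, y_test))
    print("rf stacked on nb :", cascade.score(X_test, y_test))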