Data Science - Use of Hyper-parameters in Predictive Analytics

In this blog post, I'd like to highlight the importance of hyper-parameters. Whether your data science project uses a single algorithm or an ensemble, the role of hyper-parameters should not be underestimated. Better knowledge of the hyper-parameters means more informed decision making, so that algorithms stay performant and accurate under different conditions. Mathematically speaking, we want our setup to be "more regularized" ...

What are Hyper-parameters, and how are they different from Model parameters?

When we train a predictive model on some data with a popular algorithm, the output is a model file or object. That model is very specific to the algorithm and to the distribution of the training data. The predictive algorithm then uses the model on a new set of data, either to classify it into labels or to produce a numeric estimate. The information inside the model is the algorithm-specific representation of the underlying data; these values are called the Model parameters. The expectation is that, after many trials on different training and test sets and rounds of feature selection, we will settle on stable Model parameters that can be used reliably for prediction.

Feature selection, dimension reduction & more data are used for Model parameter tuning
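As a quick illustration of what "Model parameters" means, here is a minimal R sketch on the built-in mtcars data set; the fitted coefficients are the learned parameters, and nothing about them is chosen by hand:

```r
# Train a simple linear model; the learned coefficients are Model parameters.
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)   # Model parameters: intercept and slopes, learned from the training data

# Scoring new data reuses those learned parameters:
predict(fit, newdata = data.frame(wt = 3.0, hp = 110))
```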

Hyper-parameters are different. Hyper-parameters are the 'configuration knobs' that we use to tweak the algorithm or data preparation. Hyper-parameters operate in the 'outer realm' of predictive modeling. Some examples of hyper-parameters:

  • The tree depth in a Decision Tree algorithm, or the number of trees in a tree ensemble such as a Random Forest
  • The number of neurons, the number of layers, and the learning rate that we use in a Deep Learning algorithm
  • If you are using the Maximum Entropy/KNN Text Classifier in Teradata Aster, the arguments ClassifierParameters, NLPParameters & FeatureSelection
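To make these knobs concrete, here is a minimal R sketch, assuming the randomForest and nnet packages are installed; the specific values of ntree, size, decay and maxit below are arbitrary choices for illustration, not recommendations:

```r
library(randomForest)
library(nnet)

# Number of trees is a hyper-parameter of the forest -- we pick it up front,
# it is not learned from the data:
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 200)

# Hidden-layer size, weight decay and the iteration budget are hyper-parameters
# of the neural network:
nn_fit <- nnet(Species ~ ., data = iris, size = 5, decay = 0.01, maxit = 200)
```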

One distinction to make in data preparation is that "feature selection" is not a hyper-parameter exercise. However, partitioning the data or doing aggregations on the feature data set would be considered tweaking hyper-parameters! For example, if you decide to roll up a week of data values into a single row, you have essentially made a decision to 'compress' the information; that is a hyper-parameter change. However, if you add 10 new columns to your data set, that is a model tuning exercise.
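Here is a small, hypothetical illustration of that roll-up decision in R; the daily_sales data frame and its columns are invented for the example:

```r
# Hypothetical daily data: four weeks of seven daily sales values each.
daily_sales <- data.frame(
  week  = rep(1:4, each = 7),
  day   = rep(1:7, times = 4),
  sales = runif(28, 100, 500)
)

# Rolling seven daily rows up into one weekly row compresses the information
# the model will ever see -- a hyper-parameter-level decision:
weekly_sales <- aggregate(sales ~ week, data = daily_sales, FUN = sum)

# Adding ten new derived columns (lags, ratios, etc.) to the same rows, by
# contrast, changes the feature set and belongs to model tuning.
```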

Configuration tweaks applied to an algorithm, or decisions to aggregate the input data in some way, go into Hyper-parameter tuning ...

Deciding the model training/prediction workflow is becoming more of a science; however, finalizing the "optimum" hyper-parameters is still all witchcraft/sorcery today ...

Are there tools to optimize for good Hyper-parameters?

Today, new techniques such as Grid Search and Bayesian optimization exist. However, it still remains an art form, because the number of parameter combinations you can try is huge for each input data set and accuracy target. Sometimes randomly searching for parameter combinations for the best accuracy can be as effective as an orderly search! See my earlier blog post on "Dealing with Hypothesis Space" for a practitioner's view of the problem in more detail ...
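A rough sketch of what random search over a hyper-parameter space looks like in R; the grid below and the evaluate_model() scoring step are invented placeholders for whatever training/validation loop you already run:

```r
# A tiny hypothetical search space: trees x depth x learning rate.
space <- expand.grid(
  ntree      = c(100, 300, 500, 1000),
  max_depth  = c(4, 6, 8, 12),
  learn_rate = c(0.01, 0.05, 0.1, 0.3)
)
nrow(space)   # 64 combinations, even in this tiny space

# Grid search would evaluate all 64; random search just samples a handful:
set.seed(42)
candidates <- space[sample(nrow(space), 10), ]
# scores <- apply(candidates, 1, evaluate_model)   # hypothetical scoring step
```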

As of today, you can use the Caret R package from CRAN, which provides up-to-date support for hyper-parameter optimization. Caret can run training/test iterations on your favorite algorithm to decide on optimum values using approaches such as grid search. If you have the Teradata Aster platform, today you can combine the Caret R package and the Teradata Aster R algorithm libraries in the same R script. While we are still far from reliably finding the best values for every situation, you can expect this space to attract more interest from researchers and practitioners in the near future ...
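To give a feel for it, here is a hedged example of a caret-driven grid search on the built-in iris data set, assuming the caret and randomForest packages are installed; the grid of mtry values is arbitrary:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = c(1, 2, 3, 4))         # hyper-parameter grid to search

set.seed(42)
fit <- train(Species ~ ., data = iris,
             method    = "rf",                    # random forest
             trControl = ctrl,
             tuneGrid  = grid)

fit$bestTune   # the mtry value that won the cross-validated search
```

The cross-validated search simply refits the model once per candidate value and keeps the winner, which is exactly the kind of brute-force iteration that grid search automates.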