In this blog, the well-known bias versus variance trade-off is discussed in terms of what it means in practice. While this is common lore for many, it is worth reviewing and keeping in mind as we go about our merry way modeling data for business value!
Ultimately, our goal as data scientists is to use machine learning algorithms to train a predictive model on observed historical data, and then use that model to help our customers improve their business. There are many choices of models that might be reasonable approximations to the real or true relationships/functions we seek to predict. It’s important to note, first, that the “real” function is not known. Second, as practitioners in the art of data science, we can only go as far as to characterize the degree to which our model is consistent with the observed data. Following Occam’s razor, the simplest models consistent with the data are preferred. Accordingly, measures of the goodness of a model, such as the Akaike information criterion, include a term that penalizes models with more parameters.
The prediction error for any machine learning algorithm can be broken down into three components: noise (the irreducible error), bias, and variance.
The decomposition formulas for these can be derived with classical statistical methods. The mathematics is necessarily different for the continuous output of a logistic regression model versus a discrete binary classification technique, such as a support vector machine. In practice, the real bias and variance error terms can’t be calculated because the actual underlying function is not known. However, it is instructive to investigate the bias and variance trade-off in machine learning to understand algorithm behavior and predictive performance.
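For squared-error loss, the standard form of this decomposition can be written as below. It assumes the data are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that $\hat{f}$ is the model fitted to a random training set. These symbols are introduced here for illustration and do not come from the original post.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The expectations are taken over both the noise and the choice of training set, which is why the first two terms cannot be computed when $f$ is unknown.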
The noise is the difference between the true underlying function and the data. For our discussion we will assume the true function is known to be a sine function. R can be used to generate data that follow this function plus an additional random component (i.e., the error).
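The post generates its data in R, but that code is not shown. As a rough equivalent, here is a minimal Python/NumPy sketch of the same idea. The sample size, the interval, the Gaussian noise level, and the function name `make_sine_data` are all assumptions made for illustration.

```python
import numpy as np

def make_sine_data(n_points=50, noise_sd=0.3, seed=0):
    """Return (x, y) with y = sin(x) + Gaussian noise (all settings assumed)."""
    rng = np.random.default_rng(seed)
    # Inputs spread over one full period of the sine function.
    x = np.sort(rng.uniform(0.0, 2.0 * np.pi, n_points))
    # The random "error" component added to the true function.
    noise = rng.normal(0.0, noise_sd, n_points)
    return x, np.sin(x) + noise

x, y = make_sine_data()
# In a simulation the true function is known, so the noise can be measured.
residual_sd = float(np.std(y - np.sin(x)))
print(f"{len(x)} points, empirical noise sd = {residual_sd:.2f}")
```

Because the true function is known here, the noise can be recovered directly, which is exactly what is not possible with real data.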
Bias refers to the simplifying assumptions made by the algorithm to make the target function easier to learn. Generally, parametric algorithms have a high bias, which makes them fast to train and easy to interpret at the cost of flexibility. As a result, they tend to have less predictive power on complex problems that fail to meet the assumptions of the model. Algorithms like the Aster GLM are considered high-bias. The plot below shows the use of the generalized linear model to fit the simulated sinusoidal data to polynomials up to 4th order. Notice that the 4th order curve in red already appears to do a reasonable job of modeling the data.
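To see the bias shrink as the polynomial order grows, the fits can be reproduced in a few lines. This sketch uses NumPy's `Polynomial.fit` as a stand-in for the Aster GLM function, and the data settings (60 points, noise standard deviation 0.3, fixed seed) are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Simulated sinusoidal data (settings are illustrative assumptions).
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 60))
y = np.sin(x) + rng.normal(0.0, 0.3, x.size)

# Least-squares polynomial fits of order 1 through 4.
train_rmse = {}
for degree in range(1, 5):
    fit = Polynomial.fit(x, y, deg=degree)
    train_rmse[degree] = float(np.sqrt(np.mean((fit(x) - y) ** 2)))
    print(f"degree {degree}: training RMSE = {train_rmse[degree]:.3f}")
```

The training error can only go down as the order increases, because each lower-order polynomial is a special case of the next one. That is why training error alone cannot be used to choose the order.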
Variance refers to the amount that the estimate of the target function changes when a different set of training data is used. In practice, we don’t want the model to change too much from one training dataset to another. A model that is stable in this sense is more likely to have captured the relationship between the inputs and the output variables rather than the quirks of one particular sample.
Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. In extreme cases, this is referred to as an “ill-posed” problem. Sometimes this is handled with an appropriate regularization method that constrains the model, accepting a little more bias in exchange for lower variance. Generally, nonparametric machine learning algorithms that have a lot of flexibility will exhibit high variance. For example, the output of the Aster decision tree function will show a significant dependence on the training data set. This may be even more pronounced if the trees are not pruned. Other relatively high-variance machine learning algorithms in Aster include k-Nearest Neighbors and Support Vector Machines.
The plots below compare the fitted curves for a number of initial training data sets to show the effect of variance. The graph on the top, for the 4th order polynomial, shows much less variance, i.e., less sensitivity of the resulting fit to the training data set. The graph on the bottom, for the 20th order polynomial, shows much more variance. In effect, the noise is being fit at the expense of general predictability. For a well-posed problem, the solution's behavior should vary continuously with the initial conditions. In the example here, the low-bias models (i.e., when the degree of the polynomial is large) begin to show the sensitivity to the initial data set that is indicative of an ill-posed problem.
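The comparison between the two polynomial orders can also be put into numbers. The sketch below refits each polynomial on many simulated training sets and measures how much the fitted curve moves from one set to the next. The helper name `fit_spread`, the number of training sets, and the noise level are assumptions, and NumPy is used in place of the original tooling.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_spread(degree, n_sets=30, n_points=40, noise_sd=0.3, seed=0):
    """Average, over a grid, of the across-dataset std of the fitted curve."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, n_points)   # fixed input locations
    grid = np.linspace(0.0, 2.0 * np.pi, 200)     # where curves are compared
    curves = []
    for _ in range(n_sets):
        # A fresh training set: same true function, new noise.
        y = np.sin(x) + rng.normal(0.0, noise_sd, n_points)
        curves.append(Polynomial.fit(x, y, deg=degree)(grid))
    # Std across training sets at each grid point, then averaged.
    return float(np.mean(np.std(curves, axis=0)))

spread_low = fit_spread(4)
spread_high = fit_spread(20)
print(f"4th order spread:  {spread_low:.3f}")
print(f"20th order spread: {spread_high:.3f}")
```

A larger spread means the fitted curve depends more heavily on which training set happened to be drawn, which is the variance described above.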
What does this mean for the art of data modeling?
All machine learning algorithms are subject to a trade-off between the error due to bias and the error due to variance. This includes the common Aster palette of functions, like generalized linear models, decision trees, support vector machines, etc. Asymptotically, the bias vanishes with model complexity, while the variance increases. The total error, i.e., the sum of these components, exhibits a minimum at the optimal model complexity, as illustrated in the graph below.
ANY ATTEMPT TO INCREASE THE MODEL COMPLEXITY
BEYOND THE "SWEET SPOT" IS OVER-FITTING!
Practicing data scientists will realize that feature selection and dimensionality reduction can help decrease variance by simplifying the model. Conversely, adding predictors tends to decrease bias at the expense of additional variance. In the context of machine learning, model selection involves hyper-parameter optimization to choose the best performing set of hyper-parameters for the given learning algorithm. More specifically, hyper-parameter optimization helps guard against over-fitting the data. The performance can be evaluated using cross-validation. As shown in the graph below, the cross-validation error hits a minimum near the 6th degree polynomial and then blows up as the degree becomes unreasonably large. This process is distinct from training the model, where the goal is to learn the parameters that best reconstruct the true underlying function for a given model complexity.
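A minimal sketch of this kind of model selection is shown below, treating the polynomial degree as the hyper-parameter and scoring each candidate with k-fold cross-validation. The helper `cv_rmse`, the number of folds, the degree range, and the simulated data are assumptions for illustration, so the degree it selects need not match the one in the graph.

```python
import numpy as np
from numpy.polynomial import Polynomial

def cv_rmse(x, y, degree, k=5, seed=0):
    """k-fold cross-validated RMSE for a polynomial of the given degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)
    squared_errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        # Fit on the training folds only; the domain is fixed so that
        # held-out points are mapped the same way as training points.
        fit = Polynomial.fit(x[train], y[train], deg=degree,
                             domain=[0.0, 2.0 * np.pi])
        squared_errors.append((fit(x[held_out]) - y[held_out]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))

# Simulated sinusoidal data (settings are illustrative assumptions).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0 * np.pi, 60)
y = np.sin(x) + rng.normal(0.0, 0.3, x.size)

scores = {d: cv_rmse(x, y, d) for d in range(1, 16)}
best_degree = min(scores, key=scores.get)
print(f"degree with lowest cross-validated RMSE: {best_degree}")
```

Unlike the training error, the cross-validated error is measured on points the fit has not seen, so it rises again once the extra flexibility is spent on fitting noise.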
In addition to model considerations, as the size of the training data set itself grows towards infinity, the model's bias will asymptotically approach zero (asymptotic consistency) and the model will have a variance that is as good as any other candidate model (asymptotic efficiency). In other words, with a larger training data set, as is typical in many big data problems, the variance tends to decrease. This means that an algorithm may have little variance (read: tendency to over-fit) when there are a million training points, but may show very significant variance with only a few hundred data points.
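The effect of training set size can be checked with a small simulation. The sketch below keeps the model fixed (a 10th order polynomial, chosen arbitrarily) and measures how much the fitted curve varies across training sets of increasing size. The helper name `spread_for_size`, the sizes, and the noise level are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

def spread_for_size(n_points, degree=10, n_sets=20, noise_sd=0.3, seed=0):
    """Mean across-dataset std of the fitted curve for a given sample size."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 2.0 * np.pi, n_points)
    grid = np.linspace(0.0, 2.0 * np.pi, 100)
    curves = []
    for _ in range(n_sets):
        # Same true function and model, new noise for each training set.
        y = np.sin(x) + rng.normal(0.0, noise_sd, n_points)
        curves.append(Polynomial.fit(x, y, deg=degree)(grid))
    return float(np.mean(np.std(curves, axis=0)))

spreads = {n: spread_for_size(n) for n in (50, 500, 5000)}
for n, s in spreads.items():
    print(f"n = {n:5d}: spread of fitted curves = {s:.4f}")
```

The same model that looks over-fit on a small sample becomes much more stable as the number of training points grows.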
The machine learning functions in Aster typically have tunable parameters that control the bias and variance. For example:
For cases where the dilemma is intractable, ensemble methods like the Aster RandomForest function can be tried. These are typically a last resort because they are computationally expensive and may not be practical for some big data problems. They work by averaging together many low-bias, high-variance models, taking advantage of the fact that averaging many independent samples from a probability distribution with a given mean and variance returns the same mean with a reduced variance.
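The averaging argument can be illustrated without any model at all. In the sketch below, each "model" is stood in for by an independent draw from a normal distribution. The mean, standard deviation, and ensemble size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0        # std of each individual estimate (assumed)
n_models = 25      # number of independent estimates averaged together
n_trials = 20000   # repetitions used to measure the variances

# Each row holds n_models independent draws with mean 1.0 and std sigma.
draws = rng.normal(loc=1.0, scale=sigma, size=(n_trials, n_models))

single_var = float(draws[:, 0].var())   # variance of one estimate alone
averaged = draws.mean(axis=1)           # the "ensemble" prediction
averaged_var = float(averaged.var())

print(f"mean of the averages:   {averaged.mean():.3f}")
print(f"variance, single draw:  {single_var:.3f}")
print(f"variance, average of {n_models}: {averaged_var:.3f}")
print(f"sigma**2 / n_models:    {sigma**2 / n_models:.3f}")
```

For independent draws the variance of the average is the single-draw variance divided by the number of draws. The trees in a real random forest are trained on overlapping data and are therefore correlated, so the reduction in practice is smaller than this idealized case suggests.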
Stay tuned for more on Aster ensemble methods in a coming blog…