The Underfitting and Overfitting Problems
Underfitting and overfitting are serious problems in model development (Ref-DD). Underfitting results from using limited data and/or an inadequate model structure to fit the data. Underfitted models fail to accurately replicate the data points in both the training and the testing datasets. Overfitting occurs when the model replicates the training data points with high accuracy but fails to replicate unseen data points (i.e., testing data points) with the same accuracy. In other words, the model captures the noise in the data rather than the true relationship between the dependent and independent variables.
Figure 4: Example of underfitting and overfitting of the training data
Figure 4 shows the model fitting for three different cases. In Case I, an underfitted model is obtained: the model structure is too simple or inflexible to learn important features of the dataset. Here, a linear model structure is assumed, and hence the model fails to capture the nonlinear relationship between the variables. In Case II, an adequate model is obtained, which reasonably fits the data and captures the general trend between the variables. Case III shows an example of an overfitted model. Adopting a complex model structure with too many parameters relative to the size of the dataset is a common cause of overfitting. In the extreme case, when the model has as many parameters as there are data points, it can simply memorize every point, resulting in an overfitted model. Thus, as mentioned above, excluding a portion of the data solely for testing the model is crucial for determining whether the model suffers from overfitting, as the sketch below illustrates.
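The behavior in Figure 4 can be reproduced with a short experiment: fit polynomial models of increasing degree to a small noisy dataset and compare training and testing errors. The following sketch uses scikit-learn; the sample size, noise level, and polynomial degrees are illustrative assumptions, not values from the text.

```python
# Sketch of the three cases in Figure 4 using polynomial regression
# on synthetic nonlinear data (degrees chosen for illustration only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)  # noisy nonlinear target

# Hold out a test set so overfitting can be detected on unseen points.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree, label in [(1, "Case I: underfit"),
                      (4, "Case II: adequate"),
                      (15, "Case III: overfit")]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{label}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

With these settings, the degree-1 model shows high error on both sets (Case I), while the degree-15 model shows a training error far below its testing error (Case III); only the held-out test set makes the latter gap visible.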
Based on the above discussion, one may recognize that the ultimate remedy to underfitting and overfitting is to select a model structure suitable to the problem under study while ensuring that an adequate amount of data is available. In addition, exploratory analysis of the data, as mentioned above, helps to avoid these problems through adequate feature engineering and by eliminating or consolidating features with limited observations, as sketched below.
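As one concrete form of that consolidation step, the short sketch below merges rarely observed categories of a feature into a single level before modeling, so the model does not fit parameters to a handful of points. The use of pandas, the column name, and the threshold are illustrative assumptions.

```python
# Hedged sketch: consolidate categories with too few observations
# into a single "other" level prior to modeling.
import pandas as pd

def consolidate_rare(df: pd.DataFrame, column: str, min_count: int = 10) -> pd.DataFrame:
    """Replace categories seen fewer than min_count times with 'other'."""
    counts = df[column].value_counts()
    rare = counts[counts < min_count].index
    df = df.copy()
    df.loc[df[column].isin(rare), column] = "other"
    return df

# Hypothetical usage (column name "material" is an assumption):
# df = consolidate_rare(df, "material", min_count=10)
```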
Nonetheless, several other techniques have been reported to deal with these problems. For example, ensemble learning is proposed to enhance the performance of models that suffer from underfitting or overfitting (Ref-EE). It works by combining predictions from multiple separate models; the idea is to obtain a strong learner by combining multiple weak learners. The early-stopping technique targets overfitting by halting the optimization procedure before it reaches its final convergence point (Ref-FF). Finally, the regularization technique artificially forces the model to be simpler (Ref-JJ). Pruning and dropout are examples of regularization in decision trees and neural networks, respectively. More discussion on model regularization is provided in Section 1.4.7.
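To make these three remedies concrete, the sketch below contrasts them on the same synthetic task using scikit-learn: a random forest as an ensemble of weak learners, gradient boosting with early stopping, and ridge regression as a regularized linear model. The estimators, dataset, and hyperparameters are illustrative choices, not the text's prescriptions.

```python
# Minimal sketch contrasting ensemble learning, early stopping, and
# regularization on one synthetic regression task (illustrative settings).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # Ensemble: averages many weak tree learners into a stronger one.
    "ensemble (random forest)": RandomForestRegressor(n_estimators=100, random_state=0),
    # Early stopping: n_iter_no_change halts boosting once the held-out
    # validation score stops improving, i.e., before final convergence.
    "early stopping (boosting)": GradientBoostingRegressor(
        n_iter_no_change=5, validation_fraction=0.2, random_state=0),
    # Regularization: alpha penalizes large coefficients, forcing a simpler model.
    "regularization (ridge)": Ridge(alpha=1.0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test R^2 = {model.score(X_test, y_test):.3f}")
```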