Model Validation Techniques
In most cases, the bias-variance trade-off of minimizing out-of-sample prediction error with models trained to minimize within-sample prediction error has no closed-form solution, so approximation techniques such as parameter grid search or random search are needed. In addition, data sampling can select a biased validation set that seriously undermines model validation. Model validation techniques such as the holdout method and k-fold cross-validation are often used to train more robust models that are less susceptible to overfitting.
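A minimal sketch of both search techniques is shown below, assuming a scikit-learn workflow; the estimator, parameter ranges, and synthetic data are illustrative assumptions, not prescriptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data standing in for a real modeling dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: exhaustively evaluates every parameter combination.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

# Random search: samples a fixed number of combinations at random,
# which scales better as the parameter space grows.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```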
Section 1.3.1, Model Training and Testing, discussed the holdout method with validation and test data sets. In summary, this method requires the modeler to “hold out,” or reserve, one portion of the data for validation and another for testing. The validation data is used to search for the model parameters that minimize out-of-sample validation error, while the test data is used to evaluate the model selected during validation.
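The sketch below illustrates the holdout workflow, again assuming scikit-learn; the 60/20/20 split and the regularization grid are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# First split: hold out 20% of the data as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: hold out 25% of the remainder (20% overall) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Select the regularization strength that minimizes validation error...
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# ...then evaluate the selected model once on the untouched test set.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```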
The holdout method is simple to conduct and addresses the needs of most models. However, it is susceptible to sample selection bias and may be difficult to replicate. K-fold cross-validation is a more robust model validation method that addresses these issues. As shown in Figure BWX2, the data is first divided into k folds, each serving as the test data in its respective iteration. In each iteration, that fold is used to validate the model, which is trained on the complementary set of training data. The results from the k iterations are averaged to produce a single estimate. Because k-fold cross-validation averages over k iterations of validation, the model is validated with information from the full dataset, making the result of model estimation replicable. In addition, since validation is performed iteratively across the k folds, out-of-sample error can be minimized without the risk of selecting a biased validation sample. For more advanced machine learning models with more than one hyperparameter, a nested version of k-fold cross-validation can be implemented, in which an inner loop of L folds is nested within each of the k outer folds, so the data is split across k × L train-validate combinations.
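Below is a minimal sketch of both the plain and nested procedures using scikit-learn; k = 5, L = 3, the SVC estimator, and the C grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Plain k-fold: average the k out-of-sample scores into one estimate.
outer = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(C=1.0), X, y, cv=outer)
print("k-fold estimate:", scores.mean())

# Nested k-fold: an inner L-fold loop tunes the hyperparameter inside
# each outer fold, so the outer score never sees the tuning data.
inner = KFold(n_splits=3, shuffle=True, random_state=1)
tuned = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=inner)
nested_scores = cross_val_score(tuned, X, y, cv=outer)
print("nested estimate:", nested_scores.mean())
```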
While k-fold cross-validation addresses most modeling needs, it does not address more advanced issues such as model drift or autocorrelation in time-series modeling; a sketch of one common alternative for time-ordered data follows below. It also cannot prevent model bias resulting from insufficient data. Since model structure and features are selected prior to model training and validation, cross-validation will not fix poor model performance due to incorrect model specification either, although it can often be applied to feature selection to inform model specification.
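For autocorrelated data, one common alternative to plain k-fold is a forward-chaining split in which each validation fold lies strictly after its training fold; the scikit-learn TimeSeriesSplit shown here is an illustrative choice, not the only option.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # observations in time order
tscv = TimeSeriesSplit(n_splits=4)

# Each validation fold follows its training fold in time, so the model
# is never validated on observations that precede its training data.
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "validate:", val_idx)
```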
Figure BWX2: k-fold cross-validation uses different partitions of the data as test data (also referred to as validation data or out-of-sample data) across k iterations of model validation before the model is tested on the final test data. (Scikit-learn developers, 2019)