Model Training and Testing
Model training, validation, and testing determine the optimal values of the model parameters so that the model can predict its output with high accuracy (Ref-AA). While this step is the core of any machine learning model development effort, experienced modelers tend to spend more time on several other important steps before jumping into it: exploratory analysis, data cleaning, and feature engineering. Properly conducting these steps is expected to reduce the time and effort involved in model training and validation.
The main objective of the exploratory analysis is to help the modeler get to know the dataset. A good exploratory data analysis enables the modeler to gauge the required data cleaning effort, identify important features to focus on, and spot missing data elements. In addition, it gives the modeler a feel for the dataset, which supports better presentation and communication of the results. Basic statistics and data visualization are the common tools for this analysis, including, for example, numerical distribution plots, segmentation plots, categorical distribution plots, and correlation matrices.
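As a minimal sketch of such an analysis, the snippet below computes basic statistics, a categorical distribution, and a correlation matrix with pandas. The column names and values are illustrative assumptions, not taken from the primer.

```python
import pandas as pd

# Hypothetical dataset for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 33],
    "income": [40_000, 55_000, 82_000, 90_000, 61_000, 48_000, 75_000, 58_000],
    "segment": ["A", "B", "A", "C", "B", "A", "C", "B"],
})

# Basic statistics: numerical distributions at a glance.
print(df.describe())

# Categorical distribution: observation counts per segment.
print(df["segment"].value_counts())

# Correlation matrix over the numerical features.
corr = df.select_dtypes("number").corr()
print(corr)
```

Distribution and segmentation plots would typically follow from the same objects (e.g., `df["age"].hist()` with matplotlib), but the printed summaries alone already reveal ranges, skew, and strongly correlated feature pairs.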
A data cleaning step is essential for almost every model development task. The effort associated with this step increases significantly if no rigorous quality assurance processes were followed during data collection and entry. Data cleaning involves removing duplicate and irrelevant observations, fixing structural errors (e.g., due to typos), filtering outliers after careful examination, and handling missing data where possible.
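The four cleaning operations listed above can be sketched as follows. The raw records, the outlier cutoff, and the median imputation are all illustrative assumptions chosen to exercise each step, not prescriptions from the primer.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data exhibiting the issues named above: duplicates,
# a typo/casing error, an implausible outlier, and a missing value.
raw = pd.DataFrame({
    "city":  ["Boston", "boston ", "Boston", "Chicago", "Chicago", "Denver"],
    "price": [300.0,    300.0,     300.0,    320.0,     9999.0,    np.nan],
})

# 1. Fix structural errors (stray whitespace, inconsistent casing).
raw["city"] = raw["city"].str.strip().str.title()

# 2. Remove duplicate observations (now detectable after step 1).
clean = raw.drop_duplicates()

# 3. Filter outliers after examination (a simple distance-from-median cutoff).
median = clean["price"].median()
mask = ((clean["price"] - median).abs() < 1000) | clean["price"].isna()
clean = clean[mask].copy()

# 4. Handle missing data (here, imputing with the median of what remains).
clean["price"] = clean["price"].fillna(clean["price"].median())
```

In practice the outlier threshold would come from careful examination of the distribution (e.g., an IQR rule), not a fixed constant as here.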
An important step performed as part of the exploratory data analysis is feature engineering. Its purpose is to focus on key information and thereby reduce the number of iterations needed to develop a good model. Deep domain knowledge is usually required to perform this step adequately. While an endless number of features could exist for any given problem, the goal is not to enumerate all of them but to identify the key ones. Feature engineering includes infusing domain knowledge by adding one or more indicators that better describe some of the observations, defining interactions among variables (association, sum, difference, product, etc.), and reducing dimensionality by combining sparse classes.
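The three kinds of engineered features named above (domain-knowledge indicators, variable interactions, and combined sparse classes) can be sketched as below. The dataset, the 100 m² threshold, and the sparsity cutoff are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical housing-style dataset.
df = pd.DataFrame({
    "length_m": [10.0, 12.0, 8.0, 15.0],
    "width_m":  [8.0, 9.0, 7.0, 10.0],
    "roof":     ["tile", "tile", "slate", "thatch"],
})

# 1. Interaction among variables: here a product of two measurements.
df["area_m2"] = df["length_m"] * df["width_m"]

# 2. Domain-knowledge indicator: flag observations above an assumed
#    "large lot" threshold of 100 m2.
df["is_large"] = df["area_m2"] > 100

# 3. Dimensionality reduction: merge sparse roof classes (fewer than
#    2 occurrences) into a single "other" category.
counts = df["roof"].value_counts()
sparse = counts[counts < 2].index
df["roof"] = df["roof"].where(~df["roof"].isin(sparse), "other")
```

Sums, differences, and ratios would be built the same way as the product in step 1; which interactions matter is exactly where domain knowledge enters.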
After completing the exploratory analysis, the next step is to select the ML algorithm expected to give the highest prediction performance. Machine learning models generally tackle three problem classes: regression, classification, and clustering. As mentioned earlier, the first two are solved using a supervised learning approach, while the third is solved using an unsupervised learning approach. This primer provides an overview of well-established machine learning algorithms widely adopted for these problem classes. These algorithms differ in their underlying theory and scope of application, and some perform better than others depending on the studied problem and the available dataset. Selecting the best algorithm therefore requires a good understanding of these algorithms and their suitability and applicability to different problems.
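In practice, candidate algorithms are often compared empirically on held-out data. The sketch below (synthetic data, numpy only) pits a trivial mean predictor against ordinary least squares on a regression problem and keeps whichever achieves the lower validation error; the data-generating process is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data with a roughly linear relationship.
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=200)

# Simple holdout split for the comparison.
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def mse(pred, true):
    return float(np.mean((pred - true) ** 2))

# Candidate 1: a constant (mean) baseline predictor.
mse_mean = mse(np.full_like(y_val, y_train.mean()), y_val)

# Candidate 2: ordinary least squares via numpy's least-squares solver.
A = np.c_[X_train, np.ones(len(X_train))]
w, b = np.linalg.lstsq(A, y_train, rcond=None)[0]
mse_lin = mse(X_val[:, 0] * w + b, y_val)

# Retain the algorithm with the lower validation error.
best = "linear" if mse_lin < mse_mean else "mean"
```

The same pattern scales to richer candidate sets (trees, neural networks, etc.); the point is that the comparison is driven by held-out performance, not theory alone.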
A machine learning model usually involves two types of parameters: model parameters and hyperparameters. The values of the model parameters are learned directly from the dataset used for model training. The values of the hyperparameters are mostly independent of the dataset; they are determined through sensitivity analysis and trial-and-error. Adjusting the hyperparameter values across these iterations is commonly referred to as model fine-tuning.
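The distinction can be made concrete with ridge regression: the weight vector is a model parameter fitted from the training data, while the regularization strength `alpha` is a hyperparameter tuned by trial and error against a validation set. The data and the candidate `alpha` grid below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 5 features, one of them uninformative, plus noise.
X = rng.normal(size=(120, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0.0, 0.5, size=120)

X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def fit_ridge(X, y, alpha):
    # The weights (model parameters) come from the training data;
    # alpha (a hyperparameter) is fixed before the fit.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

# Fine-tuning loop: try each candidate alpha, score on the validation set.
results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = fit_ridge(X_train, y_train, alpha)
    results[alpha] = float(np.mean((X_val @ w - y_val) ** 2))

best_alpha = min(results, key=results.get)
```

Each pass through the loop is one fine-tuning iteration: the model parameters are refitted from scratch, while only the hyperparameter changes.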
As with any other modeling approach, the size of the dataset required to develop a machine learning-based model depends on the problem under study. In general, problems with high variability in their predicted outcomes require large datasets to capture this variability. In addition, as shown in the figure below, developing the model requires splitting the available dataset into three mutually exclusive subsets for model training, validation, and testing. While the training subset is used to fit the model and obtain the optimal values of its parameters, the validation and testing subsets are used to estimate the model's prediction performance. This approach is known as the holdout method for model validation, and it is important for reducing model overfitting. A key distinction between the validation and testing processes is that the performance estimate obtained from validation can be biased, because its results are used to fine-tune the model's hyperparameters; this is not the case for testing. As such, the results of the testing process are the ones ultimately used to benchmark the prediction accuracy of the model and demonstrate its generalization ability.
Figure: The split of the dataset for model training, validation and testing
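A holdout split of this kind can be sketched in a few lines. The 70/15/15 proportions below are a common convention, not a prescription from the primer; shuffling before splitting guards against any ordering in the raw data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset of 1,000 observations, addressed by index.
n = 1000
indices = rng.permutation(n)

# Assumed 70 / 15 / 15 split into mutually exclusive subsets.
n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
```

Because the three index arrays partition the shuffled range, no observation can leak from training into validation or testing, which is what makes the test-set performance an honest estimate.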
Training a machine learning model requires minimizing an objective function (a.k.a. loss function) that describes the difference between the estimated and true values for a given data instance. The problem is usually formulated as an unconstrained minimization with a quadratic objective function that sums the squared differences between estimated and true values over all points in the dataset. Several efficient algorithms have been developed in the literature to solve this minimization problem and yield the optimal values of the model parameters, and they are implemented in most machine learning model development tools. As these algorithms are iterative and often heuristic-based, there is always a trade-off between finding the optimal solution and the execution time (number of iterations) required to obtain it. In most practical applications, modelers are satisfied with a near-optimal solution obtained in reasonable execution time; different algorithms, or different runs of the same algorithm, may therefore give slightly different answers to the same problem. One can view the choice of optimization algorithm as one additional hyperparameter to be determined. As such, the performances of the available optimizers are usually compared to select the best one.
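As a concrete instance, the sketch below minimizes the quadratic loss of a one-variable linear model with plain gradient descent. The synthetic data, learning rate, and iteration count are assumptions; the learning rate and iteration budget are precisely the kind of quality-versus-execution-time trade-off described above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 1-D regression problem: y ≈ 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=100)

# Quadratic objective: sum of squared differences between
# estimated values (w*x + b) and true values y.
def loss(w, b):
    return float(np.sum((w * x + b - y) ** 2))

# Plain gradient descent on the two model parameters w and b.
w, b, lr = 0.0, 0.0, 0.003
for _ in range(2000):
    resid = w * x + b - y            # estimation errors at this iterate
    w -= lr * 2.0 * np.sum(resid * x)  # d(loss)/dw
    b -= lr * 2.0 * np.sum(resid)      # d(loss)/db
```

Stopping after a fixed number of iterations yields a near-optimal rather than exact solution; a different learning rate, iteration budget, or optimizer (momentum, Adam, etc.) would land at a slightly different point, which is why the optimizer itself is treated as a hyperparameter.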