Evaluating the Model’s Prediction Accuracy
Numerous measures of performance are used to evaluate the prediction accuracy of machine learning models (Ref-BB). Consider the simple case of a typical binary classifier: Figure 2XX describes the contingency table and the common ratios used to evaluate such a classifier.
As shown in the figure, four key numbers are first determined: the number of true positives, the number of false positives, the number of false negatives, and the number of true negatives. A true positive means that the model classifies a positive data point as positive, while a false positive means that the model classifies a negative data point as positive. Similarly, a true negative means that the model classifies a negative data point as negative, while a false negative means that the model classifies a positive data point as negative. Several metrics are derived from these four numbers. Note that these metrics are ratios, which makes them independent of the dataset size. While each metric provides different insight into the model's prediction performance, we review a few of these metrics that are widely used in model evaluation exercises.
Figure: The contingency table and common ratios used for evaluating a classifier (Source: Wikipedia https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers)
Accuracy: It measures the closeness of the predicted values to their corresponding actual ones. It is calculated as the ratio of the number of cases in which the model made correct predictions to the total number of cases.
Precision: It measures how many of the cases the model predicts as positive are actually positive. It is calculated as the ratio of the number of true positive cases to the total number of cases the model predicts as positive.
Recall: It measures the model's ability to detect truly positive cases. It is calculated as the ratio of the number of true positive cases to the total number of actual positive cases.
Specificity: It measures the model's ability to detect truly negative cases. It is calculated as the ratio of the number of true negative cases to the total number of actual negative cases.
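As a concrete illustration of the definitions above, the following sketch counts the four contingency-table numbers and derives the ratios from them. The labels and predictions are illustrative values, and labels are assumed to be encoded as 0 (negative) and 1 (positive):

```python
def evaluate_classifier(y_true, y_pred):
    """Compute the contingency-table counts and the derived ratios."""
    # The four key numbers from the contingency table.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        # Correct predictions over all predictions.
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # True positives over all predicted positives.
        "precision": tp / (tp + fp),
        # True positives over all actual positives.
        "recall": tp / (tp + fn),
        # True negatives over all actual negatives.
        "specificity": tn / (tn + fp),
    }

# Illustrative example with six data points.
metrics = evaluate_classifier(y_true=[1, 1, 0, 0, 1, 0],
                              y_pred=[1, 0, 0, 1, 1, 0])
```

Because all four metrics are ratios of counts, the results are unchanged if the dataset is duplicated, which is the dataset-size independence noted above.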
Another important measure is the Receiver Operating Characteristic (ROC) plot (REF-CC). The ROC plot is used to examine how well a classifier can separate positive and negative cases. Figure 3XX gives a sketch of an ROC curve for a typical classifier. On the ROC plot, the y-axis is the true positive rate (TPR) and the x-axis is the false positive rate (FPR); the TPR equals the recall defined above, while the FPR equals one minus the specificity. The diagonal line on this plot represents the random classification scenario. An ROC curve that is well above this diagonal line indicates a model with strong classification ability.
Figure: A sketch of the ROC plot for a typical classifier
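One way to see how an ROC curve arises is to sweep a decision threshold over the classifier's predicted scores and record the (FPR, TPR) pair at each threshold. The sketch below assumes 0/1 labels and illustrative score values; it is a minimal tracing of the curve, not a full ROC implementation:

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) points obtained by sweeping a threshold."""
    points = []
    # Candidate thresholds: each distinct score, highest first, plus an
    # extreme threshold under which nothing is classified as positive.
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    pos = sum(y_true)
    neg = len(y_true) - pos
    for thr in thresholds:
        pred = [1 if s >= thr else 0 for s in y_score]
        tp = sum(1 for p, t in zip(pred, y_true) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(pred, y_true) if p == 1 and t == 0)
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

# Illustrative labels and scores for four data points.
pts = roc_points(y_true=[0, 0, 1, 1], y_score=[0.1, 0.4, 0.35, 0.8])
```

The curve always runs from (0, 0), where everything is classified as negative, to (1, 1), where everything is classified as positive; a classifier whose scores separate the classes well bows the intermediate points toward the upper-left corner, above the diagonal.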
Regression-based models tend to use measures such as the Mean Absolute Error and the Mean Squared Error. The Mean Absolute Error (MAE) is the average of the absolute differences between the true values and the predicted values. It measures how far the predictions are from the actual outputs. However, it does not provide information on the error direction (under-predicting or over-predicting the data). Mathematically, it is represented as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
The Mean Squared Error (MSE) is the average of the squared differences between the actual values and the predicted values. An advantage of the MSE is that its gradient is easy to compute, whereas the MAE is not differentiable at zero, which complicates gradient-based optimization. Because we take the square of the error instead of the absolute value, the effect of larger errors becomes more pronounced than that of smaller errors, which guides the model toward reducing these larger errors. The MSE is written mathematically as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ and $\hat{y}_i$ are the actual and predicted values, respectively, and $n$ is the number of evaluated data points.
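The two regression measures can be computed directly from their definitions. The values of `y` and `y_hat` below are illustrative actual and predicted values, not data from the text:

```python
def mae(y, y_hat):
    """Mean Absolute Error: average of |y_i - y_hat_i|."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    """Mean Squared Error: average of (y_i - y_hat_i)**2."""
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

# Illustrative actual and predicted values.
y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

print(mae(y, y_hat))  # 0.5
print(mse(y, y_hat))  # 0.375
```

Note how the single error of magnitude 1.0 contributes 1.0 to the MSE sum but only 1.0 to the MAE sum, while the two errors of magnitude 0.5 contribute 0.25 each to the MSE: squaring makes the larger error dominate, as discussed above.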