Predictive modeling with Machine Learning in R — Part 2 (Evaluation Metrics for Classification)
“Predicting the future isn’t magic, it’s artificial intelligence.” – Dave Waters
This is the second post in the series Predictive modeling with Machine Learning (ML) in R. For the first part, please refer to the link below.
Predictive modeling with Machine Learning in R — Part 1(Introduction)
Introduction
This post is a prelude to actual classification model development. Once we build a predictive model, it is critical to evaluate it to understand how good its predictions are and how we can improve them further.
So, let’s look at how to evaluate classification models. As seen in the picture above, there are quite a few metrics, such as Accuracy, AUC, Recall, Precision, etc., that can be used to evaluate a model. To explain any of these metrics, we first need to understand the following terms, using the potato-tomato picture below. In this illustration, we assume that potato corresponds to the positive class and tomato to the negative class.
- True positive (TP) — if we predicted the output to be a positive class and it’s actually positive. In the figure, we predicted the class to be potato and it’s indeed a potato.
- False positive (FP) — if we predicted the output to be a positive class but it’s actually a negative class. In the figure, we predicted the class to be potato but it’s a tomato. In statistics, this is also referred to as Type I error.
- False negative (FN) — if we predicted the output to be a negative class but it’s actually a positive class. In the figure, we predicted the class to be not potato, while it’s a potato. In statistics, this is also referred to as Type II error.
- True negative (TN) — if we predicted the output to be a negative class and it’s actually negative. In the figure, we predicted the class to be not potato and it’s indeed not a potato.
If the four outcomes above are laid out as a matrix of counts, the result is called a confusion matrix.
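As a minimal sketch, here is how a confusion matrix can be built in base R with table(); the potato/tomato labels and counts below are made up purely for illustration.

```r
# Toy example: actual vs. predicted labels ("potato" = positive class)
actual    <- factor(c("potato", "potato", "tomato", "tomato", "potato", "tomato"),
                    levels = c("potato", "tomato"))
predicted <- factor(c("potato", "tomato", "tomato", "potato", "potato", "tomato"),
                    levels = c("potato", "tomato"))

# A confusion matrix is simply a cross-tabulation of predictions vs. actuals
cm <- table(Predicted = predicted, Actual = actual)
cm
#           Actual
# Predicted potato tomato
#    potato      2      1   <- TP (2), FP (1)
#    tomato      1      2   <- FN (1), TN (2)
```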
Accuracy
Accuracy measures how often the classifier predicts correctly, whether positive or negative. It is defined as below.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
A higher accuracy does not necessarily mean a good-performing model. Consider a scenario where we have a dataset with 100 observations: 95 potatoes and 5 tomatoes. A model could simply predict potato 100% of the time and still achieve an accuracy of 95%, while doing no job at all of predicting tomatoes. This is why we should look beyond accuracy.
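The sketch below reproduces this scenario in R with made-up data: 95 potatoes, 5 tomatoes, and a naive model that always predicts the majority class.

```r
# Hypothetical, heavily imbalanced data: 95 potatoes, 5 tomatoes
actual    <- factor(c(rep("potato", 95), rep("tomato", 5)),
                    levels = c("potato", "tomato"))

# A naive "model" that always predicts the majority class
predicted <- factor(rep("potato", 100), levels = c("potato", "tomato"))

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy <- mean(predicted == actual)
accuracy   # 0.95 -- looks great, yet every single tomato is missed
```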
Precision
Precision measures how well the model performs on the cases it predicts as positive. Specifically, it measures what proportion of the predicted positives are actually positive. Precision matters most when the cost of a false positive is higher than the cost of a false negative. This metric is essential in fields like video/music recommendations, e-commerce websites, etc., where imprecise recommendations can push customers toward competitors’ products or services.
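A minimal sketch of precision computed directly from confusion-matrix counts (the numbers below are hypothetical):

```r
# Precision = TP / (TP + FP): how trustworthy the positive predictions are
TP <- 40   # predicted potato, actually potato
FP <- 10   # predicted potato, actually tomato
precision <- TP / (TP + FP)
precision  # 0.8 -- 80% of the predicted positives are truly positive
```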
Sensitivity
Sensitivity, also referred to as Recall, measures how many of the actual positive cases the model predicted correctly. It is used when the cost of a false negative is higher than that of a false positive. This metric is critical in the medical domain, where, in spite of false alarms, it is important that positive cases do not go undetected.
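Again with hypothetical counts, a minimal sketch of sensitivity (recall) in R:

```r
# Sensitivity (recall) = TP / (TP + FN): how many actual positives were caught
TP <- 40   # predicted potato, actually potato
FN <- 20   # predicted tomato, actually potato
sensitivity <- TP / (TP + FN)
sensitivity  # ~0.67 -- a third of the actual positives go undetected
```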
Specificity
Specificity is the counterpart of sensitivity for the negative class. In medical terms, it measures a model’s ability to correctly identify the patients who do not have the disease.
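A matching sketch for specificity, again with hypothetical counts:

```r
# Specificity = TN / (TN + FP): how many actual negatives were correctly identified
TN <- 70   # predicted tomato, actually tomato
FP <- 10   # predicted potato, actually tomato
specificity <- TN / (TN + FP)
specificity  # 0.875 -- 87.5% of the actual negatives are classified correctly
```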
F1 score
The F1 score is the harmonic mean of Precision and Recall. It’s very useful when false positives and false negatives are equally costly and when true negatives dominate the data (e.g., many more healthy individuals than patients).
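Putting the two together, here is a small sketch of the F1 score using the same hypothetical counts as above:

```r
# F1 = harmonic mean of precision and recall
TP <- 40; FP <- 10; FN <- 20
precision <- TP / (TP + FP)            # 0.8
recall    <- TP / (TP + FN)            # ~0.67
f1 <- 2 * precision * recall / (precision + recall)
f1  # ~0.73
```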
AUC-ROC
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (Y-axis) against the false positive rate (X-axis) at various classification thresholds. Higher values on the X-axis mean more false positives relative to true negatives, while higher values on the Y-axis mean more true positives relative to false negatives. The choice of threshold therefore depends on how well we can balance false positives against false negatives.
The Area Under the Curve (AUC) is a well-balanced metric to assess any model’s predictions. The higher the value of AUC (which ranges between 0 and 1), the better the model’s performance. A value of 1 means the model separates the two classes perfectly, while a value of 0.5 means it does no better than random guessing.
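As a sketch, the ROC curve and AUC can be computed with the pROC package (an assumption here; other ROC packages would work just as well), using made-up labels and predicted probabilities:

```r
# install.packages("pROC")  # if not already installed
library(pROC)

# Made-up actual labels (1 = positive class) and predicted probabilities
actual <- factor(c(1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
scores <- c(0.9, 0.8, 0.3, 0.6, 0.7, 0.2, 0.4, 0.1)

roc_obj <- roc(response = actual, predictor = scores, levels = c("0", "1"))
plot(roc_obj)   # ROC curve: TPR vs. FPR across thresholds
auc(roc_obj)    # area under the curve, between 0 and 1
```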
Conclusion
We have learnt various metrics for assessing model performance. Depending on the problem at hand, the balance of observations among the output classes, and the objective (reduce false positives, maximize true positives, etc.), we have to choose a metric or a combination of metrics to report our model’s performance.
Now, let’s get some hands-on experience with classification using machine learning algorithms in the next post.