Precision, recall, specificity, and related metrics are commonly used to evaluate classification models. They all derive from the confusion matrix, which summarizes the outcomes of a binary classification:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP), a “hit” | False Negative (FN), a “miss” |
| Actual Negative | False Positive (FP), a “false alarm” | True Negative (TN), a “correct rejection” |
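The four cells can be tallied directly from paired labels. A minimal sketch in plain Python, using made-up example labels (1 = positive, 0 = negative):

```python
# Hypothetical example data: true labels and a model's predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # hits
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # misses
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct rejections

print(tp, fn, fp, tn)  # 3 1 1 3
```

Every metric below is just a ratio of these four counts.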
Precision is also known as Positive Predictive Value.
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Precision answers the question: When the model predicts positive, how often is it correct?
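For instance, with hypothetical counts where 8 of 10 positive predictions were actually positive:

```python
# Precision = TP / (TP + FP); counts here are illustrative.
tp, fp = 8, 2
precision = tp / (tp + fp)
print(precision)  # 0.8
```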
Recall is also known as Sensitivity or True Positive Rate.
$$ \text{Recall} = \frac{TP}{TP + FN} $$
Recall answers the question: Out of all actual positives, how many did the model correctly identify?
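For instance, with hypothetical counts where the model found 6 of 10 actual positives:

```python
# Recall = TP / (TP + FN); counts here are illustrative.
tp, fn = 6, 4
recall = tp / (tp + fn)
print(recall)  # 0.6
```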
Specificity is also known as True Negative Rate.
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
Specificity answers the question: Out of all actual negatives, how many did the model correctly classify as negative?
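For instance, with hypothetical counts where 90 of 100 actual negatives were correctly rejected:

```python
# Specificity = TN / (TN + FP); counts here are illustrative.
tn, fp = 90, 10
specificity = tn / (tn + fp)
print(specificity)  # 0.9
```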
F1 Score is the harmonic mean of precision and recall, combining both into a single number that is high only when both are high.
$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
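The harmonic mean penalizes imbalance between the two: with a hypothetical precision of 1.0 and recall of 0.5, the arithmetic mean would be 0.75, but F1 is only about 0.67.

```python
# F1 = 2 * P * R / (P + R); values here are illustrative.
precision, recall = 1.0, 0.5
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6667
```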
Accuracy is the most straightforward metric, defined as the ratio of correctly predicted instances to the total instances.
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
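With hypothetical counts of 3 hits, 3 correct rejections, 1 false alarm, and 1 miss:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN); counts here are illustrative.
tp, tn, fp, fn = 3, 3, 1, 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```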
ROC (Receiver Operating Characteristic) curve is a graphical representation of a model’s performance across different thresholds. It plots True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at each classification threshold.
AUC (Area Under Curve) is the area under the ROC curve; it summarizes model performance across all thresholds. A higher AUC indicates a better classifier. It is often used when you want to evaluate the model’s performance without being tied to a specific threshold.
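A minimal sketch of both ideas in plain Python, using made-up scores and labels: sweep thresholds over the predicted scores, collect one (FPR, TPR) point per threshold, then integrate with the trapezoidal rule to get the AUC.

```python
# Hypothetical example data: true labels and the model's predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = sum(y_true)
neg = len(y_true) - pos

# One (FPR, TPR) point per threshold, starting from "predict everything
# negative" (the (0, 0) corner) down to "predict everything positive".
points = [(0.0, 0.0)]
for thr in sorted(set(scores), reverse=True):
    preds = [1 if s >= thr else 0 for s in scores]
    tpr = sum(p and t for p, t in zip(preds, y_true)) / pos       # recall
    fpr = sum(p and not t for p, t in zip(preds, y_true)) / neg   # 1 - specificity
    points.append((fpr, tpr))

# Trapezoidal integration of TPR over FPR gives the AUC.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(auc)  # 0.75
```

Library implementations (e.g. scikit-learn's `roc_curve` and `roc_auc_score`) do the same sweep more efficiently by sorting the scores once.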
The choice of metric depends on the specific problem and the consequences of false positives and false negatives. Here’s a summary:
Scenario | Key Metric |
---|---|
False positives are costly (e.g., spam filters, fraud detection) | Precision |
False negatives are costly (e.g., medical diagnosis, security threats) | Recall |
Need a balance between precision & recall | F1 Score |
Need to avoid unnecessary interventions (e.g., legal cases) | Specificity |
Balanced dataset, overall correctness matters | Accuracy |
Need threshold-independent evaluation | AUC-ROC |
Prepared by Jay / TDMDAL with litedown and ChatGPT.