Evaluating Binary Classification Models

A Summary of Key Metrics

1 Intro

Precision, recall, specificity, and a few related metrics are commonly used to evaluate classification models. They all derive from the confusion matrix, which summarizes the results of a binary classifier:

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | True Positive (TP), “hit” | False Negative (FN), “miss” |
| Actual Negative | False Positive (FP), “false alarm” | True Negative (TN), “correct rejection” |
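
As a quick illustration, the sketch below recovers these four cells from a pair of toy label vectors, assuming scikit-learn is available; the y_true and y_pred arrays are made up for this example.

```python
# Toy example: recover the four confusion-matrix cells with scikit-learn.
# The y_true / y_pred arrays are illustrative, not data from this document.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # actual classes (1 = positive, 0 = negative)
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # model predictions

# For binary labels {0, 1}, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")   # TP=3, FN=1, FP=2, TN=4
```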

2 Precision

Precision is also known as Positive Predictive Value.

$$ \text{Precision} = \frac{TP}{TP + FP} $$

Precision answers the question: When the model predicts positive, how often is it correct?
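
A minimal sketch, assuming scikit-learn is available and reusing the toy labels from the confusion-matrix example above:

```python
# Precision = TP / (TP + FP); toy counts and labels are illustrative.
from sklearn.metrics import precision_score

tp, fp = 3, 2
print(tp / (tp + fp))                   # 0.6, directly from the formula

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(precision_score(y_true, y_pred))  # 0.6, via scikit-learn
```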

3 Recall

Recall is also known as Sensitivity or True Positive Rate.

$$ \text{Recall} = \frac{TP}{TP + FN} $$

Recall answers the question: Out of all actual positives, how many did the model correctly identify?
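
The same kind of sketch for recall, again with the illustrative toy labels:

```python
# Recall = TP / (TP + FN); toy counts and labels are illustrative.
from sklearn.metrics import recall_score

tp, fn = 3, 1
print(tp / (tp + fn))                   # 0.75, directly from the formula

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(recall_score(y_true, y_pred))     # 0.75, via scikit-learn
```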

4 Specificity

Specificity is also known as True Negative Rate.

$$ \text{Specificity} = \frac{TN}{TN + FP} $$

Specificity answers the question: Out of all actual negatives, how many did the model correctly classify as negative?
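
scikit-learn has no dedicated specificity function, but specificity is simply recall computed on the negative class; the sketch below shows both the direct formula and that workaround, with the same toy labels:

```python
# Specificity = TN / (TN + FP). scikit-learn has no specificity_score,
# but it equals recall computed on the negative class (pos_label=0).
from sklearn.metrics import recall_score

tn, fp = 4, 2
print(tn / (tn + fp))                             # 0.666..., directly from the formula

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(recall_score(y_true, y_pred, pos_label=0))  # 0.666..., recall of the negative class
```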

5 F1 Score

The F1 Score is the harmonic mean of precision and recall, combining both into a single number.

$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
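
A minimal sketch of the formula alongside scikit-learn’s f1_score, using the same toy labels as before:

```python
# F1 = harmonic mean of precision and recall; toy values are illustrative.
from sklearn.metrics import f1_score

precision, recall = 0.6, 0.75
print(2 * precision * recall / (precision + recall))  # 0.666..., directly from the formula

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(f1_score(y_true, y_pred))                       # 0.666..., via scikit-learn
```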

6 Accuracy

Accuracy is the most straightforward metric, defined as the ratio of correctly predicted instances to the total number of instances.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
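
A minimal sketch with the same illustrative toy labels:

```python
# Accuracy = (TP + TN) / all predictions; toy counts and labels are illustrative.
from sklearn.metrics import accuracy_score

tp, tn, fp, fn = 3, 4, 2, 1
print((tp + tn) / (tp + tn + fp + fn))  # 0.7, directly from the formula

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(accuracy_score(y_true, y_pred))   # 0.7, via scikit-learn
```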

7 ROC Curve & AUC

The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model’s performance across different classification thresholds. It plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at each threshold.

[Figure: ROC curve, plotting True Positive Rate against False Positive Rate across thresholds]

AUC (Area Under the Curve) is the area under the ROC curve and summarizes model performance across different thresholds. A higher AUC indicates a better classifier. It is often used when you want to evaluate a model’s performance without being tied to a specific threshold.
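
A minimal sketch of both, assuming scikit-learn is available; the y_true labels and y_score probabilities below are made up for illustration:

```python
# ROC and AUC work on predicted scores/probabilities, not hard labels.
# y_true and y_score below are illustrative placeholders.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.65, 0.55, 0.3, 0.2, 0.1, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))              # ~0.917 for these toy scores
```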

8 Which Metric to Use?

The choice of metric depends on the specific problem and the consequences of false positives and false negatives. Here’s a summary:

| Scenario | Key Metric |
|----------|------------|
| False positives are costly (e.g., spam filters, fraud detection) | Precision |
| False negatives are costly (e.g., medical diagnosis, security threats) | Recall |
| Need a balance between precision & recall | F1 Score |
| Need to avoid unnecessary interventions (e.g., legal cases) | Specificity |
| Balanced dataset, overall correctness matters | Accuracy |
| Need threshold-independent evaluation | AUC-ROC |

Prepared by Jay / TDMDAL with litedown and ChatGPT.