Two complementary metrics for evaluating classification models. Precision measures the proportion of positive predictions that are correct; Recall measures the proportion of actual positives that the model successfully identifies.
In Depth
Precision and Recall are two of the most important metrics for evaluating classification models, particularly when the classes are imbalanced. Precision answers the question: 'Of all the items the model labeled as positive, how many actually were positive?' A spam filter with 95% precision means that 95% of the emails it flags as spam really are spam (and 5% are legitimate emails incorrectly flagged). Recall answers: 'Of all the items that actually were positive, how many did the model find?' A cancer screening test with 98% recall means it correctly identifies 98% of patients who actually have cancer (and misses the other 2%).
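As a minimal sketch, both metrics can be computed directly from counts of true positives, false positives, and false negatives: Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The labels and predictions below are hypothetical toy data, not drawn from any real model.

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correctness of positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # completeness of positive detection
    return precision, recall

# Hypothetical spam-filter labels: 1 = spam, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```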
Precision and Recall are inherently in tension — improving one typically reduces the other. A cancer screening system can achieve 100% recall (never missing a case) by flagging every patient as positive, but its precision would be terrible (most flagged patients would be healthy). Conversely, it could achieve near-perfect precision by only flagging the most obvious cases, but would miss many real cancers (low recall). The F1 Score — the harmonic mean of precision and recall — provides a single metric that balances both, but practitioners must still decide which matters more for their application.
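A short sketch of the F1 computation, using made-up precision and recall values for illustration. Because the harmonic mean is dominated by the smaller of the two numbers, a model that maximizes one metric while sacrificing the other scores poorly.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; 0 if both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A screener with perfect precision but 2% recall gets F1 ~= 0.04,
# far below the 0.51 an arithmetic mean would suggest.
print(f"{f1_score(1.0, 0.02):.3f}")   # 0.039
print(f"{f1_score(0.75, 0.75):.3f}")  # 0.750
```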
The choice between precision and recall depends on the costs of different errors. In cancer screening, missing a real case (low recall) is dangerous — recall should be prioritized. In email spam filtering, incorrectly blocking a legitimate email (low precision) is the bigger annoyance — precision may matter more. Understanding this tradeoff is fundamental to deploying classification models responsibly. The Precision-Recall curve, which plots precision against recall at different classification thresholds, is a standard tool for visualizing and optimizing this tradeoff.
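One way to trace this curve is to sweep the classification threshold over a model's predicted probabilities, for example with scikit-learn's precision_recall_curve (assuming scikit-learn is installed). The labels and scores below are hypothetical stand-ins for a real model's output.

```python
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_scores = [0.1, 0.2, 0.35, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold>={t:.2f}: precision={p:.2f} recall={r:.2f}")
# Lowering the threshold raises recall (fewer missed positives) at the cost
# of precision (more false alarms); raising it does the reverse.
```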
Precision measures correctness of positive predictions; Recall measures completeness of positive detection. Their tradeoff determines whether a model avoids false alarms or avoids missed cases.