A held-out dataset used exclusively to evaluate a trained model's performance on unseen examples — providing an unbiased estimate of how the model will perform in the real world.
In Depth
Test data is the final exam of machine learning. After a model is trained on training data and tuned using a validation set, it is evaluated exactly once on the test set — data it has never seen and which was not used in any training or tuning decision. The test set performance is the most honest estimate of how the model will perform on new, real-world data. Any other performance numbers — training accuracy, cross-validation score — are less reliable indicators of real-world behavior.
The critical rule is that test data must be completely isolated from the model development process. If test data is used to make any decision — choosing between model architectures, selecting hyperparameters, or deciding when to stop training — it is no longer a valid test set. This 'peeking' leads to optimistic performance estimates that don't hold in deployment. In practice, the data is typically split into three parts: training (model learning), validation (hyperparameter tuning and model selection), and test (final unbiased evaluation).
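The three-way split described above can be sketched in plain Python. The function name and the 60/20/20 proportions are illustrative choices, not a standard; libraries such as scikit-learn provide equivalent utilities.

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle and partition examples into train/validation/test lists.

    The test partition is carved off once and should never inform any
    training or tuning decision.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    items = list(data)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (items[:n_train],                    # training: model learning
            items[n_train:n_train + n_val],     # validation: tuning/selection
            items[n_train + n_val:])            # test: final evaluation only

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Because the split is done once up front, every later experiment sees the same partitions, which keeps the test set genuinely untouched across the development process.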
Data leakage — where information from the test set inadvertently influences training — is a common and dangerous error. It can occur through feature engineering applied before the train-test split, temporal leakage (using future data to predict the past), or repeated use of the same test set across many experiments, which gradually turns the test set into an implicit validation set. High-stakes AI applications in medicine, finance, and law require careful data governance to prevent leakage and ensure reported performance reflects real-world expectations.
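The feature-engineering form of leakage can be made concrete with a toy standardization step (the helper function and data values here are invented for illustration): fitting normalization statistics on all the data before splitting lets the test set shape the training features.

```python
def fit_standardizer(values):
    """Return (mean, std) computed from the given values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

# Toy feature column; the last value is held out as test data.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# WRONG: statistics computed over ALL data, so the test-set outlier
# shifts the training features — information has leaked.
leaky_mean, _ = fit_standardizer(data)

# RIGHT: statistics computed from training data alone, then applied
# unchanged to the test data.
clean_mean, _ = fit_standardizer(train)

print(leaky_mean, clean_mean)  # 22.0 vs 2.5 — the leak is visible
```

The fix generalizes: any fitted preprocessing (scaling, imputation, feature selection) must be fit on the training split only and then applied, frozen, to validation and test data.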
Test data is the ultimate reality check — the only honest estimate of how a model will perform on the world it has never seen. Its integrity must be protected absolutely throughout the development process.

