A learning paradigm that combines a small amount of labeled data with a large amount of unlabeled data during training, often achieving better performance than purely supervised or purely unsupervised approaches alone.
In Depth
Semi-supervised Learning addresses one of the most practical constraints in Machine Learning: labeled data is expensive, but unlabeled data is abundant. A hospital may have millions of scans but only thousands annotated by radiologists. A content platform may have billions of posts but only a fraction reviewed by human moderators. Semi-supervised Learning bridges this gap by extracting signal from both.
Common techniques include self-training and pseudo-labeling (the model iteratively labels its own high-confidence predictions and adds them to the training data) and consistency regularization (enforcing similar predictions for the same input under different augmentations or perturbations). These methods encourage the model to find smooth, consistent decision boundaries that respect the structure of the unlabeled data.
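As a concrete illustration, here is a minimal sketch of consistency regularization in PyTorch. It assumes a classifier `model` that returns logits and an `augment` function that produces a perturbed view of a batch; those names, the MSE penalty, and the loss weight are illustrative choices rather than a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, unlabeled_batch, augment):
    """Penalize disagreement between predictions on two augmented views."""
    view_a = augment(unlabeled_batch)
    view_b = augment(unlabeled_batch)
    probs_a = F.softmax(model(view_a), dim=-1)
    probs_b = F.softmax(model(view_b), dim=-1)
    # Mean squared difference between the two predictive distributions;
    # KL divergence is another common choice for the consistency term.
    return F.mse_loss(probs_a, probs_b)

def total_loss(model, labeled_x, labeled_y, unlabeled_x, augment, weight=1.0):
    # Supervised loss on the small labeled batch plus a weighted
    # consistency penalty computed on the unlabeled batch.
    supervised = F.cross_entropy(model(labeled_x), labeled_y)
    unsupervised = consistency_loss(model, unlabeled_x, augment)
    return supervised + weight * unsupervised
```

Self-training follows a similar spirit but converts confident predictions into hard labels, as described in the pseudo-labeling question below.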
Semi-supervised Learning has become especially important in Natural Language Processing and Computer Vision. Pre-training a large model on vast unlabeled text or image data — then fine-tuning on a small labeled dataset — is the dominant paradigm for building state-of-the-art systems. BERT, GPT, and most modern foundation models are, in essence, semi-supervised learners.
Semi-supervised Learning is the pragmatic compromise between the cost of labeling and the power of supervision — letting you achieve strong performance with only a fraction of the labeled data that fully supervised approaches require.
Real-World Applications
Frequently Asked Questions
Why use semi-supervised learning instead of supervised learning?
Because labeled data is expensive and time-consuming to create. A hospital might have millions of medical scans but only a few thousand labeled by radiologists. Semi-supervised learning leverages both the small labeled set and the large unlabeled pool, often achieving 90%+ of fully-supervised performance with just 1-10% of the labeled data.
How does pseudo-labeling work in semi-supervised learning?
The model is first trained on the small labeled dataset. It then predicts labels for the unlabeled data, and the predictions with the highest confidence (pseudo-labels) are added to the training set. The model is retrained on this expanded dataset. This process can be repeated iteratively, with each round adding more pseudo-labeled examples.
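A minimal sketch of that loop, assuming scikit-learn, NumPy feature arrays, and a LogisticRegression base classifier; the 0.95 confidence threshold and five rounds are illustrative values, not standard settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    X_train, y_train = X_labeled, y_labeled
    pool = X_unlabeled
    # Initial model trained only on the small labeled set.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():
            break
        # Keep only high-confidence predictions as pseudo-labels.
        pseudo_y = model.classes_[probs[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo_y])
        pool = pool[~confident]
        # Retrain on the expanded (labeled + pseudo-labeled) set.
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model
```

The threshold matters in practice: set it too low and early mistakes get reinforced in later rounds (confirmation bias), which is why implementations often keep it high or down-weight pseudo-labeled examples.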
Are foundation models like BERT and GPT semi-supervised?
Yes, in a broad sense. They are pre-trained on vast amounts of unlabeled text using self-supervised objectives (predicting masked words or next tokens), then fine-tuned on smaller labeled datasets for specific tasks. This pre-train-then-fine-tune paradigm is the dominant application of semi-supervised principles in modern AI.
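For the fine-tuning half of that recipe, a minimal sketch with the Hugging Face transformers library might look like the following; the checkpoint name, two-example dataset, and hyperparameters are placeholders for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pre-trained encoder + new classification head
)

# A small labeled dataset is all the fine-tuning step needs.
texts = ["the scan shows no abnormality", "urgent review required"]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss on the labeled pairs
outputs.loss.backward()
optimizer.step()
```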