Why synthetic data matters now
The scarcity of high-quality, task-specific training data has long been the bottleneck for fine-tuning. Collecting and labeling real data is slow, expensive, and often requires domain experts who have better things to do.
Synthetic data — training examples generated by LLMs themselves — has emerged as the practical solution. Nearly every major model released in the past year has been trained, at least in part, on synthetically generated data. The technique works. But the details of how you generate that data determine whether you end up with a better model or an expensive echo chamber.
The core risk: model collapse
Training a model on its own outputs — or on outputs from a model in the same family — creates a feedback loop. Each generation smooths out the distribution, loses tail cases, and converges toward a narrower range of outputs. After enough iterations, the model loses the ability to handle edge cases entirely.
This is not a theoretical risk. It has been documented in multiple research settings and shows up in production as a model that performs well on common inputs but fails unpredictably on anything unusual.
Designing a generation pipeline that works
Start with real seed data
The best synthetic data pipelines don't start from scratch. They start with a small set of high-quality, real-world examples — typically 50–200 — and use those as seeds for generation. The seeds define the target distribution: the topics, the difficulty levels, the edge cases you care about.
Use structured variation
Naive generation ("generate 1,000 examples like this one") produces homogeneous data. Instead, systematically vary the inputs along dimensions that matter for your task. For a customer support classifier, that might mean varying the tone (angry, confused, polite), the complexity (single issue, multiple issues), and the domain (billing, technical, account management).
The most effective approach we've seen uses a generation matrix: a grid of attributes that ensures coverage across the space of inputs you expect in production.
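A generation matrix can be as simple as the cross product of a few attribute lists. The sketch below uses the customer-support dimensions mentioned above; the dimension names and values are illustrative, not a real schema.

```python
from itertools import product

# Hypothetical attribute grid for a customer support classifier.
MATRIX = {
    "tone": ["angry", "confused", "polite"],
    "complexity": ["single issue", "multiple issues"],
    "domain": ["billing", "technical", "account management"],
}

def generation_specs(matrix):
    """Yield one prompt spec per cell of the attribute grid."""
    keys = list(matrix)
    for combo in product(*matrix.values()):
        yield dict(zip(keys, combo))

specs = list(generation_specs(MATRIX))
# 3 tones x 2 complexities x 3 domains = 18 cells to generate from
```

Each spec then gets interpolated into a generation prompt, which guarantees coverage of every cell rather than leaving the distribution to chance.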
Filter aggressively
Not all generated examples are useful. A good pipeline generates 3–5× more data than it keeps, filtering on criteria like correctness (does the output actually answer the input?), consistency (does it follow the format?), difficulty (is it too easy or too hard?), and deduplication (is it too similar to existing examples?).
The filtering step is where most of the quality comes from. Skimp on it and you're training on noise.
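Structurally, the filter is just a chain of predicates applied to each candidate, with most candidates expected to fail. The check functions below are placeholders for the criteria above (a real pipeline would call a judge model, a format validator, and so on); their names and logic are assumptions for illustration.

```python
def filter_examples(candidates, checks):
    """Keep only candidates that pass every check. With 3-5x
    overgeneration, discarding most candidates is the expected outcome."""
    return [ex for ex in candidates if all(check(ex) for check in checks)]

# Illustrative checks -- stand-ins for real correctness/consistency filters.
def is_well_formed(ex):
    # Consistency: both fields present and non-empty.
    return bool(ex.get("input")) and bool(ex.get("output"))

def is_not_trivial(ex):
    # Difficulty floor: reject one-word or near-empty outputs.
    return len(ex.get("output", "").split()) >= 5
```

Keeping each criterion as its own function also makes it easy to log *which* check rejected an example, which is useful when iterating on the generation prompts.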
Quality verification strategies
LLM-as-judge
Using a stronger model to evaluate the outputs of a generation model is now standard practice. The judge model scores each example on relevance, correctness, and adherence to instructions. Examples below a threshold are discarded.
This works well for most tasks but has a blind spot: the judge and the generator share similar biases. They'll agree on outputs that "sound right" even when they're subtly wrong. For high-stakes domains, supplement LLM-as-judge with human spot-checks.
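The thresholding step itself is simple. In the sketch below, `judge_score` stands in for a call to a stronger model that returns a rubric score (here assumed to be 0–10); the function name, scale, and threshold are assumptions, not any provider's API.

```python
def judge_filter(examples, judge_score, threshold=7.0):
    """Discard examples the judge scores below the threshold.

    `judge_score(ex)` is assumed to wrap a call to a stronger model
    that rates relevance, correctness, and instruction adherence.
    """
    scored = [(ex, judge_score(ex)) for ex in examples]
    return [ex for ex, score in scored if score >= threshold]
```

In practice you would also keep the scores around, both for auditing borderline cases and for the human spot-checks mentioned above.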
Consistency checks
Generate the same example multiple times and check whether the outputs agree. High variance across generations suggests the task is ambiguous or the prompt is under-specified — both signals that the data point should be revised or discarded.
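For tasks with discrete outputs (labels, short answers), agreement can be measured as the fraction of repeated generations that match the modal output. A minimal sketch, assuming the N generations for one input have already been collected:

```python
from collections import Counter

def consistency_check(outputs, min_agreement=0.8):
    """Return (passes, agreement) for N generations of the same input.

    Low agreement suggests the task is ambiguous or the prompt is
    under-specified; the 0.8 threshold is an illustrative default.
    """
    _, modal_count = Counter(outputs).most_common(1)[0]
    agreement = modal_count / len(outputs)
    return agreement >= min_agreement, agreement
```

For free-form outputs the same idea applies, but exact matching would be replaced by embedding similarity or a judge-model comparison.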
Held-out validation
Always keep a set of real, human-verified examples as a held-out validation set. Train on synthetic data, evaluate on real data. If the model scores much higher on a synthetic evaluation split than on the real held-out set, your generation pipeline has a distribution mismatch.
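The comparison itself is a one-liner once you have accuracy on both splits; the 5-point gap threshold below is an assumed default, not a universal rule.

```python
def accuracy(preds, labels):
    """Fraction of predictions matching the held-out labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def check_distribution_gap(synthetic_acc, real_acc, max_gap=0.05):
    """Flag a likely distribution mismatch when performance on a
    synthetic eval split exceeds real held-out performance by more
    than `max_gap` (threshold is illustrative)."""
    gap = synthetic_acc - real_acc
    return {"gap": round(gap, 3), "mismatch": gap > max_gap}
```

A flagged mismatch is a signal to revisit the seeds and the variation matrix, not to train longer.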
Common failure modes
- Sycophantic data — The generator produces outputs that are polished and confident but avoid taking positions or handling ambiguity. Models trained on this data become fluent but unhelpful.
- Format over substance — The examples look right structurally but contain shallow reasoning. The model learns the template, not the skill.
- Distribution shift — The synthetic data covers the cases you anticipated but misses the cases that actually occur in production. This is why seed data from real traffic is essential.
The goal of synthetic data is not to replace real data. It's to amplify the signal in the real data you already have.
A practical recipe
For teams getting started with synthetic data generation, here's a workflow that consistently produces good results:
- Collect 100–200 real examples that represent your target task well
- Define a variation matrix with 4–6 dimensions relevant to your domain
- Generate 5–10× your target dataset size using a frontier model
- Filter using LLM-as-judge plus automated quality checks
- Deduplicate using embedding similarity (threshold: ~0.92 cosine)
- Validate on a held-out set of real examples
- Iterate on the generation prompts based on where the model underperforms
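The deduplication step in the recipe can be sketched as a greedy pass over precomputed embeddings: keep an example only if it is below the similarity threshold against everything already kept. The embeddings are assumed to come from any sentence-embedding model; only the thresholding logic is shown.

```python
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.92):
    """Greedy dedup over an (n, d) embedding matrix.

    Returns the indices of kept examples; an example is dropped if its
    cosine similarity to any already-kept example meets the threshold.
    """
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

Greedy dedup is order-dependent, so it is worth sorting candidates by quality score first so the best duplicate in each cluster is the one that survives.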
This loop typically requires 2–3 iterations before the data quality is good enough for training. Budget your time accordingly.