
Synthetic Data for Model Training: A Practitioner's Playbook

Generating training data with LLMs is now mainstream. But the difference between useful synthetic data and expensive noise comes down to a handful of design decisions.

[Figure: pipeline diagram showing the synthetic data generation, filtering, and training stages]

Why synthetic data matters now

The scarcity of high-quality, task-specific training data has always been the bottleneck for fine-tuning. Collecting and labeling real data is slow, expensive, and often requires domain experts who have better things to do.

Synthetic data — training examples generated by LLMs themselves — has emerged as the practical solution. Nearly every major model released in the past year has been trained, at least in part, on synthetically generated data. The technique works. But the details of how you generate that data determine whether you end up with a better model or an expensive echo chamber.

The core risk: model collapse

Training a model on its own outputs — or on outputs from a model in the same family — creates a feedback loop. Each generation smooths out the distribution, loses tail cases, and converges toward a narrower range of outputs. After enough iterations, the model loses the ability to handle edge cases entirely.

This is not a theoretical risk. It has been documented in multiple research settings and shows up in production as a model that performs well on common inputs but fails unpredictably on anything unusual.

The diversity tax

Synthetic data tends to be less diverse than real data by default. Every generation pipeline needs explicit mechanisms to maintain diversity — or the resulting model will be confidently mediocre.

Designing a generation pipeline that works

Start with real seed data

The best synthetic data pipelines don't start from scratch. They start with a small set of high-quality, real-world examples — typically 50–200 — and use those as seeds for generation. The seeds define the target distribution: the topics, the difficulty levels, the edge cases you care about.

Use structured variation

Naive generation ("generate 1,000 examples like this one") produces homogeneous data. Instead, systematically vary the inputs along dimensions that matter for your task. For a customer support classifier, that might mean varying the tone (angry, confused, polite), the complexity (single issue, multiple issues), and the domain (billing, technical, account management).

The most effective approach we've seen uses a generation matrix: a grid of attributes that ensures coverage across the space of inputs you expect in production.
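The matrix idea can be sketched as a cross-product of attributes. The dimensions below are taken from the customer-support example above; the values and the prompt template are illustrative:

```python
from itertools import product

# Variation matrix for the customer-support example; values are illustrative.
MATRIX = {
    "tone": ["angry", "confused", "polite"],
    "complexity": ["single issue", "multiple issues"],
    "domain": ["billing", "technical", "account management"],
}

def generation_prompts(seed_example: str):
    """Yield one generation prompt per cell of the variation matrix."""
    keys = list(MATRIX)
    for combo in product(*MATRIX.values()):
        attrs = dict(zip(keys, combo))
        spec = ", ".join(f"{k}: {v}" for k, v in attrs.items())
        yield f"Rewrite this support ticket with {spec}:\n{seed_example}", attrs

prompts = list(generation_prompts("My invoice is wrong."))
# 3 tones x 2 complexities x 3 domains = 18 prompts per seed example
```

Each seed example fans out into one prompt per cell, which guarantees coverage of the grid instead of leaving it to chance.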

Filter aggressively

Not all generated examples are useful. A good pipeline generates 3–5× more data than it keeps, filtering on criteria like correctness (does the output actually answer the input?), consistency (does it follow the format?), difficulty (is it too easy or too hard?), and deduplication (is it too similar to existing examples?).

The filtering step is where most of the quality comes from. Skimp on it and you're training on noise.
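A minimal sketch of such a filter pipeline, assuming each example is a dict with `input` and `output` fields; the individual checks are placeholders for whatever task-specific correctness and difficulty logic you need:

```python
def follows_format(ex: dict) -> bool:
    # Consistency check: required fields present and non-empty.
    return bool(ex.get("input")) and bool(ex.get("output"))

def right_difficulty(ex: dict) -> bool:
    # Crude difficulty proxy: drop outputs that are trivially short or very long.
    n = len(ex["output"].split())
    return 5 <= n <= 300

def filter_examples(candidates, checks=(follows_format, right_difficulty)):
    """Keep only the candidates that pass every check."""
    return [ex for ex in candidates if all(check(ex) for check in checks)]

candidates = [
    {"input": "Why was I charged twice?",
     "output": "Duplicate charges usually come from a retried payment; check your statement for a pending reversal."},
    {"input": "Help", "output": ""},                 # fails the format check
    {"input": "Reset password?", "output": "Yes."},  # fails the difficulty check
]
kept = filter_examples(candidates)
```

Because the checks short-circuit, cheap structural checks should run before expensive ones like an LLM-as-judge call.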

Quality verification strategies

LLM-as-judge

Using a stronger model to evaluate the outputs of a generation model is now standard practice. The judge model scores each example on relevance, correctness, and adherence to instructions. Examples below a threshold are discarded.

This works well for most tasks but has a blind spot: the judge and the generator share similar biases. They'll agree on outputs that "sound right" even when they're subtly wrong. For high-stakes domains, supplement LLM-as-judge with human spot-checks.
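The judge step reduces to "score each example, keep those above a threshold." In this sketch `judge_score` is a stub standing in for a real API call to a stronger model, and the prompt and 1–5 scale are assumptions:

```python
JUDGE_PROMPT = (
    "Rate this training example from 1 to 5 on relevance, correctness, and "
    "adherence to instructions. Reply with a single integer.\n\n{example}"
)

def judge_score(example: str) -> int:
    # Stub: in a real pipeline, send JUDGE_PROMPT.format(example=example)
    # to the judge model and parse the integer it returns.
    return 5 if "answer" in example else 2

def keep_above_threshold(examples, threshold: int = 4):
    """Discard examples the judge scores below the threshold."""
    return [ex for ex in examples if judge_score(ex) >= threshold]

kept = keep_above_threshold(["Q: 2+2? answer: 4", "lorem ipsum"])
```

The threshold is a tuning knob: raise it for high-stakes domains and accept a lower yield from the generator.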

Consistency checks

Generate the same example multiple times and check whether the outputs agree. High variance across generations suggests the task is ambiguous or the prompt is under-specified — both signals that the data point should be revised or discarded.
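One cheap way to quantify agreement across repeated generations is the fraction that match the most common output. The sample data and the cutoff you would apply are illustrative:

```python
from collections import Counter

def agreement(outputs: list[str]) -> float:
    """Fraction of generations that match the most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

# Stand-in for sampling the generator 4 times with an identical prompt.
samples = ["Refund issued", "Refund issued", "Refund issued", "Escalate to agent"]
score = agreement(samples)  # 3 of 4 generations agree -> 0.75
```

For free-form text you would normalize or embed outputs before comparing; for classification-style outputs exact match, as here, is usually enough.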

Held-out validation

Always keep a set of real, human-verified examples as a held-out validation set. Train on synthetic data, evaluate on real data. If the gap between synthetic-data performance and real-data performance is large, your generation pipeline has a distribution mismatch.

Common failure modes

The goal of synthetic data is not to replace real data. It's to amplify the signal in the real data you already have.

A practical recipe

For teams getting started with synthetic data generation, here's a workflow that consistently produces good results:

  1. Collect 100–200 real examples that represent your target task well
  2. Define a variation matrix with 4–6 dimensions relevant to your domain
  3. Generate 5–10× your target dataset size using a frontier model
  4. Filter using LLM-as-judge plus automated quality checks
  5. Deduplicate using embedding similarity (threshold: ~0.92 cosine)
  6. Validate on a held-out set of real examples
  7. Iterate on the generation prompts based on where the model underperforms
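Step 5 can be sketched as a greedy pass that keeps an example only if it is dissimilar to everything already kept. The 0.92 cosine threshold matches the recipe; the toy 2-D vectors stand in for real sentence embeddings:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def dedupe(embeddings, threshold: float = 0.92):
    """Greedy dedup: keep index i only if its similarity to every
    already-kept embedding stays below the threshold."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy vectors: the second is a near-duplicate of the first and gets dropped.
vectors = [[1.0, 0.0], [0.999, 0.045], [0.0, 1.0]]
unique_indices = dedupe(vectors)
```

The greedy pass is O(n²) in the worst case; at larger scales an approximate nearest-neighbor index does the same job.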

This loop typically requires 2–3 iterations before the data quality is good enough for training. Budget your time accordingly.
