Technical Concepts · Beginner · Also known as: Training Epoch, Full Pass

Epoch

Definition

One complete pass through the entire training dataset during model training — a unit of training progress used to track how many times every training example has been seen by the model.

In Depth

An epoch is a fundamental unit of training measurement in machine learning. During one epoch, the model processes every training example exactly once — calculating predictions, computing the loss, and updating weights via backpropagation. After each epoch, the model has 'seen' the full dataset once, and its parameters have been updated multiple times (once per batch). Training typically requires many epochs before convergence — the point at which the loss stops meaningfully decreasing.
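
As a rough, minimal sketch (pure NumPy, with a toy linear model and synthetic data standing in for a real network), the outer loop below is one epoch; each inner step processes one batch and applies one gradient update:

```python
import numpy as np

# Hypothetical toy setup: linear regression trained with mini-batch gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                    # 1,000 training examples, 8 features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1000)

w = np.zeros(8)                                   # model parameters
lr, batch_size, num_epochs = 0.01, 32, 5

for epoch in range(num_epochs):                   # one epoch = one full pass over X
    perm = rng.permutation(len(X))                # reshuffle the examples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        preds = xb @ w                            # forward pass
        grad = 2 * xb.T @ (preds - yb) / len(idx) # gradient of the mean squared error
        w -= lr * grad                            # one update step (one iteration)
    loss = np.mean((X @ w - y) ** 2)
    print(f"epoch {epoch + 1}: training loss {loss:.4f}")
```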

The number of epochs to train for is a hyperparameter. Too few epochs and the model underfits — it hasn't learned enough from the data. Too many epochs and the model overfits — it has memorized the training data including noise, and performance on validation data degrades. Early stopping is the standard solution: monitor validation loss after each epoch and stop training when it begins to increase, keeping the model checkpoint that achieved the best validation performance.
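
A minimal sketch of that pattern, assuming stand-in train_one_epoch and validation_loss functions (the synthetic loss curve below simply mimics the typical fall-then-rise of overfitting); the loop keeps the best checkpoint and stops after `patience` epochs without improvement:

```python
import copy

def train_one_epoch(model):
    # Stand-in for a real training pass over the dataset (hypothetical).
    model["epochs_trained"] += 1

def validation_loss(model):
    # Synthetic U-shaped curve: loss falls, then rises as the model overfits.
    t = model["epochs_trained"]
    return (t - 12) ** 2 / 100 + 0.5

model = {"epochs_trained": 0}
best_loss, best_state, patience, bad_epochs = float("inf"), None, 3, 0

for epoch in range(1, 101):
    train_one_epoch(model)
    loss = validation_loss(model)
    if loss < best_loss:                  # improvement: keep this checkpoint
        best_loss, best_state, bad_epochs = loss, copy.deepcopy(model), 0
    else:                                 # no improvement this epoch
        bad_epochs += 1
        if bad_epochs >= patience:        # stop after `patience` bad epochs in a row
            print(f"early stop at epoch {epoch}; best val loss {best_loss:.3f}")
            break

model = best_state                        # restore the best checkpoint
```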

Epochs interact with batch size and learning rate. Training for 10 epochs with a batch size of 32 involves many more gradient update steps than training for 10 epochs with a batch size of 512, even though both see the same data the same number of times. Larger batches compute more stable gradient estimates but take fewer update steps per epoch. This interaction — along with learning rate scheduling — means that epoch count alone is insufficient to characterize a training run; it must be considered alongside the full training configuration.
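
A quick back-of-the-envelope comparison makes the point; the dataset size here is an assumed figure, while the epoch count and batch sizes follow the example above:

```python
import math

dataset_size, epochs = 1_000_000, 10      # assumed dataset size, 10 epochs as in the text

for batch_size in (32, 512):
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    total_steps = epochs * steps_per_epoch
    print(f"batch size {batch_size}: {steps_per_epoch:,} steps/epoch, "
          f"{total_steps:,} updates over {epochs} epochs")
```

Both runs process the same 10 million examples, but the batch-size-32 run takes roughly 16 times as many gradient updates.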

Key Takeaway

An epoch is how we measure a model's exposure to data — repeated full passes through training examples are how neural networks refine their internal representations from coarse pattern-matching to nuanced understanding.

Real-World Applications

01 Training early stopping: monitoring validation loss after each epoch to determine the optimal stopping point before overfitting.
02 Learning rate scheduling: reducing the learning rate at specific epochs (e.g., divide by 10 at epochs 30 and 60) to fine-tune convergence (see the sketch after this list).
03 Training cost estimation: multiplying epoch count by dataset size to estimate the total number of examples processed, and dividing by batch size to estimate the number of gradient updates.
04 Transfer learning fine-tuning: running only 1-5 epochs on task-specific data when fine-tuning pre-trained models to avoid overfitting.
05 Training monitoring: logging loss, accuracy, and other metrics per epoch to visualize learning curves and diagnose training issues.
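
For application 02 above, a minimal sketch of a stepped schedule; the base learning rate is an assumed value, while the milestone epochs (30 and 60) and the divide-by-10 factor follow the example in the list:

```python
def stepped_learning_rate(epoch, base_lr=0.1, milestones=(30, 60), factor=0.1):
    """Divide the learning rate by 10 at each milestone epoch (e.g., 30 and 60)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

for epoch in (0, 29, 30, 59, 60, 89):
    print(f"epoch {epoch:>2}: lr = {stepped_learning_rate(epoch):.4f}")
```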

Frequently Asked Questions

How many epochs should I train for?

There's no universal answer — it depends on dataset size, model complexity, and the task. Common practice: start with a generous number (50-100 epochs) and use early stopping to halt when validation performance stops improving. For LLM pre-training, models typically see each data point only 1-4 times (1-4 epochs). Overtrained models overfit; undertrained models underfit. Monitor validation loss to find the sweet spot.

What is the difference between an epoch, a batch, and an iteration?

An epoch is one complete pass through the entire training dataset. A batch is a subset of training examples processed together in one forward/backward pass. An iteration is the processing of one batch. If you have 10,000 samples and a batch size of 100, then 1 epoch = 100 iterations. These three concepts define the fundamental rhythm of model training.

Can more epochs always improve a model?

No. After a certain point, additional epochs lead to overfitting — the model memorizes training data and performs worse on new data. The learning curve (training vs. validation loss over epochs) typically shows validation loss decreasing then increasing. The optimal epoch count is where validation loss is minimized. Early stopping automates this by halting training when validation loss stops decreasing.