Reinforcement Learning from Human Feedback — a training technique that aligns language models with human preferences by using human judgments of model outputs as a reward signal, making models more helpful, harmless, and honest.
In Depth
RLHF is a three-stage training process that fine-tunes language models to better align with human intentions and values. Stage one: a base language model is pre-trained on vast text data using standard next-token prediction. Stage two: human evaluators rank multiple model responses to the same prompt — choosing which response is more helpful, accurate, and appropriate. These rankings are used to train a separate Reward Model that learns to predict human preferences. Stage three: the language model is further trained using reinforcement learning (specifically PPO — Proximal Policy Optimization) to maximize the reward model's scores, gradually shifting its outputs toward what humans prefer.
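To make stages two and three concrete, here is a minimal sketch of the two core pieces of math, assuming PyTorch is available: the pairwise loss commonly used to fit the reward model on human rankings, and the shaped reward that PPO maximizes, which combines the reward model's score with a KL penalty keeping the tuned model close to the original. Function and variable names here are illustrative, not taken from any particular library, and the KL term is the simple sample-based approximation used in many RLHF implementations.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Stage two: fit the reward model so human-preferred responses score higher.

    Standard pairwise (Bradley-Terry) loss:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  reference_logprobs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    """Stage three: the scalar reward PPO maximizes for each sampled response.

    The reward model's score is reduced by an approximate KL penalty so the
    tuned policy does not drift too far from the original reference model.
    """
    kl_penalty = kl_coef * (policy_logprobs - reference_logprobs).sum(dim=-1)
    return rm_score - kl_penalty


if __name__ == "__main__":
    # Toy reward-model scores for four chosen/rejected response pairs.
    chosen = torch.tensor([1.2, 0.4, 0.9, 2.0])
    rejected = torch.tensor([0.3, 0.5, -0.1, 1.1])
    print("pairwise loss:", pairwise_reward_loss(chosen, rejected).item())

    # Toy per-token log-probabilities (4 responses, 6 tokens each).
    policy_lp = -torch.rand(4, 6)
    reference_lp = -torch.rand(4, 6)
    print("shaped rewards:", shaped_reward(chosen, policy_lp, reference_lp))
```

The KL penalty is the design choice that prevents the policy from collapsing into reward-chasing gibberish: without it, maximizing the reward model's score alone tends to exploit its blind spots.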
RLHF was the key innovation that transformed GPT-3 (a capable but raw text predictor) into ChatGPT (a helpful, safe, conversational assistant). Before RLHF, language models would frequently generate toxic, harmful, or unhelpful content because they were simply mimicking patterns in internet text. RLHF allowed researchers to steer the model toward being helpful while avoiding harmful outputs — effectively training the model's 'judgment' about what constitutes a good response. The same technique, with variations, is used to train Claude, Gemini, and other modern AI assistants.
RLHF has limitations and remains an active area of research. Human preferences are subjective and inconsistent: different evaluators may disagree, and biases in the evaluation workforce can be baked into the model. The reward model can also be 'hacked', meaning the language model learns to produce responses that score well on the reward model without genuinely being better. Alternatives like Direct Preference Optimization (DPO) simplify the process by eliminating the separate reward model and training the policy directly on preference pairs, as sketched below. Constitutional AI (Anthropic's approach) uses AI-generated feedback guided by an explicit set of principles, reducing reliance on human labor while maintaining alignment.
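For comparison, here is a minimal sketch of the DPO objective, again in PyTorch. It assumes the summed log-probabilities of each chosen and rejected response under the trained policy and a frozen reference model have already been computed; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: train on preference pairs directly,
    with no separate reward model and no reinforcement-learning loop.
    """
    # Log-ratios measure how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the chosen response's log-ratio above the rejected one's;
    # beta controls how strongly deviations from the reference are penalized.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Because the preference signal is folded into a single supervised-style loss, DPO replaces the reward-model-plus-PPO pipeline with one training loop, which is the main reason it is popular as a simpler alternative.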
RLHF trains language models to align with human preferences by using human judgments as a reward signal — it is the technique that transforms raw language models into helpful, safe AI assistants.