RLHF trains a separate reward model on human preference data (humans rank pairs of model outputs), then uses reinforcement learning, typically PPO, to push the LLM toward outputs the reward model rates higher. It is the technique that turned GPT-3 into ChatGPT and is responsible for the helpful, polite default behaviour of modern frontier models.
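The reward model itself is usually fitted with a Bradley-Terry pairwise objective on those ranked pairs. Here is a minimal sketch in PyTorch; the function name, argument names, and shapes are illustrative assumptions, not from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_chosen: torch.Tensor,
                      rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward-model training (sketch).

    Each tensor holds the scalar reward the model assigns to the
    human-preferred (chosen) or dispreferred (rejected) output in a
    batch of preference pairs. Names here are hypothetical.
    """
    # Maximise the probability that the reward model ranks the
    # human-preferred output higher than the rejected one.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```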
RLHF is expensive and labour-intensive (it needs large teams of human annotators) and has known failure modes (reward hacking, sycophancy). As of 2026, DPO (Direct Preference Optimisation) and similar techniques are simpler alternatives that achieve comparable alignment with less infrastructure: DPO skips the separate reward model and the RL loop entirely, optimising the policy directly on the preference pairs, as the sketch below shows.
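The core of DPO is a single logistic loss on the log-probability ratios between the policy being trained and a frozen reference model (Rafailov et al., 2023). A minimal PyTorch sketch follows; argument names are illustrative, and each argument is assumed to be the summed log-probability a model assigns to a whole response.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimisation loss (illustrative sketch).

    Each tensor holds per-example summed log-probabilities of the
    chosen or rejected response under the trainable policy or the
    frozen reference model.
    """
    # Implicit reward: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss widens the margin between chosen and rejected,
    # replacing both the reward model and the RL loop of RLHF.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that the frozen reference model plays the role the KL penalty plays in RLHF: the beta-scaled log-ratio keeps the trained policy from drifting too far from its starting point.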