RLHF trains a separate reward model on human preference data (humans rank pairs of model outputs), then uses reinforcement learning, typically PPO, to push the LLM toward outputs the reward model rates higher. It is the technique that turned GPT-3 into ChatGPT and is responsible for the helpful, polite default behaviour of modern frontier models.
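The reward model itself is usually fitted with a Bradley-Terry pairwise objective on those ranked pairs. Here is a minimal sketch in PyTorch; the function name, argument names, and shapes are illustrative assumptions, not from any particular library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_chosen: torch.Tensor,
                      rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward-model training (sketch).

    Each tensor holds the scalar reward the model assigns to the
    human-preferred (chosen) or dispreferred (rejected) output in a
    batch of preference pairs. Names here are hypothetical.
    """
    # Maximise the probability that the reward model ranks the
    # human-preferred output higher than the rejected one.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```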
RLHF is expensive and labour-intensive (it needs large teams of human annotators) and has known failure modes (reward hacking, sycophancy). As of 2026, DPO (Direct Preference Optimisation) and similar techniques are simpler alternatives that achieve comparable alignment with less infrastructure: DPO skips the separate reward model and the RL loop entirely, optimising the policy directly on the preference pairs, as the sketch below shows.
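The core of DPO is a single logistic loss on the log-probability ratios between the policy being trained and a frozen reference model (Rafailov et al., 2023). A minimal PyTorch sketch follows; argument names are illustrative, and each argument is assumed to be the summed log-probability a model assigns to a whole response.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimisation loss (illustrative sketch).

    Each tensor holds per-example summed log-probabilities of the
    chosen or rejected response under the trainable policy or the
    frozen reference model.
    """
    # Implicit reward: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss widens the margin between chosen and rejected,
    # replacing both the reward model and the RL loop of RLHF.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that the frozen reference model plays the role the KL penalty plays in RLHF: the beta-scaled log-ratio keeps the trained policy from drifting too far from its starting point.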