RLHF
Reinforcement Learning from Human Feedback (RLHF) is a training phase used on LLMs after supervised fine-tuning to further improve LLM responses. In contrast to supervised fine-tuning, where a model learns to mimic responses for given prompts, in RLHF a model learns by generating a response and then receiving a score indicating how good that response is. Because using humans to assign response scores is expensive, scores are typically generated using an LLM reward model (RM) trained on human preference pairs (prompt, winning response, losing response). RLHF uses a reinforcement learning algorithm like proximal policy optimization (PPO) to update model parameters based on the scores from the RM.
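To make the two ingredients concrete, here is a minimal, self-contained toy sketch in PyTorch. It replaces a real LLM with a tiny bandit-style "policy" over a small vocabulary, and it simplifies PPO to a REINFORCE-style update with a KL penalty toward a frozen reference policy. All names (reward_net, policy_logits, etc.) are illustrative assumptions, not from any library or from the papers cited below.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 8  # toy "vocabulary" of possible one-token responses

# --- 1. Reward model trained on preference pairs (pairwise Bradley-Terry loss) ---
reward_net = torch.nn.Linear(vocab_size, 1)        # maps a one-hot "response" to a scalar score
rm_opt = torch.optim.Adam(reward_net.parameters(), lr=0.1)

winning = F.one_hot(torch.tensor([3, 3, 5]), vocab_size).float()  # preferred responses
losing = F.one_hot(torch.tensor([0, 1, 2]), vocab_size).float()   # rejected responses

for _ in range(100):
    r_win, r_lose = reward_net(winning), reward_net(losing)
    rm_loss = -F.logsigmoid(r_win - r_lose).mean()  # push winner scores above loser scores
    rm_opt.zero_grad()
    rm_loss.backward()
    rm_opt.step()

# --- 2. Policy update with the learned reward and a KL penalty (heavily simplified) ---
policy_logits = torch.zeros(vocab_size, requires_grad=True)
ref_logits = policy_logits.detach().clone()        # frozen reference policy
pi_opt = torch.optim.Adam([policy_logits], lr=0.1)
kl_coef = 0.1

for _ in range(200):
    dist = torch.distributions.Categorical(logits=policy_logits)
    action = dist.sample()                          # "generate" a response token
    logprob = dist.log_prob(action)
    with torch.no_grad():
        reward = reward_net(F.one_hot(action, vocab_size).float()).squeeze()
        ref_logprob = torch.distributions.Categorical(logits=ref_logits).log_prob(action)
    # KL-penalised reward keeps the policy close to the reference model
    advantage = reward - kl_coef * (logprob - ref_logprob).detach()
    loss = -advantage * logprob                     # REINFORCE-style surrogate loss
    pi_opt.zero_grad()
    loss.backward()
    pi_opt.step()

# The policy should now favour the high-reward tokens (here 3 or 5)
print("most likely response token:", policy_logits.argmax().item())
```

A real RLHF pipeline differs in scale and detail (sequence-level log-probabilities, a value function and clipped PPO objective, batched rollouts), but the loop is the same: generate, score with the RM, and update the policy while penalising divergence from the reference model.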
RLHF was one of the key innovations behind ChatGPT. Collecting high-quality preference data for training an RM can be very resource-intensive. RLHF is often used to improve the helpfulness and minimize the harmfulness of model responses.
Further reading
Learning to summarize with human feedback by Stiennon et al. and Training language models to follow instructions with human feedback by Ouyang et al. were among the first works to apply RLHF to LLMs.
RLHF: Reinforcement Learning from Human Feedback by Chip Huyen — This blog post provides an easy-to-follow explanation of RLHF and the intuition behind it. If you want to go a bit deeper, check out The Story of RLHF: Origins, Motivations, Techniques, and Modern Applications by Cameron Wolfe.
If you prefer videos, have a look at Reinforcement Learning with Human Feedback (RLHF) in 4 minutes by Sebastian Raschka.
Do you want to learn more NLP concepts?
Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:
Reach out to me:
Connect with me on LinkedIn
Read my technical blog on Medium
Or send me a message by responding to this post
Is there a concept you would like me to cover in a future issue? Let me know!