Proximal Policy Optimization
Proximal Policy Optimization is frequently used in Reinforcement Learning from Human Feedback to further train LLMs after supervised fine-tuning. It was used to train InstructGPT and ChatGPT.
Proximal Policy Optimization (PPO) is a reinforcement learning (RL) algorithm that is frequently used in Reinforcement Learning from Human Feedback (RLHF) to further train LLMs after supervised fine-tuning. In RLHF, an LLM output receives a numerical reward that represents human preference, and the goal is to maximize this reward. PPO uses an actor-critic setup, training two models simultaneously. The actor (also called the policy) is the LLM that generates tokens; the critic estimates the expected final reward at each token, and these estimates are used to compute the advantages that drive the policy gradient. PPO constrains the size of the weight updates to the policy LLM at each training step to prevent drastic and unpredictable changes in behavior.
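
To make the "constrained update" idea concrete, here is a minimal sketch of PPO's clipped surrogate loss in PyTorch. The tensor names (per-token log-probabilities under the current and old policy, and advantages derived from the critic's value estimates) are illustrative, not taken from any particular library:

```python
import torch

def ppo_clipped_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy
    # that originally generated the tokens.
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the element-wise minimum keeps the update conservative:
    # the policy gains nothing from pushing the ratio outside the
    # [1 - eps, 1 + eps] band, which limits how far a single training
    # step can move the policy.
    return -torch.min(unclipped, clipped).mean()
```

Minimizing this loss maximizes the reward only up to the clipping band, which is what keeps each weight update small.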

PPO is the standard algorithm for RLHF; it was used to fine-tune InstructGPT and ChatGPT. PPO does not by itself prevent reward hacking, where the policy achieves high rewards by learning to exploit imperfections in the reward model.
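
One common mitigation, used in InstructGPT-style RLHF, is to subtract a per-token KL penalty against a frozen reference model (typically the supervised fine-tuned LLM) from the reward, discouraging the policy from drifting into regions where the reward model is unreliable. The sketch below illustrates the idea; the function and tensor names are hypothetical:

```python
import torch

def kl_shaped_rewards(reward: torch.Tensor,
                      log_probs_policy: torch.Tensor,
                      log_probs_ref: torch.Tensor,
                      kl_coef: float = 0.1) -> torch.Tensor:
    """Spread a sequence-level reward over tokens and subtract a KL penalty.

    reward:            shape (batch,), reward-model score per response
    log_probs_policy:  shape (batch, seq), per-token log-probs under the policy
    log_probs_ref:     shape (batch, seq), per-token log-probs under the frozen
                       reference model
    """
    # Per-token KL estimate between the policy and the reference model;
    # detached because the shaped reward is treated as a constant when
    # computing the policy gradient.
    kl = (log_probs_policy - log_probs_ref).detach()
    per_token = -kl_coef * kl

    # Add the reward-model score at the final token of each response.
    per_token[:, -1] += reward
    return per_token
```

This softens, but does not eliminate, reward hacking: the policy can still find high-reward outputs that the reward model scores incorrectly.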
Further reading
This series of three posts on RL for LLMs by Cameron Wolfe does a fantastic job of explaining both the intuition and the math behind RL algorithms in an accessible way: Basics of Reinforcement Learning for LLMs, Policy Gradients: The Foundation of RLHF, and Proximal Policy Optimization (PPO): The Key to LLM Alignment.
The N Implementation Details of RLHF with PPO by Costa Huang et al. — This ICLR blog post walks through a Python implementation of RLHF with PPO.
Detoxifying a Language Model using PPO by Hugging Face shows how to use the TRL (Transformer Reinforcement Learning) library to apply PPO to reduce the toxicity of an LLM.
Do you want to learn more NLP concepts?
Each week I pick one core NLP concept and create a one-slide, one-minute explanation of the concept. To receive weekly new posts in your inbox, subscribe here:
Reach out to me:
Connect with me on LinkedIn
Read my technical blog on Medium
Or send me a message by responding to this post
Is there a concept you would like me to cover in a future issue? Let me know!