This blog post is my attempt to piece together a clear picture of Reinforcement Learning from Human Feedback (RLHF). I'll cover everything from its benefits to the mathematical foundations of learning algorithms for preference data. This is a derivative work, and I’ve cited all the resources I’ve used. Since RLHF is an active area of research, this will be a living document that I’ll update regularly.
Outline
- Overview
- Instruction Finetuning
- RL from Human Feedback
- Algorithms for Learning from Preference Data
- References
Overview
A language model (LM) pre-trained on the next-token-prediction task can be jerry-rigged [1] to perform multiple tasks. For example, if you want a quick recipe for brownies, you can get the LM to complete this prompt "Here is a step-by-step recipe for brownies which can be made in under 30 minutes:" and the LM may generate a good recipe, provided the most likely token sequence given your prompt happens to be one.
The most likely output token sequence of an LM does not always accomplish the task that a user has in mind. For example, if the prompt is "How much does the earth weigh?", the user expects the LM to return the Earth's weight, but the LM might generate "How much does the moon weigh?" instead. An aligned model will "follow the user's instructions helpfully and safely". [3]
Instruction Finetuning
To ensure that an LM's completions follow the user's instruction, a pre-trained LM is further finetuned in a supervised manner on (prompt, completion) pairs. Given a prompt written as an instruction (e.g. give me a quick brownie recipe), the LM should generate the corresponding completion (in this case, a quick brownie recipe). The objective optimized in pre-training and finetuning is the same - maximizing log-likelihood of the correct token at time \(t\), given tokens \(0, \cdots, t-1\).
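To make this shared objective concrete, here is a minimal PyTorch sketch of the next-token log-likelihood loss used in both pre-training and instruction finetuning (the `model` interface, which returns per-token logits, is a simplifying assumption and not any particular library's API):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Average negative log-likelihood of token t given tokens 0..t-1.

    token_ids: LongTensor of shape (batch, seq_len).
    `model(token_ids)` is assumed to return logits of shape
    (batch, seq_len, vocab_size).
    """
    logits = model(token_ids)[:, :-1, :]   # predictions for positions 1..T-1
    targets = token_ids[:, 1:]             # ground-truth tokens at positions 1..T-1
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```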
Maximizing log-likelihood penalizes each incorrect token equally during training. For example, in 'This chair is made of ___', each of 'wood' and 'iron' is a reasonable completion. However, if the training data heavily favors 'wood' and the model predicts 'iron' once and 'cotton' another time, the training objective will penalize both equally, even though 'iron' is a much more suitable completion than 'cotton'. We observe that the objective being optimized is not aligned with human preferences [1]. While collecting a large number of (prompt, completion) pairs can help capture a wide range of human preferences, scaling this process through manual annotation is costly. RLHF helps address these issues by:
- Optimizing for an objective that better reflects human preferences, rather than maximizing the log-likelihood of sequences from the pre-training or instruction-tuning corpus.
- Scaling the costly process of collecting annotated (prompt, completion) pairs. Instead of relying on a large dataset to capture human preferences, we model those preferences (using fewer datapoints) and use that model to guide the RLHF objective.
RL from Human Feedback [2]
Modeling Human Preferences
We want to infer the human reward function, without having humans write a large number of (prompt, completion) pairs. One option is to ask them to assign a score to a given completion, but this often results in disagreements between annotators. A better approach is to present two competing completions and ask which one they prefer, a method that leads to greater annotator agreement. While this approach does not directly provide the reward function (i.e., one that maps completions to numerical scores), we can infer it indirectly using models like the Bradley-Terry Model, which links human preferences to their underlying reward functions.
Under the Bradley-Terry Model, if the underlying human reward function is \(s\) (assume a single reward function shared by all of humanity), then the probability that a completion \(a\) is preferred over \(b\) is modeled as (here \(\sigma\) is the Sigmoid function): \[ p(a, b) = \sigma \left(s(a) - s(b)\right) \] We model \(s\) using a neural network \(r_{\phi}\) with parameters \(\phi\). Given a dataset of preference pairs \(\mathcal{D} = \{x^{(i)}, y^{(i)}_+, y^{(i)}_-\}\) where \(x^{(i)}\) are the prompts and \(y^{(i)}_+, y^{(i)}_-\) are the preferred/not-preferred completions, we can estimate \(\phi\) by maximizing the log-likelihood of \(\mathcal{D}\) under the Bradley-Terry Model: \[ \phi_{MLE} = \arg \max_\phi \sum_{i=1}^{|\mathcal{D}|} \log \left(\sigma \left(r_\phi(x^{(i)}, y^{(i)}_+) - r_\phi(x^{(i)}, y^{(i)}_-)\right)\right) \]
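Here is a minimal PyTorch sketch of this maximum-likelihood objective, written as a loss to minimize (the `reward_model` interface, which returns one scalar per (prompt, completion) sequence, is an assumption made for illustration):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Negative Bradley-Terry log-likelihood of a batch of preference pairs.

    chosen_ids / rejected_ids: token ids of the prompt concatenated with the
    preferred / non-preferred completion, shape (batch, seq_len).
    `reward_model` is assumed to return one scalar reward per sequence.
    """
    r_pos = reward_model(chosen_ids)     # r_phi(x, y+), shape (batch,)
    r_neg = reward_model(rejected_ids)   # r_phi(x, y-), shape (batch,)
    # Minimizing -log sigmoid(r+ - r-) maximizes the log-likelihood above.
    return -F.logsigmoid(r_pos - r_neg).mean()
```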
Earlier in the Instruction Finetuning section, we mentioned that the maximum likelihood objective does not always align with human preferences. Now that we have an explicit model for human preferences (our reward model \(r_\phi\)), we can optimize our LM to generate completions that receive high rewards (as measured by \(r_\phi\)). Formally, we want to maximize the expected reward of samples from our LM [1]. Let \(\pi_\theta\) represent our LM (with parameters \(\theta\)), then our objective is (here the expectation is over the distribution of sequences \(s\) represented by the LM \(\pi_\theta\)): \[ \max_\theta \mathbb{E}_{s \sim \pi_\theta} \left[r_\phi(s)\right] \] This optimization cannot be performed directly using gradient descent, as computing the gradient of this expectation with respect to \(\theta\) is challenging. We use policy gradient methods from RL to optimize this objective. In RL, a policy is a mapping from a "state space" \(S\) to an "action space" \(A\). An LM can be viewed as a policy that maps each sequence in the set of all possible token sequences \(S\) to the next token in that sequence (the set of all tokens is the action space \(A\)).
Policy Gradients: High Level Overview [1]
We want to compute this gradient (known as the "policy gradient"): \[ \begin{align} \nabla_\theta \mathbb{E}_{s \sim \pi_\theta}\left[r_\phi(s)\right] &= \nabla_\theta\sum_s \pi_\theta(s)r_\phi(s) \\ &= \sum_s r_\phi(s) \nabla_\theta \pi_\theta(s) \end{align} \] The expression above for the gradient of expected reward requires us to enumerate all possible trajectories \(s\) under our policy \(\pi_\theta\). If we could somehow transform this expression into an expectation under \(\pi_\theta\), we could compute a sampling-based approximation of that expectation (see Lecture 3[5]). To achieve that, we use the log-derivative trick, \(\nabla_\theta \log \left(\pi_\theta(s)\right) = \frac{1}{\pi_\theta(s)}\nabla_\theta \pi_\theta(s)\): \[ \begin{align} \nabla_\theta \mathbb{E}_{s \sim \pi_\theta}\left[r_\phi(s)\right] &= \sum_s r_\phi(s) \pi_\theta(s) \nabla_\theta \log \left(\pi_\theta(s)\right) \\ &= \mathbb{E}_{s \sim \pi_\theta} \left[r_\phi(s) \nabla_\theta \log \left(\pi_\theta(s)\right)\right] \\ &\approx \frac{1}{m}\sum_{i=1}^m r_\phi(s^{(i)}) \nabla_\theta \log \left(\pi_\theta(s^{(i)})\right) \end{align} \] In RL terminology, we say that to estimate the gradient, we sampled \(m\) "rollouts" (also called "episodes" or "trajectories") of our policy model.
Policy Training Prompts [4]: In practice, instead of computing policy gradients for entire sequences \(s\), we start with a set of prompts \(\mathcal{D}_\pi\). For each prompt \(x \in \mathcal{D}_\pi\), we sample a completion \(y\) from the policy model \(y \sim \pi_\theta(\cdot | x)\). We compute the reward \(r_\phi(x, y)\). Our objective looks like the following: \[ \nabla_\theta \mathbb{E}_{x \in \mathcal{D}_\pi, y \sim \pi(\cdot | x)} \left[r_\phi(x, y) \right] \approx \frac{1}{m}\sum_{i=1}^m r_\phi(x^{(i)}, y^{(i)}) \nabla_\theta \log \left(\pi_\theta(y^{(i)} | x^{(i)})\right) \]
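In an autodiff framework we rarely assemble this gradient by hand; instead we minimize a surrogate loss whose gradient matches the estimate above. Here is a minimal PyTorch sketch (the `policy.log_prob` and `reward_fn` interfaces are assumptions made for illustration):

```python
import torch

def policy_gradient_loss(policy, reward_fn, prompts, completions):
    """Surrogate loss whose gradient equals the sampling-based estimate above.

    `policy.log_prob(x, y)` is assumed to return log pi_theta(y | x) as a
    differentiable scalar; `reward_fn(x, y)` stands in for r_phi(x, y).
    Rewards are detached so gradients only flow through log pi_theta.
    """
    per_sample = []
    for x, y in zip(prompts, completions):
        logp = policy.log_prob(x, y)
        reward = reward_fn(x, y).detach()
        per_sample.append(-reward * logp)   # minimizing this maximizes reward
    return torch.stack(per_sample).mean()
```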
The RLHF Pipeline [1]
Using all the components and techniques discussed above, here is what an RLHF pipeline looks like:
- Start with a pre-trained and instruction-finetuned LM \(\pi_\theta^{ref}\). Clone this reference model's weights to create a copy \(\pi_\theta\). We'll update the weights of only the cloned model during preference tuning.
- Collect a human preference dataset \(\mathcal{D}\) and train the reward model \(r_\phi\) (this is usually[4] an LM with its language modeling head replaced with a regression head).
- Instead of using \(r_\phi\) directly, we'll modify it as follows: \[ r(x, y) = r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_\theta^{ref}(y|x)} \] This adjustment ensures that \(\pi_\theta\) (the LM which is learning from preference data) does not veer away too much from \(\pi_\theta^{ref}\). If \(\pi_\theta\) assigns a higher probability to a completion \(y|x\), relative to the original pre-trained and instruction finetuned LM, the reward \(r(x, y)\) will be adjusted lower.
- Finally, use policy gradient methods to maximize expected adjusted reward \(\mathbb{E}_{x \in \mathcal{D}_\pi, y \sim \pi_\theta(\cdot | x)} \left[r(x, y)\right]\) with respect to \(\theta\).
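Putting the pipeline steps together, here is a minimal PyTorch sketch of one preference-tuning step using the adjusted reward and the policy-gradient surrogate loss from above (the `generate`, `log_prob`, and `reward_model` interfaces are assumptions for illustration; real implementations work with batched, per-token tensors):

```python
import torch

def rlhf_step(policy, ref_policy, reward_model, optimizer, prompts, beta=0.1):
    """One policy-gradient step over a batch of prompts (sketch).

    Assumed interfaces:
      policy.generate(x)        -> sampled completion y  (a rollout)
      policy.log_prob(x, y)     -> log pi_theta(y | x)   (differentiable)
      ref_policy.log_prob(x, y) -> log pi_ref(y | x)
      reward_model(x, y)        -> scalar r_phi(x, y)
    """
    per_prompt_losses = []
    for x in prompts:
        with torch.no_grad():
            y = policy.generate(x)                  # sample a completion
            logp_ref = ref_policy.log_prob(x, y)
            r_phi = reward_model(x, y)
        logp = policy.log_prob(x, y)
        # KL-adjusted reward: r(x, y) = r_phi(x, y) - beta * log(pi / pi_ref)
        reward = r_phi - beta * (logp.detach() - logp_ref)
        per_prompt_losses.append(-reward * logp)    # REINFORCE surrogate
    loss = torch.stack(per_prompt_losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```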
Algorithms for Learning from Preference Data
Ivison et al.[4] identified four core aspects of preference learning, in order of importance: quality of preference data, choice of learning algorithm, quality of reward models, and the policy training prompt datasets. In this section I'll focus on preference-learning algorithms.
Proximal Policy Optimization (PPO)
Recall the expression for our sampling-based policy-gradient estimate (with \(r_\phi\) replaced with the KL-adjusted reward \(r\)): \[ \nabla_\theta \mathbb{E}_{x \in \mathcal{D}_\pi, y \sim \pi(\cdot | x)} \left[r(x, y) \right] \approx \underbrace{\frac{1}{m}\sum_{i=1}^m r(x^{(i)}, y^{(i)}) \nabla_\theta \log \left(\pi_\theta(y^{(i)} | x^{(i)})\right)}_{\text{sampling-based policy-gradient estimate}} \] Consider the contribution to this estimate from one prompt \(x \in \mathcal{D}_\pi\), that is \(r(x, y) \nabla_\theta \log \left(\pi_\theta(y | x)\right)\). Define state \(s_t = x \circ y_{<t}\) as a concatenation of a prompt \(x\) and a partial completion \(y_{<t}\) (the sequence of tokens generated up to and including time step \(t-1\)). We can write the LM's trajectory as an alternating sequence of states and actions (tokens generated by our LM \(\pi_\theta\)), \(s_1, y_1, s_2, y_2, \cdots, s_H, y_H\), where \(H\) is the "horizon" or number of completion tokens generated by the LM. We can write the gradient contribution for this prompt as: \[ \begin{align} r(x, y)\nabla_\theta \log \left(\pi_\theta(y | x)\right) &= r(x, y) \nabla_\theta \log \prod_{t=1}^H \pi_\theta(y_t | s_t) && \\ &= \sum_{t=1}^H \Psi_t \nabla_\theta \log \pi_\theta(y_t | s_t) && \left(\Psi_t = r(x, y), t \in \{1,\cdots,H\}\right) \end{align} \]
For a single rollout, we compute the gradient \(\nabla_\theta \log \pi_\theta(y_t | s_t)\) for time-step \(t\) using backpropagation through our policy \(\pi_\theta\). We compute the weighted sum of each time-step's gradients (with the weights being \(\Psi_t\) for time-step \(t\)) to get the contribution of prompt \(x\) to our sampling-based policy gradient estimate. While we used the sequence-level reward \(r(x,y)\) as the weight \(\Psi_t\) for all time-steps, there are several other choices for \(\Psi_t\)[7], two of which are discussed below. Before going into them, let us also define token-level (i.e. time-step level) rewards (see Appendix F.2[4]): \[ r_t = \begin{cases} - \beta \log \frac{\pi_\theta(y_t|x \circ y_{< t})}{\pi^{ref}_\theta(y_t|x \circ y_{< t})} & 1 \leq t < |y| \\ - \beta \log \frac{\pi_\theta(y_t|x \circ y_{< t})}{\pi^{ref}_\theta(y_t|x \circ y_{< t})} + r_\phi(x, y) & t = |y| \end{cases} \] Here are some choices for \(\Psi_t\):
- The On-policy Action-Value Function: The function \(Q^{\pi_\theta}(s, y)\) denotes the expected return if the agent starts in state \(s\) and takes an action \(y\) (under the policy \(\pi_\theta\)). We can define our time-step level reward weight as: \[ \Psi_t(x,y) = Q^{\pi_\theta}(x \circ y_{< t}, y_t) \]
- The Advantage Function: Given a policy and a state \(s\), it measures how valuable it is to take a specific action relative to an average across all possible actions. Formally, we define \(A^{\pi_\theta}(s, y) = Q^{\pi_\theta}(s, y) - V^{\pi_\theta}(s)\) where the Value Function \(V^{\pi_\theta}\) maps a state \(s\) to the expected total reward that could be achieved from starting in state \(s\). We define \(\Psi_t\) as: \[ \Psi_t(x, y) = A^{\pi_\theta}(x \circ y_{< t}, y_t) \]
The choice \(\Psi_t = A^{\pi_\theta}(s_t, a_t)\) yields almost the lowest possible variance, though in practice, the advantage function is not known and must be estimated.
... \(A^{\pi_\theta}\) ... measures whether or not the action is better or worse than the policy’s default behavior. Hence, we should choose \(\Psi_t\) to be the advantage function \(A^{\pi_\theta}\) ... so that the gradient term \(\Psi_t \nabla \log \pi_{\theta}(a_t|s_t) \) points in the direction of increased \(\pi_{\theta}(a_t|s_t)\) if and only if \(A^{\pi_\theta}(s_t, a_t) > 0\).
Note that \(\Psi_t(x,y)\) includes not only a contribution from the reward received for the immediate action, \(r(s_t, y_t)\), but also from rewards associated with future actions in the rollout. For example, if \(\Psi\) is the action-value function, we have (here \(r_t = r(s_{t}, y_{t})\)): \[ Q^{\pi_\theta}(s_t, y_t) = \mathbb{E}\left[r_t + r_{t+1} + r_{t+2} + \cdots\right] \] We take the view that an action often has more influence in the short term than in the long term, and so we want \(\Psi_t\) to put less weight on future rewards (see Lecture 3[5]). We do so by using a "discount factor" \(\gamma \in (0,1)\): \[ Q^{\pi_\theta, \gamma}(s_t, y_t) = \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\right] \]
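Here is a minimal PyTorch sketch of the token-level rewards defined above (the per-token log-probabilities are assumed to have been computed, and detached, for a single rollout):

```python
import torch

def token_level_rewards(logp_policy, logp_ref, r_phi_seq, beta=0.1):
    """Per-token rewards: a KL penalty at every step, plus the sequence-level
    reward model score r_phi(x, y) added at the final token t = |y|.

    logp_policy, logp_ref: tensors of shape (|y|,) holding
    log pi_theta(y_t | x . y_<t) and log pi_ref(y_t | x . y_<t).
    r_phi_seq: scalar reward for the full (prompt, completion) pair.
    """
    rewards = -beta * (logp_policy - logp_ref)   # KL penalty at each time-step
    rewards[-1] = rewards[-1] + r_phi_seq        # terminal reward at t = |y|
    return rewards
```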
Generalized Advantage Estimation (GAE)
Including the discount in our expressions for \(Q, V\), we have: \[ A^{\pi_\theta, \gamma}(s, y) = Q^{\pi_\theta, \gamma}(s, y) - V^{\pi_\theta, \gamma}(s) \] \(Q(s_t, y_t)\) is an expectation, and we could estimate it by sampling rollouts (starting with state \(s_t\) and action \(y_t\)) and then averaging the sum of discounted rewards across samples. Such estimates have high variance (Section 3.2.2 [6]), and we would only use them if we had enough samples to get an accurate estimate. An alternate way to estimate \(Q(s_t, y_t)\) is to use, for the first \(k\) time-steps, the actual rewards observed in a sampled rollout, and then use the value function \(V\) to approximate future rewards (i.e., we use \(V(s_{t+k})\), where \(s_{t+k}\) may differ across rollout samples). I'll come back to how we can approximate \(V\). This is our updated expression for \(Q\): \[ Q^{\pi_\theta, \gamma}(s_t, y_t) = \mathbb{E}\left[r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})\right] \] Intuition behind this formulation (see Lecture 3[5]): When we use only empirical rewards (over multiple sampled rollouts) to estimate \(Q\), our estimate has zero bias (i.e., the expected value of our estimate is the true value), but likely high variance (we need a large number of samples for low variance). Using the Value Function in the computation leads to non-zero bias, but reduces variance (because the Value Function is based on several past experiences).
We can now construct an estimator for the advantage function using this \(k\)-step estimator of \(Q\) and the value function \(V\) (the expression below is for a single rollout sample): \[ \hat{A}^{(k)}(s_t, y_t) = \left[r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})\right] - V(s_t) \]
Instead of using a single value of \(k\) to estimate advantage, GAE takes an exponentially-weighted average of multiple such \(k\)-step estimators (see Section 3[8]) (here \(\hat{A}^{(k)}_t\) denotes \(\hat{A}^{(k)}(s_t, y_t)\)): \[ \hat{A}^{\text{GAE}(\lambda, \gamma)}(s_t, y_t) = (1-\lambda)\left(\hat{A}^{(1)}_t + \lambda \hat{A}^{(2)}_t + \lambda^2 \hat{A}^{(3)}_t + \cdots \right) \] Define \(\delta^V_t = \left[r_t + \gamma V(s_{t+1})\right] - V(s_t)\) (called the "temporal-difference residual of \(V\) with discount \(\gamma\)")[8]. We can simplify the expression to: \[ \hat{A}^{\text{GAE}(\lambda, \gamma)}(s_t, y_t) = \sum_{l=0}^\infty (\gamma\lambda)^l \delta_{t+l}^V \] Now let's talk about how to estimate the value function \(V\).
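Before turning to \(V\), here is a minimal PyTorch sketch of GAE for one rollout, using the backward recursion \(\hat{A}_t = \delta^V_t + \gamma\lambda\hat{A}_{t+1}\) implied by the sum above (treating \(V(s_{T+1}) = 0\) after the final token is an assumption of the sketch):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    rewards: tensor of shape (T,) with token-level rewards r_1..r_T.
    values:  tensor of shape (T,) with value estimates V(s_1)..V(s_T);
             V(s_{T+1}) is taken to be 0 after the last token.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value, next_advantage = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]    # TD residual delta_t
        next_advantage = delta + gamma * lam * next_advantage  # A_t
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```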
Value Function Estimation
The Value Function gives the expected sum of discounted future rewards, given a state (a prompt plus a partial completion). We approximate it using a value model \(V_\omega\), which usually has the same architecture as our reward model (a transformer with the LM head replaced with a regression head)[4], with \(\omega\) denoting the model's parameters. We train \(V_\omega\) by minimizing the squared error between the model's return prediction (\(V_\omega(s_t)\)) and the empirical return (\(\hat{R}_t = \sum_{l=0}^\infty \gamma^l r_{t+l}\)): \[ \mathcal{L}_V(\omega) = \mathbb{E}_{x \in \mathcal{D}_\pi, y \sim \pi_\theta(\cdot | x), t \in [1, |y|]}\left[\frac{1}{2}\left(V_\omega(x \circ y_{< t}) - \hat{R}_t\right)^2\right] \]
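Here is a minimal PyTorch sketch of the empirical returns and the value loss for one rollout (cutting the sum off at the last completion token is an assumption of the sketch):

```python
import torch

def discounted_returns(rewards, gamma=1.0):
    """R_t = sum_{l >= 0} gamma^l * r_{t+l}, computed backwards over a rollout."""
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_loss(values, returns):
    """Squared error between predicted values V_omega(s_t) and empirical returns."""
    return 0.5 * ((values - returns) ** 2).mean()
```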
Putting it together: The PPO Algorithm
PPO trains policy model weights \(\theta\) by minimizing the following loss (Appendix F.2[4]): \[ \begin{align} \mathcal{L}_\pi(\theta) &= -\mathbb{E}_t \left[ \frac{\pi_\theta (y_t|s_t)}{\pi_{\theta_{old}} (y_t|s_t)} \hat{A}^t \right] \\ &\approx -\frac{1}{|\mathcal{D}_\pi|} \sum_{\substack{x^{(i)} \in \mathcal{D}_\pi \\ y^{(i)} \sim \pi_{\theta_{old}}(\cdot | x^{(i)})}} \sum_{t=1}^{|y^{(i)}|} \frac{\pi_\theta (y^{(i)}_t|s^{(i)}_t)}{\pi_{\theta_{old}} (y^{(i)}_t|s^{(i)}_t)} \hat{A}^{(i)}_t \end{align} \] We initialize \(\pi_{\theta_{old}}\) with \(\pi^{ref}_\theta\), and update it after each gradient step. The expectation \(\mathbb{E}_t\) is over all trajectories (token sequences) sampled from the old policy \(\pi_{\theta_{old}}\), and \(\hat{A}^{(i)}_t\) denotes \(\hat{A}^{\text{GAE}(\lambda, \gamma)}(s^{(i)}_t, y^{(i)}_t)\).
This loss formulation can lead to very large gradient updates, for example when \(\pi_\theta (y_t|s_t) \gg \pi_{\theta_{old}} (y_t|s_t)\). To ensure that the updated policy does not veer too far from the old policy, we usually clip the policy probability ratio to the interval \((1-\epsilon, 1+\epsilon)\), where \(\epsilon\) is tunable, resulting in this formulation: \[ \mathcal{L}_\pi(\theta) = -\mathbb{E}_t \left[ \min \left( \frac{\pi_\theta (y_t|s_t)}{\pi_{\theta_{old}} (y_t|s_t)} \hat{A}^t, \text{clip}\left(\frac{\pi_\theta (y_t|s_t)}{\pi_{\theta_{old}} (y_t|s_t)}, 1-\epsilon, 1+\epsilon \right)\hat{A}^t \right)\right] \] In practice, we optimize the Policy and Value models jointly by minimizing the combined loss: \[ \mathcal{L}_{PPO}(\theta, \omega) = \mathcal{L}_\pi(\theta) + \alpha \cdot \mathcal{L}_V(\omega) \]
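Here is a minimal PyTorch sketch of the clipped policy loss, operating on per-token log-probabilities flattened across rollouts (the precomputed-tensor inputs are an assumption for illustration; the combined loss would then be `ppo_clip_loss(...) + alpha * value_loss(values, returns)` using the value loss sketched earlier):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    logp_new:   log pi_theta(y_t | s_t), differentiable, shape (N,)
    logp_old:   log pi_theta_old(y_t | s_t), treated as a constant, shape (N,)
    advantages: GAE estimates, treated as constants, shape (N,)
    """
    ratio = torch.exp(logp_new - logp_old.detach())             # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```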
Direct Preference Optimization (DPO) [9]
The PPO-RLHF pipeline is complex (just take a look at Figure 1[6]). It requires training three models (the reward model \(r_\phi\), the value function \(V_\omega\), and the policy \(\pi_\theta\)). DPO proposes a simple approach for learning from preference data, without having to train any reward or value models, or having to sample "rollouts" from interim policy models.
The DPO Objective
DPO considers the following objective (it equals the expected reward under a policy \(\pi_\theta\), less a KL penalty which ensures the policy stays close to the reference LM \(\pi^{SFT}\)): \[ \max_\theta \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot | x)} \left[r(x,y) \right] - \beta \mathbb{D}_{KL}\left[ \pi_\theta(y|x) || \pi^{SFT}(y|x)\right] \] Here \(r\) is a general reward function (which we do not know), and \(\beta\) is a hyperparameter. The authors show that the optimal solution to this optimization problem takes the following form: \[ \pi(y|x) = \frac{1}{Z(x)} \pi^{SFT}(y|x) \exp{\left(\frac{1}{\beta} r(x,y)\right)} \] The authors note that the partition function, \(Z(x) = \sum_y \pi^{SFT}(y|x) \exp{\left(\frac{1}{\beta} r(x,y)\right)}\), is hard to estimate - even if we use a learned reward model \(r_\phi\) (like the one we learnt in our RLHF pipeline above), we'll need to compute this sum over a large number of \(y\) samples, which makes it expensive. It turns out that we don't need to estimate the partition function! Take logarithms on both sides of the equation, and rearrange to get: \[ r(x,y) = \beta \log \frac{\pi(y|x)}{\pi^{SFT}(y|x)} + \beta \log Z(x) \] Using this formulation of reward, consider a preference pair datapoint \((x, y^+, y^-)\). Under the Bradley-Terry model, we can express the probability that \(y^+\) is the preferred completion (relative to \(y^-\)) for prompt \(x\). This probability can be written entirely in terms of the policy \(\pi\), as the terms involving \(Z\) cancel out! \[ \begin{align} p(y^+, y^- | x) &= \sigma\left(r(x,y^+) - r(x,y^-)\right) \\ &= \sigma\left(\beta \log \frac{\pi(y^+|x)}{\pi^{SFT}(y^+|x)} - \beta \log \frac{\pi(y^-|x)}{\pi^{SFT}(y^-|x)}\right) \end{align} \]
We can now find the optimal policy by maximizing the log-likelihood of our preference pair dataset \(\mathcal{D}\) under the preference-probability model above: \[ \mathcal{L}_{DPO}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x, y^+, y^-) \in \mathcal{D}} \log \sigma\left(\beta \log \frac{\pi_\theta(y^+|x)}{\pi^{SFT}(y^+|x)} - \beta \log \frac{\pi_\theta(y^-|x)}{\pi^{SFT}(y^-|x)}\right) \]
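Here is a minimal PyTorch sketch of this loss over a batch of preference pairs (the sequence-level log-probabilities are assumed to have been computed already, with the reference model's values detached):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for a batch of preference pairs (x, y+, y-).

    logp_pos / logp_neg:         log pi_theta(y+|x), log pi_theta(y-|x), shape (B,)
    ref_logp_pos / ref_logp_neg: the same quantities under the frozen SFT model.
    """
    # Implicit reward of each completion: beta * log(pi_theta / pi_SFT).
    chosen = beta * (logp_pos - ref_logp_pos)
    rejected = beta * (logp_neg - ref_logp_neg)
    # Negative log-likelihood under the Bradley-Terry preference model.
    return -F.logsigmoid(chosen - rejected).mean()
```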
DPO vs. PPO
DPO is superior to PPO as far as ease of training is concerned (no need to train reward and value models; no need to sample rollouts in each training step). When it comes to performance, a number of research works find PPO superior to DPO. For example, Xu et al. [10] note:
Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains.
Given the large number of variables involved in learning from preference data (be it PPO or DPO), I believe I'll have to keep updating this section frequently.
Group Relative Policy Optimization (GRPO)
GRPO (notably used in DeepSeek R1[11]) was introduced by Shao et al. (2024)[12] as:
... a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
GRPO does not require training the Value model \(V_\omega\) (thus saving memory), and instead estimates the Advantage in a different manner. For each prompt \(x \in \mathcal{D}\), we first draw \(G\) samples from \(\pi_{\theta_{old}}(\cdot | x)\), denoted by \((y_{o_1}, y_{o_2},\cdots, y_{o_G})\) (a "group" of completions). We use our reward model \(r_\phi\) to score each of the \(G\) completions to get a set of rewards \(\mathbf{r} = (r_{o_1}, \cdots, r_{o_G})\). We do not adjust the model reward for a per-token KL penalty (like we did above for PPO and DPO). Instead, we add a KL term directly to the GRPO loss. In the version of GRPO where we use RL to supervise the eventual outcome (and not the process), we use the same Advantage estimate \(\hat{A}_{i}^{t}\) for each token (at time-step \(t\)) in the completion \(y_{o_i}\). We estimate this Advantage as: \[ \hat{A}_{i}^{t} = \frac{r_{o_i} - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})} \]
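Here is a minimal PyTorch sketch of the group-relative advantage (the small epsilon added to the standard deviation is an assumption for numerical stability, not part of the formula above):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group.

    rewards: tensor of shape (G,), one reward-model score per sampled completion.
    Returns one advantage per completion, shared by all of its tokens.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```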
The GRPO loss takes a form similar to the clipped PPO loss (with an additional KL term - more on that below). Note that the expression below is written for just one prompt \(x\); in practice we would estimate this loss by averaging over a batch of prompts drawn from our prompt dataset \(\mathcal{D}\). \[ \mathcal{L}_{GRPO}(\theta) = -\left[\frac{1}{G} \sum_{i=1}^G \frac{1}{|y_{o_i}|} \sum_{t=1}^{|y_{o_i}|} \min \left( \frac{\pi_\theta (y_{o_i,t}|s_{o_i,t})}{\pi_{\theta_{old}} (y_{o_i,t}|s_{o_i,t})} \hat{A}^t_i, \text{clip}\left(\frac{\pi_\theta (y_{o_i,t}|s_{o_i,t})}{\pi_{\theta_{old}} (y_{o_i,t}|s_{o_i,t})}, 1-\epsilon, 1+\epsilon \right)\hat{A}^t_i \right) - \beta \mathbb{D}_{KL} \left[\pi_\theta || \pi_\theta^{ref}\right]\right] \]
The authors estimate the KL term in \(\mathcal{L}_{GRPO}\) using an unbiased estimator proposed by Schulman[13]: \[ \hat{\mathbb{D}}_{KL} \left[\pi_\theta || \pi_\theta^{ref}\right] = \frac{\pi_\theta^{ref} (y_{o_i,t}|s_{o_i,t})}{\pi_{\theta} (y_{o_i,t}|s_{o_i,t})} - \log \frac{\pi_\theta^{ref} (y_{o_i,t}|s_{o_i,t})}{\pi_{\theta} (y_{o_i,t}|s_{o_i,t})} - 1 \]
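Here is a minimal PyTorch sketch that combines the clipped term with this per-token KL estimate into the GRPO loss for one prompt's group (placing the KL estimate inside the per-token average follows the indices of the estimator above; the list-of-tensors interface is an assumption for illustration):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """GRPO loss (to be minimized) for the G completions sampled for one prompt.

    logp_new[i], logp_old[i], logp_ref[i]: per-token log-probabilities of
    completion i under pi_theta, pi_theta_old, and pi_ref (the latter two
    treated as constants); advantages[i] is its group-relative advantage.
    """
    per_completion = []
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = torch.exp(lp_new - lp_old)                     # pi_theta / pi_old
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        surrogate = torch.min(ratio * adv, clipped * adv)      # clipped PPO-style term
        # Schulman's estimator: exp(q) - q - 1 with q = log(pi_ref / pi_theta).
        q = lp_ref - lp_new
        kl = torch.exp(q) - q - 1
        per_completion.append((surrogate - beta * kl).mean())  # average over tokens
    return -torch.stack(per_completion).mean()                 # average over the group
```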
References
- Stanford University, CS224N: Natural Language Processing with Deep Learning, Lecture 10
- Eric Mitchell, 2023, lecture on RLHF: Algorithms and Applications
- OpenAI, 2022, Training language models to follow instructions with human feedback
- Ivison et al., 2024, Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
- Prof. Pieter Abbeel, Foundations of Deep RL Lecture Videos
- Zheng et al., 2023, Secrets of RLHF in Large Language Models Part I: PPO
- OpenAI, Spinning Up, Other forms of Policy Gradients
- Schulman et al., 2015, High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Rafailov et al., 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Xu et al., 2024, Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
- DeepSeek AI, DeepSeek-R1: Incentivizing Reasoning Capabilities in LLMs via Reinforcement Learning
- Shao et al., 2024, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- John Schulman, 2020, Approximating KL Divergence