Nemotron-4 340B Technical Report

18 Jun 2024

NVIDIA released a technical report detailing the data curation, training, and alignment methodologies for their 340B model family, Nemotron-4 (comprising base and instruct models, and a reward model). Things I found interesting:

  • Heavy (and successful) use of synthetic data for both supervised finetuning and alignment with human preferences.
  • Introduction of a new preference tuning method RPO, which fixes some issues with DPO.
  • Use of intermediate checkpoints of aligned models to improve model performance.

Training Methodology

  1. Pretraining

    • Dataset: Comprises English natural language data (70%), multilingual data in 53 languages (15%), and source code in 43 programming languages (15%), for a total of 9 trillion tokens.
    • Training: Performed in two phases: Phase 1 on 8T tokens, and Phase 2 on 1T tokens. The Phase 2 tokens were sampled from the master corpus (of 9T tokens) while placing a higher weight on (1) "higher quality sources" (not sure how quality was determined), and on (2) text containing question-answer style data (so that the model learns to respond to a question with an answer to that question). A rough sketch of such a re-weighted blend appears after this list.
    • Results: The base model produced by this pretraining beats Llama3 70B and Mixtral 8x22B on all the benchmarks the authors tested against (MMLU, HellaSwag, Winogrande, ARC-Challenge, BBH, HumanEval), and Qwen-2 72B on four of them.
  2. Alignment: Supervised Finetuning (SFT)

    • Typical SFT datasets are a mixture of tasks. The authors hypothesize that this hurts performance on some tasks, so they propose a two-stage SFT strategy with a different task distribution in each stage's dataset.
    • Stage 1 of SFT uses just coding data, containing 800K datapoints; the model is finetuned for 1 epoch on this dataset. In Stage 2, a task-diverse dataset of 200K datapoints is used (2% of these are datapoints from the Stage 1 coding SFT dataset).
  3. Alignment: Preference Finetuning (PFT)

    • Overview: The authors' PFT strategy uses multiple iterations of Direct Preference Optimization (DPO) and their proposed algorithm, Reward-aware Preference Optimization (RPO). They train a checkpoint using DPO and then, using it as the reference policy, further train it with RPO. They perform three iterations of RPO, using one iteration's final policy to initialize the next.
    • DPO Loss Adjustment: DPO aims to maximize the implicit reward gap between the chosen and rejected responses of the policy model (the LLM we will eventually use) for a given prompt. The authors note that during DPO training, the policy model's likelihoods for both chosen and rejected responses decline, even as their reward gap keeps widening, despite the chosen response being high quality (!).

      To mitigate this, the authors propose adding a weighted SFT loss on the chosen responses in addition to the DPO loss. This prevents the policy model from drifting too far from the preference data (the prompt-chosen response pairs); in other words, it ensures the policy model's likelihood of chosen responses (given their prompts) doesn't decline. A rough sketch of this combined objective appears after this list.

    • DPO Dataset: The authors select preference triplets (prompt, chosen response, rejected response) which have high-quality "chosen responses". Chosen responses that correspond to the ground truth for a prompt are considered high quality; in other cases, the authors use the Nemotron-4 340B reward model's score as a measure of quality. They end up with 160K datapoints in this preference dataset, and use it to train the model for 1 epoch.

    • RPO Loss: In some preference triplets the rejected and chosen responses are close in quality, while in others the gap is quite wide. DPO doesn't use the extent of this quality gap, which can cause the policy model to assign lower likelihoods (than their quality "deserves") to not-so-bad rejected responses.

      To fix this, the authors propose RPO, which approximates the reward gap using the implicit reward defined by the policy model (see the sketch after this list).

    • RPO Dataset: Similar to the construction of the DPO preference dataset, the authors filter for high-quality chosen responses, although with a less strict quality filter. They end up with 300K datapoints.
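
As a rough illustration of the Phase 2 re-weighted sampling mentioned under Pretraining, here is a minimal sketch; the source names and weights below are purely illustrative (the report doesn't give them):

```python
import random

# Hypothetical Phase 2 blend: up-weight "higher quality" and QA-style text
# relative to the rest of the 9T-token corpus. Names and numbers are
# illustrative only, not taken from the report.
PHASE2_WEIGHTS = {
    "high_quality_sources": 3.0,
    "qa_style_text": 3.0,
    "general_web": 1.0,
    "multilingual": 1.0,
    "code": 1.0,
}

def sample_phase2_source(rng: random.Random) -> str:
    """Pick which source bucket the next Phase 2 training document comes from."""
    sources = list(PHASE2_WEIGHTS)
    weights = [PHASE2_WEIGHTS[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

# Example: draw a few source buckets for Phase 2 documents.
rng = random.Random(0)
print([sample_phase2_source(rng) for _ in range(5)])
```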
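
For the DPO loss adjustment described above, here is a minimal sketch of the combined objective, assuming per-sequence log-probabilities have already been computed; beta and the SFT weight are illustrative values, not the report's:

```python
import torch
import torch.nn.functional as F

def dpo_plus_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, sft_weight=1.0):
    """DPO loss plus a weighted SFT (negative log-likelihood) term on the
    chosen responses. All *_logps are summed per-sequence log-probabilities
    (torch tensors of shape (batch,)). beta and sft_weight are illustrative."""
    # DPO's implicit reward: beta * log(pi_policy / pi_ref)
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: widen the implicit reward gap between chosen and rejected
    dpo_term = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # SFT term: keep the likelihood of the chosen responses from declining
    sft_term = -policy_chosen_logps.mean()

    return dpo_term + sft_weight * sft_term
```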
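
And a sketch of the RPO idea: rather than only widening the implicit reward gap, penalize its distance from the gap measured by the reward model. The squared-error distance and the scaling constants here are stand-ins; the paper's exact formulation may differ:

```python
def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             rm_chosen_reward, rm_rejected_reward,
             beta=1.0, eta=1.0):
    """Illustrative RPO-style loss: align the policy's implicit reward gap
    with the gap assigned by an external reward model (e.g. Nemotron-4 340B
    Reward). Inputs are torch tensors of shape (batch,); the squared-error
    distance is a stand-in for the paper's choice."""
    implicit_gap = beta * ((policy_chosen_logps - ref_chosen_logps)
                           - (policy_rejected_logps - ref_rejected_logps))
    reward_model_gap = eta * (rm_chosen_reward - rm_rejected_reward)
    return ((implicit_gap - reward_model_gap) ** 2).mean()
```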

Synthetic Generation of Alignment Data

The authors synthetically generate more than 98% of the data used for alignment (SFT and PFT). Both SFT pairs (prompt, response) and PFT triplets (prompt, chosen, rejected) rely on a prompt. The authors describe a strategy for curating a diverse set of prompts, followed by the creation of SFT- and PFT-specific data.
  1. Prompt Generation: Prompts are generated synthetically using the Mixtral 8x7B Instruct v0.1 model. The authors aim for three types of diversity in prompts: (1) task (e.g. writing, open QA), (2) topic (e.g. economics, daily life), and (3) instruction (e.g. JSON output, number of paragraphs in the response). They generate single-turn, two-turn, and instruction-following prompts.
    • Single Turn: A list of diverse macro topics is generated using the generator. More fine-grained topics are then generated by asking the generator to output related sub-topics. Finally, some topics are added manually to end up with a list of 3K topics. For each task, the generator is given a randomly sampled topic and asked to generate a prompt; it is then asked to regenerate the prompt with more details (the authors found prompts generated on the first attempt to be short). A rough sketch of this pipeline appears after this list.
    • Two Turn: The format of a two-turn prompt is "User: xxx; Assistant: xxx; User: xxx". To create this, the authors sample a prompt from ShareGPT, use it as the first user turn, and then generate the assistant response and the next user question with "intermediate instruct models" (see the section on Iterative Alignment below - this refers to Mixtral 8x7B Instruct v0.1 in the first iteration, and to intermediate Nemotron-4 340B Instruct checkpoints in subsequent iterations).
    • Instruction Following Prompts: Start with a random single-turn synthetic prompt, and add a synthetic instruction (e.g. "Your response should have two paragraphs.").
  2. SFT Data
    • Each SFT datapoint is a dialogue between "user" and "assistant" roles, with three turns in total. The authors start with a prompt (from the prompt set described above) and get an instruct model (the same "intermediate instruct model" referred to in the Two Turn section above) to generate a response.
    • The three-turn dialogue is generated iteratively (five iterations in total), with the model simulating either the User or the Assistant. For User turns, the model is given cues about the expected behavior, e.g. "Make the question complex." A sketch of this self-play loop appears after this list.
  3. PFT Data
    • First, train a reward model (Nemotron-4 340B Reward) using 10K human-annotated datapoints from HelpSteer2. Next, from a diverse prompt set (containing synthetic single-turn/instruction-following/two-turn prompts, as well as real-world prompts from sources like LMSYS), generate responses using multiple models. Finally, generate (prompt, chosen response, rejected response) triplets based on preference rankings assigned to response pairs.
    • To determine the "preferred" response: wherever a ground-truth response to a prompt is available, use it as the "chosen response" and the other as rejected. In other cases, use "Reward-Model-as-Judge", i.e. use the Nemotron-4 340B Reward model to predict a reward for each response (given the same prompt). A sketch of this selection logic appears after this list.
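
As a rough illustration of the single-turn prompt pipeline above, here is a sketch assuming a hypothetical generate(instruction) wrapper around the generator model (Mixtral 8x7B Instruct v0.1 in the first iteration); the instruction wording is illustrative:

```python
import random

def synthesize_single_turn_prompts(generate, tasks, n_macro_topics=20):
    """Sketch of the topic -> sub-topic -> prompt pipeline.

    `generate` is a hypothetical callable that sends an instruction to the
    generator model and returns its text response."""
    # 1. Ask the generator for diverse macro topics, then expand each into
    #    finer-grained sub-topics (some topics are also added manually).
    topics = generate(f"List {n_macro_topics} diverse macro topics.").splitlines()
    subtopics = []
    for t in topics:
        subtopics += generate(f"List a few sub-topics related to: {t}").splitlines()

    # 2. For each task, pair it with a randomly sampled topic, ask for a
    #    prompt, then ask the generator to add detail (first-attempt prompts
    #    tended to be short).
    prompts = []
    for task in tasks:
        topic = random.choice(topics + subtopics)
        draft = generate(f"Write a {task} prompt about the topic: {topic}")
        prompts.append(generate(f"Rewrite this prompt, adding more details: {draft}"))
    return prompts
```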
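
A similar sketch of the SFT dialogue self-play loop, reusing the same hypothetical generate wrapper; the exact cue wording and turn structure in the report may differ:

```python
def synthesize_dialogue(generate, prompt, num_model_turns=5):
    """Grow a multi-turn dialogue by letting the model alternately play the
    Assistant and the User roles. `generate` is the hypothetical generator
    wrapper; the cue wording is illustrative."""
    dialogue = [("user", prompt)]
    for _ in range(num_model_turns):
        if dialogue[-1][0] == "user":
            # Model plays the Assistant: answer the latest user turn.
            reply = generate(f"Respond as the assistant:\n{dialogue}")
            dialogue.append(("assistant", reply))
        else:
            # Model plays the User: follow up, guided by a behavior cue.
            follow_up = generate(
                f"Ask the next user question. Make the question complex.\n{dialogue}")
            dialogue.append(("user", follow_up))
    return dialogue
```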
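
And a sketch of how the (prompt, chosen, rejected) triplets could be assembled, with reward standing in for a call to the Nemotron-4 340B Reward model (hypothetical interface):

```python
def build_preference_triplets(prompt, responses, reward, ground_truth=None):
    """Assemble (prompt, chosen, rejected) triplets for one prompt.

    `responses` are candidate responses from multiple models, `reward` is a
    hypothetical callable (prompt, response) -> scalar score from the reward
    model, and `ground_truth` is an optional reference answer."""
    if ground_truth is not None:
        # Ground truth, when available, is the chosen response; the model
        # responses become the rejected side of each pair.
        return [(prompt, ground_truth, r) for r in responses]

    # Reward-Model-as-Judge: score every candidate, then pair the
    # highest-scoring response against the lowest-scoring one.
    ranked = sorted(responses, key=lambda r: reward(prompt, r), reverse=True)
    return [(prompt, ranked[0], ranked[-1])]
```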

Iterative Alignment

After pre-training to get the Nemotron-4 340B Base model (referred to as 340B Interm-1 Base in the paper), the authors conducted alignment (SFT + PFT) "iteratively". This is how they did it:
  1. Iteration 1: Generate synthetic data using Mixtral 8x7B Instruct v0.1. Use the data for SFT+PFT on Nemotron-4 340B Interm-1 Base to get an intermediate checkpoint 340B Interm-1 Instruct.
  2. Iteration 2: Conduct pre-training again - this time up-weighting those datapoints in Phase 2 for which 340B Interm-1 Instruct's accuracy was low - to get 340B Interm-2 Base. Generate synthetic data using 340B Interm-1 Instruct. Use this data for SFT+PFT on 340B Interm-2 Base to get 340B Interm-2 Instruct. A schematic of this loop is sketched below.
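
Schematically, the iterative alignment loop might look like the following sketch, where all the callables passed in (pretrain, sft, pft, generate_alignment_data) are hypothetical stand-ins for the pipeline stages described above:

```python
def iterative_alignment(pretrain, sft, pft, generate_alignment_data,
                        initial_generator, num_iterations=2):
    """Sketch of the iterative alignment loop. All arguments are hypothetical
    callables standing in for the stages described above; `initial_generator`
    is the Mixtral 8x7B Instruct v0.1 model used in iteration 1."""
    generator = initial_generator
    base = pretrain(reweight_from=None)                # 340B Interm-1 Base

    instruct = None
    for _ in range(num_iterations):
        # Generate synthetic SFT and preference data with the current generator.
        sft_data, pft_data = generate_alignment_data(generator)

        # Align the current base model (SFT followed by preference tuning).
        instruct = pft(sft(base, sft_data), pft_data)  # 340B Interm-i Instruct

        # Prepare the next round: re-run Phase 2 pretraining, up-weighting data
        # the current instruct model gets wrong, and promote the instruct model
        # to data generator.
        base = pretrain(reweight_from=instruct)
        generator = instruct

    return instruct
```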