Getting Started with NVIDIA NeMo RL + LLM
A practical guide to teaching LLMs to learn from feedback using NVIDIA NeMo RL - no PhD required.
This post is part of my ongoing journey into AI ethics and LLM training. See my first post for context on why I'm exploring this space.
Someone posted a job wanting RL experience to optimize a tree classifier. My first thought: that's catastrophically over-engineered. My second thought: I don't actually know enough about RL to be sure.

So I spent two weeks finding out.
I've spent the last two weeks diving into NVIDIA NeMo RL, their toolkit for training LLMs using reinforcement learning. This post is what I wish I had when I started: a practical, no-fluff guide to getting started with RL + LLMs without a PhD.
What is RL + LLM? (And Why Should You Care?)
If you've fine-tuned an LLM, you've done supervised fine-tuning (SFT). You give the model input-output pairs, it learns to mimic them. Simple, effective, but limited.
The problem: SFT teaches the model to repeat what's in the training data. It doesn't teach the model to reason, prefer certain outputs, or improve through feedback.
This is where Reinforcement Learning from Human Feedback (RLHF) comes in. Instead of just copying examples, the model learns from preferences:
- "This response is better than that one"
- "This code is more efficient"
- "This explanation is clearer"

NVIDIA NeMo RL is a toolkit that makes this practical. It supports multiple training methods:
| Method | What It Does | Best For |
|---|---|---|
| PPO (Proximal Policy Optimization) | Classic RL, uses reward model | Complex tasks, multi-objective |
| DPO (Direct Preference Optimization) | Simplifies RLHF, no reward model needed | Chat, general alignment |
| GRPO | Group-relative optimization, ~50% less memory than PPO | Reasoning, math, code |
The key insight: these methods don't just teach the model what to say. They teach it how to decide what's better.
Why This Matters Now
Here's why I'm spending time on this instead of just fine-tuning with LoRA:
- Reasoning improvements - Recent papers show RL-trained models outperform SFT on math, code, and complex reasoning tasks (DeepSeek-R1, OpenAI o1)
- Alignment is practical now - DPO makes RLHF accessible without massive compute budgets
- Industry demand - The job posting wasn't a fluke. Companies are building agents that need RL-trained decision making
- It's genuinely interesting - Watching a model learn from preferences feels closer to "actual learning" than gradient descent on text
Core Concepts You Need to Understand
Before running training, you need to understand these four concepts. Skip this and you'll waste hours debugging things that make sense once you know the basics.
1. The Reward Model
The reward model is a separate LLM that scores outputs. You train it on human preference data: "Output A is better than Output B."
Why it matters: If your reward model is wrong, your training will optimize for the wrong thing. This leads to reward hacking.
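To make "trained on preferences" concrete, here is the standard Bradley-Terry pairwise loss that reward models are typically trained with, sketched in plain Python. This is an illustration of the math, not NeMo RL's API:

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: trains the reward model to score
    the chosen response above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The loss shrinks as the margin in favor of the chosen response grows
print(reward_model_loss(2.0, 0.5))  # already ranks chosen higher: small loss
print(reward_model_loss(0.5, 2.0))  # ranks the wrong one higher: large loss
```

The gradient pushes the two scores apart, which is all "Output A is better than Output B" means numerically.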
2. The Policy Model
This is the LLM you're actually training. It generates outputs, receives rewards, and updates its behavior.
Key constraint: PPO uses a clipping mechanism to limit per-step updates, and a KL penalty against a frozen reference model to prevent the policy from drifting too far overall. DPO controls this through its beta parameter. Too much drift = catastrophic forgetting. Too little = no learning.
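The clipping mechanism above can be sketched for a single token in plain Python. This is illustrative; `clip_eps = 0.2` is PPO's common default, not a NeMo RL setting:

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO's clipped surrogate objective for one token: the probability
    ratio is clipped so a single update can't move the policy too far."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Pessimistic bound: take the smaller of the raw and clipped estimates
    return min(ratio * advantage, clipped * advantage)

# exp(0.5) ~ 1.65, but the objective is capped at 1 + clip_eps = 1.2
print(ppo_clipped_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))
```

Once the ratio leaves the `[1 - clip_eps, 1 + clip_eps]` band, the objective stops rewarding further movement in that direction: that's the per-step drift limit.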
3. The Training Loop

For PPO and GRPO (online RL methods):
1. Policy generates outputs
2. Reward model scores outputs
3. PPO/GRPO updates policy based on rewards
4. Repeat
DPO works differently: it's an offline method. No generation, no reward model at training time. It optimizes the policy directly on a static dataset of preference pairs, using a modified cross-entropy loss. Structurally, it's closer to SFT than to PPO.
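The DPO loss for a single preference pair can be sketched in plain Python, with each argument being a log-probability summed over the response tokens. A hedged illustration of the math from the DPO paper, not a library call:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from summed response log-probs.
    beta controls how far the policy may drift from the reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid(.))

# Policy favors the chosen response more than the reference does:
# positive margin, loss below log(2)
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))
```

Note the reward model never appears: the reference model's log-probs play its role, which is why DPO's memory footprint is so much smaller.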
4. Preference Data
RLHF needs pairs of responses with preferences. Format:
[
  {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses qubits that can be in superposition...",
    "rejected": "I don't know, quantum stuff is confusing"
  }
]
You need hundreds to thousands of these for meaningful training.
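Malformed records are a common source of confusing trainer errors, so a small validation pass is worth the few lines. A plain-Python sketch using the field names from the format above:

```python
import json

def validate_preferences(records):
    """Check every record has the three fields DPO-style trainers expect."""
    for i, row in enumerate(records):
        missing = {"prompt", "chosen", "rejected"} - row.keys()
        if missing:
            raise ValueError(f"record {i} is missing {missing}")
    return records

raw = json.loads("""
[
  {"prompt": "Explain quantum computing",
   "chosen": "Quantum computing uses qubits that can be in superposition...",
   "rejected": "I don't know, quantum stuff is confusing"}
]
""")
pairs = validate_preferences(raw)
print(f"{len(pairs)} valid preference pairs")
```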
Setting Up NVIDIA NeMo RL
Getting started isn't trivial. Here's what you need.

Requirements
- GPU: NVIDIA GPU with CUDA 12.x (A100, H100, or newer recommended)
- VRAM: 40-80GB for 7B models with PPO (depending on optimization settings), 320GB+ for 70B models (multi-node, 4-8x H100s)
- Storage: 100GB+ for models and datasets
- Python: 3.10+
Installation
NVIDIA provides NeMo RL for large-scale RLHF training. Note: the older NeMo-Aligner repo is now archived; NeMo RL is the current toolkit.
# Clone the NeMo RL repository
git clone https://github.com/NVIDIA/NeMo-RL.git nemo-rl --recursive
cd nemo-rl
# Install with uv (recommended by NVIDIA)
pip install uv
uv sync
# See the official install docs for full setup:
# https://docs.nvidia.com/nemo/rl/latest/about/installation.html
For simpler DPO training, you can also use TRL (Transformer Reinforcement Learning) from HuggingFace:
pip install trl
Verify Installation
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
Running Your First Experiment
Here's a minimal example to understand the flow. This is pseudocode to illustrate the concept; real implementations require more setup.
Step 1: Prepare Preference Data
Create a JSON file with your preference pairs:
[
  {
    "prompt": "Write a function to add two numbers",
    "chosen": "def add(a, b):\n    return a + b",
    "rejected": "just use the + operator"
  },
  {
    "prompt": "What is Python?",
    "chosen": "Python is a high-level programming language known for its readability...",
    "rejected": "Python is a snake"
  }
]
Step 2: Configure Training (Pseudocode)
# Simplified example - see TRL docs for full API details
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# Load the preference pairs from Step 1 (path is illustrative)
preference_data = load_dataset("json", data_files="preferences.json", split="train")

training_args = DPOConfig(
    beta=0.1,  # KL coefficient - limits policy drift
    learning_rate=1e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # None only works correctly with PEFT/LoRA - pass an explicit ref model for full-weight training
    args=training_args,
    train_dataset=preference_data,
    processing_class=tokenizer,
)

trainer.train()
Step 3: Monitor Training

What to Watch
| Metric | Good Sign | Bad Sign |
|---|---|---|
| Reward | Increasing | Flat or decreasing |
| KL Divergence | Stable (varies by method) | Exploding |
| Loss | Decreasing (DPO) or stable oscillation (PPO) | NaN or inf |
| GPU Memory | Stable | OOM errors |
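The KL divergence row above is usually tracked with a simple Monte Carlo estimate: the mean gap between policy and reference log-probs over the tokens the policy actually sampled. A plain-Python sketch, not a library API:

```python
def kl_estimate(logp_policy, logp_ref):
    """Monte Carlo KL estimate over sampled tokens: the mean gap
    between policy and reference log-probabilities."""
    gaps = [p - r for p, r in zip(logp_policy, logp_ref)]
    return sum(gaps) / len(gaps)

# The policy assigns its sampled tokens higher probability than the
# reference does; this positive gap is the drift you watch during training
print(kl_estimate([-1.0, -0.5, -2.0], [-1.2, -0.9, -2.5]))
```

A KL that climbs without bound means the policy is leaving the reference distribution behind, which is the "exploding" bad sign in the table.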
What Makes This Hard (The Skeptical View)

Now for the part where I tell you why you'll want to throw your GPU out a window:
1. RL Training is Unstable
The same hyperparameters that work one day fail the next. One ICLR study found that optimizer choice alone causes 6x higher logprob variance: PyTorch vs TensorFlow Adam, same config, wildly different training dynamics. Training can diverge, collapse, or produce garbage. This is not "set and forget."
2. Reward Hacking
The model finds ways to maximize the reward without actually solving the task. Lilian Weng's deep dive on reward hacking covers this thoroughly: models trained with RLHF can become "convincingly wrong," not just wrong. Classic examples:
- Generating excessively verbose or confident-sounding outputs that score high but add no substance
- Exploiting patterns in the reward model's preferences rather than genuinely improving quality
3. Data Requirements
You need high-quality preference data. Lots of it. This is often the bottleneck, not the model or compute.
4. GPU Memory
A 7B model with PPO can need 40-80GB of VRAM. Why so high? PPO requires running four model copies simultaneously:
- Policy model (the LLM being trained)
- Reference policy (frozen copy, used for KL divergence)
- Reward model (scores outputs)
- Value model (estimates future rewards)
That's 4 models in memory at once, plus gradients and optimizer states. 70B models need multi-GPU setups (multiple H100s).
If you only have 24GB: Try DPO instead; it doesn't need a separate reward or value model. A 7B DPO run fits on a single A100 (40GB), or on 24GB with LoRA/QLoRA.
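A rough back-of-envelope shows why PPO's numbers climb so fast. This naive accounting assumes bf16 weights for all four copies plus ~12 bytes/param of gradients and Adam state for the trained policy, and ignores activations and KV cache entirely, so treat it as a sketch, not a sizing tool:

```python
def vram_floor_gb(params_b: float, n_models: int = 4,
                  weight_bytes: int = 2, train_state_bytes: int = 12) -> float:
    """Naive VRAM floor in GB: bf16 weights for every model copy, plus
    gradients and Adam state for the one model being trained.
    Activations and KV cache are ignored, so real usage is higher still."""
    weights = params_b * n_models * weight_bytes
    training_state = params_b * train_state_bytes
    return weights + training_state

print(f"7B PPO, no tricks: ~{vram_floor_gb(7):.0f} GB")
```

This naive floor lands well above the 40-80GB range quoted earlier, which is exactly why practical runs lean on LoRA, sharing backbones between the reward and value models, and offloading optimizer state.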
5. It's Still Research
Unlike fine-tuning (well-trodden path), RL + LLM is cutting edge. Best practices are still emerging. You'll encounter issues with no clear answers.
What's Next
The job posting was over-engineered. But two weeks of investigating it taught me more about how LLMs actually learn than six months of LoRA fine-tuning did. That trade-off was worth it.
If you're starting out, go with DPO: no reward model, it fits on a single GPU, and it's the closest thing to the SFT you already know. If that goes well, GRPO is the next step up without the memory tax of PPO.
My plan:
- Run a small DPO experiment this week (even on a tiny dataset)
- Try GRPO on a reasoning task
- Compare results with standard fine-tuning
Resources & References

Papers (Read in This Order)
- DPO: Your Language Model is Secretly a Reward Model - Start here, simplest method
- DeepSeekMath (introduces GRPO) - Group-relative optimization, efficient
- PPO: Proximal Policy Optimization - The classic (if you have time)
Practical Guides
- RLHF in 2024 with DPO & HuggingFace - Philipp Schmid's end-to-end DPO walkthrough on a single GPU
- Align LLMs in 2025 with DPO & Synthetic Data - On-policy synthetic preferences on RTX 4090
- Fine-tune Llama 2 with DPO - The canonical TRL DPO tutorial
- RLHF 101: A Technical Tutorial - CMU's end-to-end RLHF implementation (22.9% → 48.3% AlpacaEval)
Deep Dives
- The N Implementation Details of RLHF with PPO - Why PPO is so fragile (ICLR 2024)
- Reward Hacking in RL - Lilian Weng's taxonomy of reward hacking failure modes
- GRPO Explained - Cameron Wolfe on how GRPO eliminates the critic model
- PPO for LLMs: A Guide for Normal People - Why PPO is the complexity ceiling
- Advanced Understanding of GRPO - HuggingFace LLM Course with PyTorch implementation
Code & Docs
- NVIDIA NeMo RL Docs · GitHub · Examples
- TRL Library · GitHub
- OpenRLHF
- DeepSeek-R1
- HuggingFace Deep RL Course - RLHF Unit
- VRAM Requirements for LLM Fine-tuning - Modal's GPU memory breakdown by model size
Last updated: March 13, 2026