Theory & Research
RLVR paradigm and research foundations
The RLVR Paradigm
Reinforcement Learning from Verifiable Rewards (RLVR) uses deterministic, programmatic signals instead of human feedback.
Traditional RLHF Problems
| Problem | RLHF | RLVR |
|---|---|---|
| Signal source | Human annotators | Compilers, tests, APIs |
| Consistency | Varies by annotator | Deterministic |
| Scale | Expensive to scale | Limited mainly by compute |
| Gaming | Can be manipulated | Harder to game, though not immune |
| Latency | Days to weeks | Milliseconds |
When to Use RLVR
RLVR works when you have:
- Verifiable correctness — A program can check if output is correct
- Binary or graduated signal — Clear pass/fail or quality score
- Fast feedback — Verification completes quickly
Examples (a minimal verifier sketch follows the list):
- Code generation (compiler)
- Math problems (symbolic checker)
- SQL queries (database execution)
- API compliance (schema validator)
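To make "a program can check if output is correct" concrete, here is a minimal sketch of a code-generation verifier, assuming generated Python snippets and a known expected stdout; the function name and the intermediate scores are illustrative, not part of any particular framework.

```python
import subprocess
import sys

def verify_python(code: str, expected_stdout: str, timeout: float = 5.0) -> float:
    """Hypothetical verifier: graduated reward for a generated Python snippet."""
    # Stage 1: does the snippet even parse?
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError:
        return 0.0
    # Stage 2: run it in a subprocess and compare stdout to the expected answer.
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.3  # parses but hangs
    if result.returncode != 0:
        return 0.5  # runs and crashes
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.7
```

Every check here is deterministic and finishes within seconds, which is what makes it usable as a training-time reward signal.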
RAFT: Reward-Ranked Fine-Tuning
RAFT is simpler than PPO or GRPO and, on tasks with reliable verifiers, often achieves comparable results.
The Algorithm
```text
Initialize model M from an SFT checkpoint
For each cycle:
    1. Generate N samples per prompt using M
    2. Verify all samples to get rewards
    3. Filter to samples with reward >= threshold
    4. Take the top K% of the filtered samples
    5. Fine-tune M on the selected samples
    6. Repeat with the updated M
```
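Steps 3 and 4 are the core of RAFT's selection. Below is a minimal sketch of that selection step, assuming samples arrive as (prompt, completion, reward) tuples; the helper name, the per-prompt grouping, and the default values are illustrative choices, not part of the original algorithm description.

```python
from collections import defaultdict

def select_samples(samples, threshold=0.7, top_frac=0.25):
    """Keep samples with reward >= threshold, then the top fraction per prompt."""
    by_prompt = defaultdict(list)
    for prompt, completion, reward in samples:
        if reward >= threshold:                        # step 3: hard filter
            by_prompt[prompt].append((reward, completion))

    selected = []
    for prompt, scored in by_prompt.items():
        scored.sort(key=lambda x: x[0], reverse=True)  # rank by reward
        keep = max(1, int(len(scored) * top_frac))     # step 4: top K%
        selected.extend((prompt, c) for _, c in scored[:keep])
    return selected                                    # step 5 fine-tunes on these

# Toy usage: two prompts, four samples each.
samples = [
    ("p1", "a", 0.9), ("p1", "b", 0.7), ("p1", "c", 0.2), ("p1", "d", 1.0),
    ("p2", "e", 0.5), ("p2", "f", 0.0), ("p2", "g", 0.8), ("p2", "h", 0.8),
]
print(select_samples(samples))  # [('p1', 'd'), ('p2', 'g')]
```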
Why RAFT Works
- Iterated rejection sampling — Each cycle moves the distribution toward higher rewards
- Simple optimization — Just SFT on good examples, no value networks
- Stable training — Offline SFT updates are less prone to reward hacking and mode collapse than online RL
- Memory efficient — 1x model memory vs 2-4x for PPO
Algorithm Comparison
| Method | Memory | Stability | Complexity |
|---|---|---|---|
| RAFT | 1x | Tends to be stable | Lower |
| PPO | 4x | Requires tuning | Higher |
| GRPO | 2x | Moderate | Moderate |
| DPO | 1x | Tends to be stable | Lower |
RAFT uses offline training (generate-verify-filter-train) rather than online RL updates.
Graduated Rewards
Binary rewards give a sparse learning signal:
```python
reward = 1.0 if compiles else 0.0  # Bad: no signal for "almost compiles"
```
Graduated rewards provide a denser signal:
```python
def graded_reward(compiles: bool, has_warnings: bool, runs: bool, wrong_output: bool) -> float:
    """Map verification outcomes onto a graduated reward in [0, 1]."""
    if not compiles:
        return 0.0  # Syntax error
    elif has_warnings:
        return 0.3  # Compiles with warnings: close
    elif not runs:
        return 0.5  # Compiles but crashes
    elif wrong_output:
        return 0.7  # Runs but produces incorrect output
    else:
        return 1.0  # Full reward
```
This helps training because:
- Near-misses get partial credit, as the toy example below shows
- Model learns “direction” to improve
- Faster convergence to working code
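A toy illustration with invented scores: under a binary reward, a hard prompt where nothing fully passes contributes no training data at all, while graduated rewards let near-misses survive the filter.

```python
# Four samples for one hard prompt early in training (hypothetical scores).
binary    = [0.0, 0.0, 0.0, 0.0]   # nothing fully passes, so there is no signal
graduated = [0.0, 0.3, 0.5, 0.3]   # near-misses still score above zero

threshold = 0.3
print([r for r in binary if r >= threshold])     # [] -- the prompt is dropped
print([r for r in graduated if r >= threshold])  # [0.3, 0.5, 0.3] -- still trains
```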
Curriculum Effects
RAFT exhibits natural curriculum learning:
| Early Cycles | Late Cycles |
|---|---|
| Fix syntax errors | Optimize logic |
| Add missing includes | Handle edge cases |
| Fix type mismatches | Improve structure |
The filter threshold creates an implicit curriculum (a possible schedule is sketched below):
- Cycle 1: Keep anything that compiles
- Cycle 5: Keep only high-quality samples
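One possible way to encode that schedule, sketched here with made-up values, is to ramp the filter threshold linearly from lenient to strict over the cycles; the start and end points would need tuning per task.

```python
def cycle_threshold(cycle: int, start: float = 0.3, end: float = 0.9, n_cycles: int = 5) -> float:
    """Linearly ramp the reward threshold from lenient to strict across cycles."""
    if cycle >= n_cycles:
        return end
    frac = (cycle - 1) / (n_cycles - 1)
    return start + frac * (end - start)

print([round(cycle_threshold(c), 2) for c in range(1, 6)])  # [0.3, 0.45, 0.6, 0.75, 0.9]
```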
Scaling Laws
Empirical observations (a back-of-envelope illustration follows the list):
- More samples per prompt → Better filtering, higher peak
- More prompts → More diversity, less overfitting
- More cycles → Diminishing returns after 5-6 cycles
- Higher threshold → Fewer samples, potentially better quality
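A back-of-envelope check of the first observation, assuming (hypothetically) that each sample clears the filter independently with probability p: the chance that a prompt yields at least one usable sample grows quickly with the number of samples drawn.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent samples clears the filter."""
    return 1.0 - (1.0 - p) ** n

for n in (4, 16, 64):
    print(n, round(p_at_least_one(0.05, n), 3))
# 4 0.185
# 16 0.56
# 64 0.962
```

At low pass rates, drawing more samples per prompt is what keeps enough prompts contributing data to each cycle.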
Research Foundations
Key Papers
RAFT: Reward rAnked FineTuning (Dong et al., 2023)
- Original RAFT algorithm
- Available on OpenReview
STaR: Self-Taught Reasoner (Zelikman et al., 2022)
- Related iterative self-improvement
- Available on arXiv
DeepSeek-Coder-V2 (2024)
- Uses RLVR for code models
- Scaled to 236B parameters
Qwen2.5-Coder (2024)
- Code-focused model family
- Base for halo-forge experiments
Related Work
- Rejection sampling fine-tuning (RSO)
- Constitutional AI (Anthropic)
- Self-play (AlphaGo, AlphaCode)
- Iterative distillation
Open Questions
- Cycle count — When to stop? Current heuristic: monitor for degradation
- Prompt diversity — How much variation needed to prevent overfitting?
- Reward shaping — What reward function suits each domain?
- Transfer — Do RAFT improvements transfer across tasks?
Contributions
halo-forge contributes:
- Framework — Reusable RLVR training infrastructure
- Verifiers — Pluggable verification system
- Hardware optimization — Strix Halo unified memory configuration
- Documentation — Practical guidance for RLVR training