# Documentation

Complete documentation for the halo-forge RLVR training framework.
## What is halo-forge?
halo-forge is an RLVR (Reinforcement Learning from Verifiable Rewards) framework that uses compiler feedback as reward signals for iterative model refinement.
### The Problem
| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |
### The Solution
A compiler provides a perfect reward signal — unambiguous, deterministic feedback about code correctness that cannot be gamed.
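As a minimal sketch of what such a reward can look like (not halo-forge's actual verifier API; the function name, compiler choice, and flags below are assumptions), a binary compile reward maps the compiler's exit status to a score:

```python
import subprocess
import tempfile
from pathlib import Path

def compile_reward(code: str, compiler: str = "gcc", std: str = "c11") -> float:
    """Hypothetical binary compile reward: 1.0 if the snippet compiles, else 0.0.

    halo-forge's real compile verifiers are richer (graduated rewards,
    multiple toolchains); this only illustrates the verifiable-reward idea.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        src.write_text(code)
        try:
            result = subprocess.run(
                [compiler, f"-std={std}", "-c", str(src), "-o", str(Path(tmp) / "candidate.o")],
                capture_output=True,
                timeout=30,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    # Exit status 0 means the translation unit compiled cleanly.
    return 1.0 if result.returncode == 0 else 0.0
```

Because the signal comes from the toolchain itself, identical code always receives an identical reward, which is what makes it verifiable.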
## Architecture

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Data   │ →  │   SFT    │ →  │   RAFT   │ →  │ Benchmark │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
```
- Data — Gather training examples from public datasets or LLM generation
- SFT — Supervised fine-tuning to establish baseline capability
- RAFT — Iterative verification loop: generate → verify → filter → train (sketched below)
- Benchmark — Evaluate with pass@k metrics
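One RAFT cycle, sketched with hypothetical helper names (`model.generate`, `verify`, `fine_tune`) standing in for the framework's actual components:

```python
def raft_cycle(model, prompts, verify, fine_tune, k=8, threshold=1.0):
    """Illustrative sketch of one generate → verify → filter → train cycle."""
    kept = []
    for prompt in prompts:
        # Generate: sample k candidate completions per prompt.
        candidates = [model.generate(prompt) for _ in range(k)]
        # Verify: score each candidate with the compiler-backed verifier.
        scored = [(cand, verify(cand)) for cand in candidates]
        # Filter: keep only candidates whose reward clears the threshold.
        kept.extend((prompt, cand) for cand, reward in scored if reward >= threshold)
    # Train: supervised fine-tuning on the surviving (prompt, completion) pairs.
    return fine_tune(model, kept)
```

Each subsequent cycle starts from the newly fine-tuned model, which is why compile rate and pass@1 climb across cycles in the results below.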
## Results
Production training on Qwen2.5-Coder-7B with 569 C/C++ prompts:
| Stage | Compile Rate | pass@1 |
|---|---|---|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 (Peak) | 46.7% | 55.3% |
Roughly a 3x improvement in both compile rate (15.2% → 46.7%) and pass@1 (18.7% → 55.3%) over 6 RAFT cycles.
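pass@k is conventionally computed with the unbiased estimator from the HumanEval evaluation: with n samples per prompt, of which c pass verification, pass@k = 1 - C(n-c, k) / C(n, k). A small reference implementation follows; whether halo-forge's benchmark harness uses exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (of which c are correct) passes verification."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples per prompt, 3 verified correct.
print(pass_at_k(n=8, c=3, k=1))  # 0.375
```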
## Quick Navigation
### Getting Started
- Quick Start — Get running in 30 minutes
- Toolbox Setup — Build the container environment
- Hardware Notes — Strix Halo configuration
### Training Pipeline
- Full Pipeline — Complete training workflow
- Data Generation — Prepare training data
- SFT Training — Supervised fine-tuning
- RAFT Training — Reward-ranked fine-tuning
- Benchmarking — Evaluate with pass@k
### Verifiers
- Verifier Overview — Choose your verification strategy
- Compile Verifiers — GCC, Clang, MinGW, MSVC
- Test Verifiers — pytest, unittest
- Custom Verifiers — Build your own
### Reference
- Configuration — Complete config reference
- CLI Reference — Command-line interface
- Troubleshooting — Common issues
### Background
- Theory & Research — Research foundations
- RAFT vs PPO vs GRPO — Algorithm comparison
- Graduated Rewards — Partial credit system
- Learning Rate Strategies — LR recommendations
### Meta
- Changelog — Version history
- Contributing — How to contribute