Documentation
Complete documentation for the halo-forge RLVR training framework
What is halo-forge?
halo-forge is an RLVR (Reinforcement Learning from Verifiable Rewards) framework that uses compiler feedback as reward signals for iterative model refinement.
The Problem
| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |
The Approach
A compiler provides deterministic feedback — objective, reproducible results about code correctness.
Architecture
```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌───────────┐
│   Data   │ → │   SFT    │ → │   RAFT   │ → │ Benchmark │
└──────────┘   └──────────┘   └──────────┘   └───────────┘
```
- Data — Gather training examples from public datasets or LLM generation
- SFT — Supervised fine-tuning to establish baseline capability
- RAFT — Iterative verification loop: generate → verify → filter → train
- Benchmark — Evaluate with pass@k metrics
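The RAFT stage's generate → verify → filter → train loop can be sketched as follows. All names here are hypothetical placeholders, not halo-forge's real interface; `generate` stands in for model sampling and `verify` for a verifier such as a compiler check.

```python
from typing import Callable

def raft_cycle(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # sample k completions per prompt
    verify: Callable[[str], float],             # e.g. a compiler-based reward
    threshold: float = 1.0,
    k: int = 8,
) -> list[tuple[str, str]]:
    """One RAFT cycle: sample candidates, keep only those that pass
    verification, and return (prompt, completion) pairs for the next
    fine-tuning round. Illustrative sketch only."""
    kept = []
    for prompt in prompts:
        for candidate in generate(prompt, k):
            if verify(candidate) >= threshold:
                kept.append((prompt, candidate))
    return kept
```

Each cycle fine-tunes on the filtered pairs, so later cycles sample from a model that already produces more verifiable code, which is why the table below shows compounding gains.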
Example Results
Results from test runs on Qwen2.5-Coder-7B with 569 C/C++ prompts. Your results will vary based on model, dataset, hardware, and configuration.
| Stage | Compile Rate | pass@1 |
|---|---|---|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 | 46.7% | 55.3% |
In this run, both compile rate and pass@1 roughly tripled over six RAFT cycles.
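pass@k in the table above is the standard unbiased estimator from Chen et al. (2021): given n samples per problem of which c pass verification, it estimates the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=10 samples of which c=5 compile, pass@1 is 0.5: a single random draw passes half the time.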
Quick Navigation
Getting Started
- Quick Start — Get running in 30 minutes
- Toolbox Setup — Build the container environment
- Hardware Notes — Strix Halo configuration
Training Pipeline
- How to Train — Complete step-by-step guide (start here!)
- Full Pipeline — Complete training workflow
- Data Generation — Prepare training data
- SFT Training — Supervised fine-tuning
- RAFT Training — Reward-ranked fine-tuning
- Benchmarking — Evaluate with pass@k
Verifiers
- Verifier Overview — Choose your verification strategy
- Compile Verifiers — GCC, Clang, MinGW, MSVC
- Test Verifiers — pytest, unittest
- Custom Verifiers — Build your own
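A custom verifier plugs a new pass/fail (or graduated) signal into the RAFT loop. The shape below is a hypothetical sketch, assuming a single `verify` entry point; see the Custom Verifiers guide for halo-forge's actual interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class VerifyResult:
    passed: bool
    reward: float
    detail: str = ""

class Verifier(ABC):
    """Hypothetical pluggable-verifier shape, not the real API."""

    @abstractmethod
    def verify(self, code: str) -> VerifyResult: ...

class LengthVerifier(Verifier):
    """Toy example: reject empty or overlong completions."""

    def verify(self, code: str) -> VerifyResult:
        ok = 0 < len(code) <= 4000
        return VerifyResult(passed=ok, reward=1.0 if ok else 0.0,
                            detail="" if ok else "length out of range")
```

Returning a float reward rather than a bare boolean leaves room for partial credit, which is the idea behind the Graduated Rewards system.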
Reference
- Configuration — Complete config reference
- CLI Reference — Command-line interface
- Troubleshooting — Common issues
Background
- Theory & Research — Research foundations
- Graduated Rewards — Partial credit system
- Learning Rate Strategies — LR recommendations
Meta
- Changelog — Version history
- Contributing — How to contribute
Additional Guides
- Windows Build Server — Configure a Windows machine for MSVC verification
- Production Training Runs — Step-by-step commands for training all model sizes on the Windows Systems Programming dataset