
Verification-Guided Code Generation Training
An RLVR (reinforcement learning with verifiable rewards) framework that uses compiler feedback as the reward signal for iterative model refinement. Built for AMD Strix Halo.
Why This Exists
An open-source experiment in local model training on AMD's new unified memory architecture.
128GB Unified Memory
AMD Strix Halo can fit 7B+ models entirely in memory — no quantization needed
Compiler as Teacher
Use GCC/pytest as a free, deterministic reward signal — no human labeling required
Iterate Locally
Run full RLVR training cycles on your own hardware, no cloud dependencies
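As a rough illustration of the compiler-as-teacher idea, the sketch below turns a compiler's exit status into a binary reward. The helper `compile_reward` and its `cmd` parameter are hypothetical names for this example, not part of halo-forge. It assumes `gcc` is on the PATH and uses `-fsyntax-only`, so nothing is linked or executed.

```python
import subprocess
from typing import Sequence

# Hypothetical helper, not halo-forge's API.
# Default command: ask GCC to syntax-check C source read from stdin.
DEFAULT_CMD = ("gcc", "-fsyntax-only", "-x", "c", "-")


def compile_reward(source: str,
                   cmd: Sequence[str] = DEFAULT_CMD,
                   timeout: float = 10.0) -> float:
    """Return 1.0 if the checker command exits with status 0, else 0.0."""
    try:
        proc = subprocess.run(
            list(cmd),
            input=source,          # source code is piped to stdin
            capture_output=True,   # keep compiler diagnostics off the console
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # In this sketch a missing compiler or a hang counts as failure.
        return 0.0
    return 1.0 if proc.returncode == 0 else 0.0
```

On a machine with GCC installed, `compile_reward("int main(void) { return 0; }")` should score a valid program as 1.0. No human label is involved at any point.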
The RAFT Approach
halo-forge implements RAFT (Reward-Ranked Fine-Tuning) — essentially iterated rejection sampling:

    for cycle in range(num_cycles):
        # 1. Generate samples
        samples = model.generate(prompts, n=8)
        # 2. Verify with compiler
        results = verifier.verify_batch(samples)
        # 3. Filter by reward threshold
        filtered = [s for s, r in zip(samples, results)
                    if r.reward >= 0.5]
        # 4. Fine-tune on verified samples
        model.train(filtered)

This is simpler than PPO/GRPO (1x model memory vs 2-4x), stable to train, and produces comparable results.
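To make the filtering step concrete, here is a self-contained toy version that runs without a model or compiler. `Result`, `toy_verify`, and `raft_filter` are stand-ins invented for this sketch. A real run would generate samples from the model and score them with a compiler-backed verifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    reward: float  # score in [0.0, 1.0] assigned by the verifier


def toy_verify(sample: str) -> Result:
    # Stand-in verifier: pretend that balanced braces mean "compiles clean".
    ok = "{" in sample and sample.count("{") == sample.count("}")
    return Result(reward=0.5 if ok else 0.0)


def raft_filter(samples: List[str],
                verify: Callable[[str], Result],
                threshold: float = 0.5) -> List[str]:
    # Rejection sampling: keep only samples whose reward meets the threshold.
    results = [verify(s) for s in samples]
    return [s for s, r in zip(samples, results) if r.reward >= threshold]


samples = [
    "int main(void) { return 0; }",
    "int main(void) { return 0;",  # missing closing brace
    "void f(void) {}",
]
kept = raft_filter(samples, toy_verify)
print(f"kept {len(kept)} of {len(samples)} samples")
```

Only the kept samples would be passed to the fine-tuning step, and the next cycle would sample from the updated model.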
Results
Production training on Qwen2.5-Coder-7B with 569 C/C++ prompts:
| Stage | Compile Rate | pass@1 |
|---|---|---|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 (Peak) | 46.7% | 55.3% |
Quick Start

    # Build the toolbox (ROCm 7 + PyTorch nightly)
    git clone https://github.com/professor-moody/halo-forge.git
    cd halo-forge/toolbox && ./build.sh

    # Create and enter the container
    toolbox create halo-forge --image localhost/halo-forge:latest
    toolbox enter halo-forge

    # Validate installation
    halo-forge test --level smoke     # Quick check, no GPU
    halo-forge test --level standard  # Full validation

    # Run demo benchmark
    halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2

Graduated Rewards
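Binary pass/fail rewards give a sparse training signal: a sample that almost compiles scores the same as one that is completely wrong. halo-forge instead uses graduated rewards: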
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Completely wrong |
| Compiles with warnings | 0.3 | Close but imperfect |
| Compiles clean | 0.5 | Correct syntax |
| Runs without crash | 0.7 | Executable |
| Correct output | 1.0 | Fully correct |
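The table can be read as a simple mapping from verification outcome to score. The sketch below is one possible encoding, not halo-forge's actual implementation: the `Outcome` fields and `graduated_reward` are hypothetical names, and it assumes the furthest stage a sample reaches determines its reward (for example, a program that runs without crashing scores 0.7 even if it compiled with warnings).

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    compiled: bool          # False means a syntax/compile error
    warnings: bool = False  # compiler emitted warnings
    ran: bool = False       # executable ran without crashing
    correct: bool = False   # output matched the expected output


def graduated_reward(o: Outcome) -> float:
    """Map an outcome to the reward tiers from the table above."""
    if not o.compiled:
        return 0.0   # syntax error
    if o.correct:
        return 1.0   # correct output
    if o.ran:
        return 0.7   # runs without crash
    # Compiled but was not run successfully: split on warnings.
    return 0.3 if o.warnings else 0.5
```

With the 0.5 filter threshold used in the RAFT loop, samples that compile clean or better are kept for fine-tuning, while samples that compile only with warnings (0.3) are rejected.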
Related Projects
malagent — Applies RLVR to security research (EDR evasion with Elastic Security as verifier)