
# Verification-Guided Code Generation Training

An RLVR (reinforcement learning with verifiable rewards) framework using compiler feedback as reward signals for iterative model refinement. Built for AMD Strix Halo.
## Why This Exists
An open-source experiment in local model training on AMD's new unified memory architecture.
- **128GB Unified Memory**: AMD Strix Halo can fit 7B+ models entirely in memory, no quantization needed
- **Compiler as Teacher**: Use GCC/pytest as a free, deterministic reward signal, no human labeling required
- **Iterate Locally**: Run full RLVR training cycles on your own hardware, no cloud dependencies
## The RAFT Approach
halo-forge implements RAFT (Reward-Ranked Fine-Tuning) — essentially iterated rejection sampling:
```python
for cycle in range(num_cycles):
    # 1. Generate candidate samples
    samples = model.generate(prompts, n=8)

    # 2. Verify with the compiler / test suite
    results = verifier.verify_batch(samples)

    # 3. Filter by reward threshold
    filtered = [s for s, r in zip(samples, results)
                if r.reward >= 0.5]

    # 4. Fine-tune on verified samples
    model.train(filtered)
```

This is simpler than PPO/GRPO (1x model memory vs 2-4x), stable to train, and produces comparable results.
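The loop assumes a verifier that returns one result per sample with a scalar reward. A minimal sketch of that contract is below; the `VerifyResult` and `Verifier` names are illustrative, not halo-forge's actual API:

```python
from dataclasses import dataclass

@dataclass
class VerifyResult:
    reward: float     # graduated score in [0.0, 1.0] (see the reward table below)
    detail: str = ""  # e.g. compiler or test diagnostics, useful for debugging

class Verifier:
    """Minimal interface the RAFT loop above assumes."""

    def verify(self, sample: str) -> VerifyResult:
        raise NotImplementedError

    def verify_batch(self, samples: list[str]) -> list[VerifyResult]:
        # Sequential for clarity; a real verifier would batch or parallelize.
        return [self.verify(s) for s in samples]
```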
## How It Works
RAFT iteratively improves code generation through verified feedback:
| Cycle | What Happens | Expected Outcome |
|---|---|---|
| 1-2 | Model learns basic patterns | Largest gains |
| 3-4 | Refinement continues | Moderate gains |
| 5-6 | Gains approach a plateau | Monitor for stopping |
Results vary by model, dataset, and domain. Run your own benchmarks to measure improvement.
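One way to act on "monitor for stopping" is to track the verified pass rate after each cycle and stop once the cycle-over-cycle gain stalls. A minimal sketch, where the 0.01 threshold and the sample history are illustrative assumptions:

```python
def should_stop(pass_rates: list[float], min_gain: float = 0.01) -> bool:
    """Stop once the cycle-over-cycle gain in verified pass rate stalls."""
    if len(pass_rates) < 2:
        return False
    return pass_rates[-1] - pass_rates[-2] < min_gain

# Example: gains shrink across cycles, triggering a stop after cycle 5.
history = [0.31, 0.48, 0.55, 0.58, 0.585]
print(should_stop(history))  # True: the last gain was 0.005 < 0.01
```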
## Quick Start
```bash
# Build the toolbox (ROCm 7 + PyTorch nightly)
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox && ./build.sh

# Create and enter the container
toolbox create halo-forge --image localhost/halo-forge:latest
toolbox enter halo-forge

# Validate installation
halo-forge test --level smoke     # Quick check, no GPU
halo-forge test --level standard  # Full validation

# Run demo benchmark
halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2
```

## Graduated Rewards
Binary rewards create sparse gradients. halo-forge uses graduated rewards:
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Completely wrong |
| Compiles with warnings | 0.3 | Close but imperfect |
| Compiles clean | 0.5 | Correct syntax |
| Runs without crash | 0.7 | Executable |
| Correct output | 1.0 | Fully correct |
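For intuition, here is a hedged sketch of how such a scheme could be implemented for C sources using gcc via subprocess. It mirrors the table above but is not halo-forge's actual verifier, and `expected_output` is a hypothetical parameter:

```python
import os
import subprocess
import tempfile

def graduated_reward(source: str, expected_output: str) -> float:
    """Map compile/run outcomes for a C sample to a graduated reward."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.c")
        binary = os.path.join(tmp, "sample")
        with open(src, "w") as f:
            f.write(source)

        # Compile with warnings enabled so clean builds are distinguishable.
        build = subprocess.run(
            ["gcc", "-Wall", "-o", binary, src],
            capture_output=True, text=True,
        )
        if build.returncode != 0:
            return 0.0                    # syntax / compile error
        if build.stderr.strip():
            return 0.3                    # compiles, but with warnings

        # Run the binary with a timeout to guard against infinite loops.
        try:
            run = subprocess.run(
                [binary], capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return 0.5                    # compiled clean, but never finished
        if run.returncode != 0:
            return 0.5                    # compiled clean, crashed at runtime
        if run.stdout.strip() == expected_output.strip():
            return 1.0                    # fully correct
        return 0.7                        # runs without crashing, wrong output
```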
## Related Projects

- **malagent**: applies RLVR to security research (EDR evasion with Elastic Security as the verifier)