Verification-Guided Code Generation Training
An RLVR (reinforcement learning with verifiable rewards) framework that uses compiler feedback as the reward signal for iterative model refinement. Built for AMD Strix Halo.
Motivation
Traditional approaches to improving code generation models face fundamental limitations:
| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments, doesn't scale |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |
Core insight: A compiler provides a perfect reward signal — unambiguous, deterministic feedback about code correctness that cannot be gamed.
Methodology
halo-forge implements RAFT (Reward-Ranked Fine-Tuning), a simpler alternative to full RL algorithms. The approach is essentially iterated rejection sampling: generate several candidates per prompt, score them with a verifier, keep only the highest-reward completions, fine-tune on those, and repeat.
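A minimal sketch of one such cycle, assuming hypothetical `generate` and `reward` callables (halo-forge's actual API may differ):

```python
from typing import Callable, Sequence


def raft_cycle(
    prompts: Sequence[str],
    generate: Callable[[str, int], list[str]],  # hypothetical: k completions per prompt
    reward: Callable[[str, str], float],        # hypothetical: verifier score in [0, 1]
    samples_per_prompt: int = 8,
    keep_fraction: float = 0.25,
) -> list[tuple[str, str]]:
    """One RAFT cycle: sample, rank by verifier reward, keep the top slice.

    The returned (prompt, completion) pairs become the fine-tuning set for the
    next cycle; repeating this loop is iterated rejection sampling.
    """
    kept: list[tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        scored = sorted(
            ((reward(prompt, c), c) for c in candidates),
            key=lambda rc: rc[0],
            reverse=True,
        )
        n_keep = max(1, int(len(scored) * keep_fraction))
        # Drop zero-reward completions so outright failures never enter the training set.
        kept.extend((prompt, c) for r, c in scored[:n_keep] if r > 0.0)
    return kept
```

Because each cycle fine-tunes only on verifier-approved samples, the update step is plain supervised fine-tuning, which is what keeps the memory and stability profile in the comparison below simple.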
Why RAFT over PPO/GRPO?
| Approach | Complexity | Memory | Stability |
|---|---|---|---|
| PPO | High | 4x model | Requires careful tuning |
| GRPO | Medium | 2x model | Better but still tricky |
| RAFT | Low | 1x model | Simple and stable |
Graduated Rewards
Binary rewards (0 or 1) produce a sparse learning signal: early in training most candidates score zero, leaving little to rank. halo-forge uses graduated rewards to provide partial credit:
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Complete failure |
| Compiles with warnings | 0.3 | Partial success |
| Compiles clean | 0.5 | Compilation success |
| Runs without crash | 0.7 | Execution success |
| Correct output | 1.0 | Full verification |
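As a concrete illustration of the scale above, here is a minimal gcc-based scorer. This is a sketch, not halo-forge's actual verifier: the compiler flags, the warnings-on-stderr check, and the exact-output comparison are all assumptions.

```python
import subprocess
import tempfile
from pathlib import Path


def graduated_reward(source: str, expected_output: str, timeout: float = 10.0) -> float:
    """Map compile/run outcomes for a C candidate onto the graduated reward scale."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        binary = Path(tmp) / "candidate"
        src.write_text(source)

        try:
            compiled = subprocess.run(
                ["gcc", "-Wall", "-O2", str(src), "-o", str(binary)],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0                                       # compiler hung: treat as failure
        if compiled.returncode != 0:
            return 0.0                                       # syntax / compile error
        reward = 0.3 if compiled.stderr.strip() else 0.5     # warnings vs. clean compile

        try:
            ran = subprocess.run(
                [str(binary)], capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return reward                                    # no execution credit for a hang
        if ran.returncode != 0:
            return reward                                    # crashed at runtime
        reward = 0.7                                         # ran without crashing
        if ran.stdout.strip() == expected_output.strip():
            reward = 1.0                                     # output matches: full verification
        return reward
```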
Pipeline
Verifiers
Pluggable components that provide the reward signal:
| Verifier | Language | Notes |
|---|---|---|
| GCC / Clang | C/C++ | Local Linux compilation |
| MinGW | C/C++ | Windows cross-compilation |
| MSVC (SSH) | C/C++ | Remote Windows build server |
| pytest / unittest | Python | Test-based verification |
| Custom | Any | Subprocess-based, extensible |
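The Custom row is the extension point: any check that can run as a subprocess and signal success through its exit code can serve as a verifier. Below is a minimal sketch with a hypothetical `score()` interface; halo-forge's real plugin API may differ.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CommandVerifier:
    """Run an external check command against a candidate file.

    Hypothetical interface: score() returns 1.0 when the command exits 0,
    otherwise 0.0. A pytest verifier is then just
    CommandVerifier(command=["pytest", "-q"]).
    """
    command: list[str]
    timeout: float = 60.0

    def score(self, candidate_path: str) -> float:
        try:
            result = subprocess.run(
                self.command + [candidate_path],
                capture_output=True,
                timeout=self.timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0                      # treat hangs as failures
        return 1.0 if result.returncode == 0 else 0.0
```

A graduated C/C++ verifier would replace the single exit-code check with the compile-and-run scoring sketched under Graduated Rewards.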
Results
Production Training
Qwen2.5-Coder-7B trained on 569 C/C++ prompts over 6 RAFT cycles:
Cycle Progression
| Stage | Compile Rate | pass@1 | Notes |
|---|---|---|---|
| SFT Baseline | 15.2% | 18.7% | Before RAFT |
| Cycle 1 | 28.4% | 35.2% | +87% improvement |
| Cycle 3 | 39.7% | 48.2% | Steady gains |
| Cycle 6 | 46.7% | 55.3% | Peak (diminishing returns after) |
Demo Benchmarks
A quick run on 16 built-in prompts; this exercises the pipeline end to end rather than measuring RAFT effectiveness:
| Model | Baseline | After 2 Cycles | Time |
|---|---|---|---|
| Qwen2.5-Coder-0.5B | 32.0% | 32.0% | 41 min |
| Qwen2.5-Coder-1.5B | 67.2% | 67.2% | 52 min |
| Qwen2.5-Coder-3B | 97.7% | 99.2% | 79 min |
HumanEval Validation
3B model RAFT training with pytest verification (20 prompts, 3 cycles):
| Cycle | Pass Rate | Loss | Grad Norm |
|---|---|---|---|
| 1 | 21.2% | 0.563 | 995.3 |
| 2 | 17.5% | 0.570 | 194.0 |
| 3 | 22.5% | 0.526 | 0.47 |
Key finding: Gradient norm stabilization (995 → 0.47) indicates proper convergence.
Hardware Observations
- BF16 is optimal — 4-bit quantization is actually slower due to dequantization overhead
- GPU utilization reaches 95-99% during training
- Generation speed: 0.5B ~220 tok/s, 3B ~130 tok/s, 7B ~80 tok/s
Limitations
- Compilation ≠ correctness — Code may compile but produce wrong output
- Diminishing returns — Performance plateaus after 5-6 cycles
- Domain-specific — Results shown are for C/C++; other languages may differ
- Hardware-specific — Tuned for Strix Halo; other configs may need adjustment
Documentation
Installation
References
- RAFT: Reward rAnked FineTuning — Dong et al., 2023
- STaR: Self-Taught Reasoner — Zelikman et al., 2022
- malagent — RLVR applied to security research