Verification-Guided Code Generation Training

An RLVR (reinforcement learning with verifiable rewards) framework that uses compiler feedback as the reward signal for iterative model refinement. Built for AMD Strix Halo.

Version: 0.2.0
Hardware: AMD gfx1151 / 128 GB
License: Apache 2.0

Motivation

Traditional approaches to improving code generation models face fundamental limitations:

| Approach | Limitation |
| --- | --- |
| SFT only | Distribution mismatch: model outputs differ from the training data |
| RLHF | Expensive human labeling, inconsistent judgments, doesn't scale |
| Self-evaluation | Models hallucinate correctness; the signal can be gamed |

Core insight: A compiler provides a perfect reward signal — unambiguous, deterministic feedback about code correctness that cannot be gamed.

Methodology

halo-forge implements RAFT (Reward-Ranked Fine-Tuning), a simpler alternative to full RL algorithms. The approach is essentially iterated rejection sampling:

raft_algorithm
```python
for cycle in range(num_cycles):
    # 1. Generate samples
    samples = model.generate(prompts, n=8)
    # 2. Verify with compiler
    results = verifier.verify_batch(samples)
    # 3. Filter by reward threshold
    filtered = [s for s, r in zip(samples, results) if r.reward >= 0.5]
    # 4. Fine-tune on verified samples
    model.train(filtered)
    # 5. Repeat with updated model
```

Why RAFT over PPO/GRPO?

| Approach | Complexity | Memory | Stability |
| --- | --- | --- | --- |
| PPO | High | 4x model | Requires careful tuning |
| GRPO | Medium | 2x model | Better, but still tricky |
| RAFT | Low | 1x model | Simple and stable |

The memory column counts resident model copies: PPO typically keeps policy, value, reference, and reward models; GRPO drops the value model (and here a compiler verifier stands in for the reward model); RAFT fine-tunes the policy alone.

Graduated Rewards

Binary rewards (0 or 1) produce a sparse learning signal: when most generations fail outright, almost nothing clears the filter and there is little to train on. halo-forge instead uses graduated rewards to give partial credit (a minimal reward mapping is sketched after the table):

| Outcome | Reward | Signal |
| --- | --- | --- |
| Syntax error | 0.0 | Complete failure |
| Compiles with warnings | 0.3 | Partial success |
| Compiles clean | 0.5 | Compilation success |
| Runs without crash | 0.7 | Execution success |
| Correct output | 1.0 | Full verification |
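
As an illustration, the table can be implemented as a ladder that returns the highest rung a sample reaches. This is a minimal sketch; the `Outcome` fields are hypothetical stand-ins, not halo-forge's actual verifier API.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    # Hypothetical flags; halo-forge's real verifier results may differ.
    compiled: bool   # source compiled at all
    warnings: bool   # compiler emitted warnings
    ran: bool        # binary exited without crashing
    correct: bool    # output matched the expected answer

def graduated_reward(o: Outcome) -> float:
    """Map a verification outcome to the reward ladder above."""
    if not o.compiled:
        return 0.0       # syntax error: complete failure
    if o.correct:
        return 1.0       # correct output: full verification
    if o.ran:
        return 0.7       # runs without crash: execution success
    if not o.warnings:
        return 0.5       # compiles clean: compilation success
    return 0.3           # compiles with warnings: partial success
```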

Pipeline

1. Data: CodeForces, MBPP, or LLM-generated prompts
2. SFT: LoRA + BF16 baseline
3. RAFT: Generate → Verify → Filter → Train
4. Benchmark: pass@k evaluation
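
The benchmark step reports pass@k. For reference, this is normally computed with the unbiased estimator introduced with HumanEval: generate n samples per prompt, count the c that pass, and average the per-prompt estimate over all prompts.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c correct, budget k."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - C(n-c, k) / C(n, k), evaluated as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```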

Verifiers

Pluggable components that provide the reward signal:

| Verifier | Language | Notes |
| --- | --- | --- |
| GCC / Clang | C/C++ | Local Linux compilation |
| MinGW | C/C++ | Windows cross-compilation |
| MSVC (SSH) | C/C++ | Remote Windows build server |
| pytest / unittest | Python | Test-based verification |
| Custom | Any | Subprocess-based, extensible |
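
For the custom case, a subprocess-based verifier can be quite small. The sketch below assumes a `verify`/`verify_batch` interface returning objects with a `reward` field, matching the RAFT loop above; the class and result shape are illustrative, not halo-forge's documented API.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass

@dataclass
class Result:
    reward: float  # graduated reward, as in the table above
    log: str       # compiler or runtime output, useful for debugging

class GccVerifier:
    """Minimal sketch: compile a C source with gcc, then run the binary."""

    def verify(self, source: str) -> Result:
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "main.c")
            binary = os.path.join(tmp, "main")
            with open(src, "w") as f:
                f.write(source)
            compile_proc = subprocess.run(
                ["gcc", "-Wall", src, "-o", binary],
                capture_output=True, text=True,
            )
            if compile_proc.returncode != 0:
                return Result(0.0, compile_proc.stderr)  # compile error
            if compile_proc.stderr:
                return Result(0.3, compile_proc.stderr)  # compiled with warnings
            try:
                run_proc = subprocess.run(
                    [binary], capture_output=True, text=True, timeout=10,
                )
            except subprocess.TimeoutExpired:
                return Result(0.5, "timed out")          # compiled clean, hung
            if run_proc.returncode != 0:
                return Result(0.5, run_proc.stderr)      # compiled clean, crashed
            return Result(0.7, run_proc.stdout)          # ran; output check omitted

    def verify_batch(self, sources: list[str]) -> list[Result]:
        return [self.verify(s) for s in sources]
```

Checking the program's actual output against the expected answer would promote the 0.7 cases to 1.0.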

Results

Production Training

Qwen2.5-Coder-7B with 569 C/C++ prompts over 6 RAFT cycles:

| Metric | Value |
| --- | --- |
| Peak compile rate (Cycle 6) | 46.7% |
| SFT baseline compile rate | 15.2% |
| Improvement over baseline | 3x |
| pass@1 (Cycle 6) | 55.3% |

Cycle Progression

| Stage | Compile Rate | pass@1 | Notes |
| --- | --- | --- | --- |
| SFT baseline | 15.2% | 18.7% | Before RAFT |
| Cycle 1 | 28.4% | 35.2% | +87% over baseline |
| Cycle 3 | 39.7% | 48.2% | Steady gains |
| Cycle 6 | 46.7% | 55.3% | Peak; diminishing returns after |

Demo Benchmarks

Quick validation on 16 built-in prompts (validates pipeline, not RAFT effectiveness):

| Model | Baseline | After 2 Cycles | Time |
| --- | --- | --- | --- |
| Qwen2.5-Coder-0.5B | 32.0% | 32.0% | 41 min |
| Qwen2.5-Coder-1.5B | 67.2% | 67.2% | 52 min |
| Qwen2.5-Coder-3B | 97.7% | 99.2% | 79 min |

HumanEval Validation

3B model RAFT training with pytest verification (20 prompts, 3 cycles):

| Cycle | Pass Rate | Loss | Grad Norm |
| --- | --- | --- | --- |
| 1 | 21.2% | 0.563 | 995.3 |
| 2 | 17.5% | 0.570 | 194.0 |
| 3 | 22.5% | 0.526 | 0.47 |

Key finding: Gradient norm stabilization (995 → 0.47) indicates proper convergence.

Installation

terminal
```bash
# Build the toolbox (ROCm 7 + PyTorch nightly)
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox && ./build.sh

# Enter the container
toolbox enter halo-forge

# Validate installation
halo-forge test --level smoke     # Quick check, no GPU
halo-forge test --level standard  # Full validation

# Run demo benchmark
halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2
```
