Verification-Guided Code Generation Training
An RLVR (reinforcement learning with verifiable rewards) framework that uses compiler feedback as the reward signal for iterative model refinement. Built for AMD Strix Halo.
Motivation
Traditional approaches to improving code generation models face fundamental limitations:
| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments, doesn't scale |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |
Core insight: A compiler provides a perfect reward signal — unambiguous, deterministic feedback about code correctness that cannot be gamed.
Methodology
halo-forge implements RAFT (Reward-Ranked Fine-Tuning), a simpler alternative to full RL algorithms. The approach is essentially iterated rejection sampling: generate several candidates per prompt, score them with a verifier, keep only the highest-reward completions, fine-tune on those, and repeat.
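A minimal sketch of one such cycle, assuming hypothetical `generate` and `reward` callables (halo-forge's actual API may differ):

```python
from typing import Callable, Sequence


def raft_cycle(
    prompts: Sequence[str],
    generate: Callable[[str, int], list[str]],  # hypothetical: k completions per prompt
    reward: Callable[[str, str], float],        # hypothetical: verifier score in [0, 1]
    samples_per_prompt: int = 8,
    keep_fraction: float = 0.25,
) -> list[tuple[str, str]]:
    """One RAFT cycle: sample, rank by verifier reward, keep the top slice.

    The returned (prompt, completion) pairs become the fine-tuning set for the
    next cycle; repeating this loop is iterated rejection sampling.
    """
    kept: list[tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        scored = sorted(
            ((reward(prompt, c), c) for c in candidates),
            key=lambda rc: rc[0],
            reverse=True,
        )
        n_keep = max(1, int(len(scored) * keep_fraction))
        # Drop zero-reward completions so outright failures never enter the training set.
        kept.extend((prompt, c) for r, c in scored[:n_keep] if r > 0.0)
    return kept
```

Because each cycle fine-tunes only on verifier-approved samples, the update step is plain supervised fine-tuning, which is what keeps the memory and stability profile in the comparison below simple.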
Why RAFT over PPO/GRPO?
| Approach | Complexity | Memory | Stability |
|---|---|---|---|
| PPO | High | 4x model | Requires careful tuning |
| GRPO | Medium | 2x model | Better but still tricky |
| RAFT | Low | 1x model | Simple and stable |
Graduated Rewards
Binary rewards (0 or 1) produce a sparse learning signal: early in training most candidates score zero, leaving little to rank. halo-forge uses graduated rewards to provide partial credit:
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Complete failure |
| Compiles with warnings | 0.3 | Partial success |
| Compiles clean | 0.5 | Compilation success |
| Runs without crash | 0.7 | Execution success |
| Correct output | 1.0 | Full verification |
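As a concrete illustration of the scale above, here is a minimal gcc-based scorer. This is a sketch, not halo-forge's actual verifier: the compiler flags, the warnings-on-stderr check, and the exact-output comparison are all assumptions.

```python
import subprocess
import tempfile
from pathlib import Path


def graduated_reward(source: str, expected_output: str, timeout: float = 10.0) -> float:
    """Map compile/run outcomes for a C candidate onto the graduated reward scale."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        binary = Path(tmp) / "candidate"
        src.write_text(source)

        try:
            compiled = subprocess.run(
                ["gcc", "-Wall", "-O2", str(src), "-o", str(binary)],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0                                       # compiler hung: treat as failure
        if compiled.returncode != 0:
            return 0.0                                       # syntax / compile error
        reward = 0.3 if compiled.stderr.strip() else 0.5     # warnings vs. clean compile

        try:
            ran = subprocess.run(
                [str(binary)], capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return reward                                    # no execution credit for a hang
        if ran.returncode != 0:
            return reward                                    # crashed at runtime
        reward = 0.7                                         # ran without crashing
        if ran.stdout.strip() == expected_output.strip():
            reward = 1.0                                     # output matches: full verification
        return reward
```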
Pipeline
Verifiers
Pluggable components that provide the reward signal:
| Verifier | Language | Notes |
|---|---|---|
| GCC / Clang | C/C++ | Local Linux compilation |
| MinGW | C/C++ | Windows cross-compilation |
| MSVC (SSH) | C/C++ | Remote Windows build server |
| pytest / unittest | Python | Test-based verification |
| Custom | Any | Subprocess-based, extensible |
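The Custom row is the extension point: any check that can run as a subprocess and signal success through its exit code can serve as a verifier. Below is a minimal sketch with a hypothetical `score()` interface; halo-forge's real plugin API may differ.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CommandVerifier:
    """Run an external check command against a candidate file.

    Hypothetical interface: score() returns 1.0 when the command exits 0,
    otherwise 0.0. A pytest verifier is then just
    CommandVerifier(command=["pytest", "-q"]).
    """
    command: list[str]
    timeout: float = 60.0

    def score(self, candidate_path: str) -> float:
        try:
            result = subprocess.run(
                self.command + [candidate_path],
                capture_output=True,
                timeout=self.timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0                      # treat hangs as failures
        return 1.0 if result.returncode == 0 else 0.0
```

A graduated C/C++ verifier would replace the single exit-code check with the compile-and-run scoring sketched under Graduated Rewards.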
Results
Production Training
Qwen2.5-Coder-7B trained on 569 C/C++ prompts over 6 RAFT cycles:
Cycle Progression
| Stage | Compile Rate | pass@1 | Notes |
|---|---|---|---|
| SFT Baseline | 15.2% | 18.7% | Before RAFT |
| Cycle 1 | 28.4% | 35.2% | +87% improvement |
| Cycle 3 | 39.7% | 48.2% | Steady gains |
| Cycle 6 | 46.7% | 55.3% | Peak (diminishing returns after) |
Demo Benchmarks
A quick run on 16 built-in prompts; this exercises the pipeline end to end rather than measuring RAFT effectiveness:
| Model | Baseline | After 2 Cycles | Time |
|---|---|---|---|
| Qwen2.5-Coder-0.5B | 32.0% | 32.0% | 41 min |
| Qwen2.5-Coder-1.5B | 67.2% | 67.2% | 52 min |
| Qwen2.5-Coder-3B | 97.7% | 99.2% | 79 min |
HumanEval Validation
3B model RAFT training with pytest verification (20 prompts, 3 cycles):
| Cycle | Pass Rate | Loss | Grad Norm |
|---|---|---|---|
| 1 | 21.2% | 0.563 | 995.3 |
| 2 | 17.5% | 0.570 | 194.0 |
| 3 | 22.5% | 0.526 | 0.47 |
Key finding: Gradient norm stabilization (995 → 0.47) indicates proper convergence.
Hardware Observations
- BF16 is optimal — 4-bit quantization is actually slower due to dequantization overhead
- GPU utilization reaches 95-99% during training
- Generation speed: 0.5B ~220 tok/s, 3B ~130 tok/s, 7B ~80 tok/s
Limitations
- Compilation ≠ correctness — Code may compile but produce wrong output
- Diminishing returns — Performance plateaus after 5-6 cycles
- Domain-specific — Results shown are for C/C++; other languages may differ
- Hardware-specific — Tuned for Strix Halo; other configs may need adjustment
Documentation
Installation
References
- RAFT: Reward rAnked FineTuning — Dong et al., 2023
- STaR: Self-Taught Reasoner — Zelikman et al., 2022
- malagent — RLVR applied to security research