# Verification-Guided Code Generation Training

An RLVR (reinforcement learning with verifiable rewards) framework that uses compiler feedback as the reward signal for iterative model refinement. Built for AMD Strix Halo.

**Version:** 0.2.0 · **Hardware:** AMD gfx1151 · **License:** Apache 2.0

## Why This Exists

An open-source experiment in local model training on AMD's new unified memory architecture.

- **128GB Unified Memory**: AMD Strix Halo can fit 7B+ models entirely in memory, no quantization needed (see the back-of-envelope check after this list)
- **Compiler as Teacher**: GCC and pytest serve as a free, deterministic reward signal, no human labeling required
- **Iterate Locally**: run full RLVR training cycles on your own hardware, no cloud dependencies
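As a rough sanity check on the memory claim, here is the arithmetic (my assumptions, not figures from the halo-forge docs): bf16 weights for a ~7.6B-parameter model take about 15 GB, and even a full mixed-precision AdamW fine-tune stays under the 128 GB ceiling.

```python
# Back-of-envelope memory check for "7B fits in 128GB".
# Parameter count and byte accounting are illustrative assumptions.
params = 7.6e9                # ~Qwen2.5-Coder-7B parameter count
GB = 1e9

bf16_weights = params * 2     # 2 bytes per bf16 parameter
# Mixed-precision AdamW: bf16 weights + bf16 grads + fp32 master weights
# + fp32 first and second moments = 2 + 2 + 4 + 4 + 4 = 16 bytes/param
adamw_full_finetune = params * 16

print(f"bf16 weights:         {bf16_weights / GB:.0f} GB")         # ~15 GB
print(f"full AdamW fine-tune: {adamw_full_finetune / GB:.0f} GB")  # ~122 GB
# Activations and KV cache come on top, so gradient checkpointing or
# LoRA provides the practical headroom -- but the weights fit easily.
```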

The experiment: Can we meaningfully improve code generation models using only local compute and compiler feedback? Early results suggest yes.

## The RAFT Approach

halo-forge implements RAFT (Reward-Ranked Fine-Tuning) — essentially iterated rejection sampling:

```python
for cycle in range(num_cycles):
    # 1. Generate samples
    samples = model.generate(prompts, n=8)

    # 2. Verify with compiler
    results = verifier.verify_batch(samples)

    # 3. Filter by reward threshold
    filtered = [s for s, r in zip(samples, results)
                if r.reward >= 0.5]

    # 4. Fine-tune on verified samples
    model.train(filtered)
```

This is simpler than PPO or GRPO (1x model memory versus the 2-4x needed when a reference, critic, or reward model is held alongside the policy), stable to train, and produces comparable results.
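RAFT variants also differ in how they select the fine-tuning set. The loop above keeps every sample at or above a fixed threshold; another common variant keeps only the single highest-reward sample per prompt. A minimal sketch of that variant follows (the function and selection policy are illustrative, not halo-forge's actual API):

```python
# Best-of-n selection per prompt: a common RAFT variant. Whether
# halo-forge uses threshold filtering, best-of-n, or both is an
# assumption here; the loop above shows the threshold form.
def select_best_per_prompt(prompts, samples, rewards, threshold=0.5):
    """samples[i] and rewards[i] hold the n candidates for prompts[i]."""
    selected = []
    for prompt, cands, rs in zip(prompts, samples, rewards):
        best = max(range(len(cands)), key=rs.__getitem__)
        if rs[best] >= threshold:            # still gate on quality
            selected.append((prompt, cands[best]))
    return selected
```

Keeping at most one sample per prompt reduces duplication in the fine-tuning set, at the cost of discarding other verified-correct solutions.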

## Results

Production training on Qwen2.5-Coder-7B with 569 C/C++ prompts:

Peak performance came at cycle 6: a 46.7% compile rate and 55.3% pass@1, up from the SFT baseline of 15.2% / 18.7%.

| Stage | Compile Rate | pass@1 |
|-------|--------------|--------|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 (Peak) | 46.7% | 55.3% |
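How pass@1 is estimated isn't stated here; if it follows the standard unbiased pass@k estimator from Chen et al. (2021), computed over the n samples drawn per prompt, it would look like this (a sketch under that assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n samples drawn per task, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 8 samples per prompt with 3 correct -> estimated pass@1 of 0.375
print(pass_at_k(n=8, c=3, k=1))
```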

## Quick Start

```bash
# Build the toolbox (ROCm 7 + PyTorch nightly)
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox && ./build.sh

# Create and enter the container
toolbox create halo-forge --image localhost/halo-forge:latest
toolbox enter halo-forge

# Validate installation
halo-forge test --level smoke      # Quick check, no GPU
halo-forge test --level standard   # Full validation

# Run demo benchmark
halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2
```

## Graduated Rewards

Binary pass/fail rewards make the learning signal sparse: early in training, when almost every sample fails, the model gets nothing to learn from. halo-forge uses graduated rewards instead:

| Outcome | Reward | Signal |
|---------|--------|--------|
| Syntax error | 0.0 | Completely wrong |
| Compiles with warnings | 0.3 | Close but imperfect |
| Compiles clean | 0.5 | Correct syntax |
| Runs without crash | 0.7 | Executable |
| Correct output | 1.0 | Fully correct |
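For concreteness, here is a minimal sketch of how these tiers could map onto a gcc-based verifier. This is illustrative only; halo-forge's actual Verifier API, compiler flags, and tier logic may differ.

```python
import os
import subprocess
import tempfile

def graduated_reward(source: str, expected_output: str,
                     timeout: float = 5.0) -> float:
    """Score a C candidate against the reward tiers above (sketch)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(source)

        compiled = subprocess.run(["gcc", "-Wall", "-o", binary, src],
                                  capture_output=True, text=True)
        if compiled.returncode != 0:
            return 0.0                       # syntax / compile error
        if compiled.stderr.strip():
            return 0.3                       # compiles, but with warnings

        try:
            run = subprocess.run([binary], capture_output=True,
                                 text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return 0.5                       # compiled clean, but hangs
        if run.returncode != 0:
            return 0.5                       # compiled clean, crashes
        if run.stdout.strip() != expected_output.strip():
            return 0.7                       # runs fine, wrong output
        return 1.0                           # fully correct
```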

## Related Projects

- **malagent** — applies RLVR to security research (EDR evasion, with Elastic Security as the verifier)