
# Verification-Guided Code Generation Training

An RLVR (reinforcement learning with verifiable rewards) framework using compiler feedback as reward signals for iterative model refinement. Built for AMD Strix Halo.
## Why This Exists
An open-source experiment in local model training on AMD's new unified memory architecture.
- **128GB Unified Memory**: AMD Strix Halo can fit 7B+ models entirely in memory, no quantization needed
- **Compiler as Teacher**: Use GCC/pytest as a free, deterministic reward signal, no human labeling required
- **Iterate Locally**: Run full RLVR training cycles on your own hardware, no cloud dependencies
## The RAFT Approach
halo-forge implements RAFT (Reward-Ranked Fine-Tuning) — essentially iterated rejection sampling:
```python
for cycle in range(num_cycles):
    # 1. Generate candidate samples
    samples = model.generate(prompts, n=8)

    # 2. Verify with the compiler / test suite
    results = verifier.verify_batch(samples)

    # 3. Filter by reward threshold
    filtered = [s for s, r in zip(samples, results)
                if r.reward >= 0.5]

    # 4. Fine-tune on verified samples
    model.train(filtered)
```

This is simpler than PPO/GRPO (1x model memory vs 2-4x), stable to train, and produces comparable results.
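The loop assumes a verifier that returns one result per sample with a scalar reward. A minimal sketch of that contract is below; the `VerifyResult` and `Verifier` names are illustrative, not halo-forge's actual API:

```python
from dataclasses import dataclass

@dataclass
class VerifyResult:
    reward: float     # graduated score in [0.0, 1.0] (see the reward table below)
    detail: str = ""  # e.g. compiler or test diagnostics, useful for debugging

class Verifier:
    """Minimal interface the RAFT loop above assumes."""

    def verify(self, sample: str) -> VerifyResult:
        raise NotImplementedError

    def verify_batch(self, samples: list[str]) -> list[VerifyResult]:
        # Sequential for clarity; a real verifier would batch or parallelize.
        return [self.verify(s) for s in samples]
```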
## How It Works
RAFT iteratively improves code generation through verified feedback:
| Cycle | What Happens | Expected Outcome |
|---|---|---|
| 1-2 | Model learns basic patterns | Largest gains |
| 3-4 | Refinement continues | Moderate gains |
| 5-6 | Gains approach a plateau | Monitor for stopping |
Results vary by model, dataset, and domain. Run your own benchmarks to measure improvement.
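One way to act on "monitor for stopping" is to track the verified pass rate after each cycle and stop once the cycle-over-cycle gain stalls. A minimal sketch, where the 0.01 threshold and the sample history are illustrative assumptions:

```python
def should_stop(pass_rates: list[float], min_gain: float = 0.01) -> bool:
    """Stop once the cycle-over-cycle gain in verified pass rate stalls."""
    if len(pass_rates) < 2:
        return False
    return pass_rates[-1] - pass_rates[-2] < min_gain

# Example: gains shrink across cycles, triggering a stop after cycle 5.
history = [0.31, 0.48, 0.55, 0.58, 0.585]
print(should_stop(history))  # True: the last gain was 0.005 < 0.01
```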
## Quick Start
```bash
# Build the toolbox (ROCm 7 + PyTorch nightly)
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox && ./build.sh

# Create and enter the container
toolbox create halo-forge --image localhost/halo-forge:latest
toolbox enter halo-forge

# Validate installation
halo-forge test --level smoke     # Quick check, no GPU
halo-forge test --level standard  # Full validation

# Run demo benchmark
halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2
```

## Graduated Rewards
Binary rewards create sparse gradients. halo-forge uses graduated rewards:
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Completely wrong |
| Compiles with warnings | 0.3 | Close but imperfect |
| Compiles clean | 0.5 | Correct syntax |
| Runs without crash | 0.7 | Executable |
| Correct output | 1.0 | Fully correct |
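For intuition, here is a hedged sketch of how such a scheme could be implemented for C sources using gcc via subprocess. It mirrors the table above but is not halo-forge's actual verifier, and `expected_output` is a hypothetical parameter:

```python
import os
import subprocess
import tempfile

def graduated_reward(source: str, expected_output: str) -> float:
    """Map compile/run outcomes for a C sample to a graduated reward."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.c")
        binary = os.path.join(tmp, "sample")
        with open(src, "w") as f:
            f.write(source)

        # Compile with warnings enabled so clean builds are distinguishable.
        build = subprocess.run(
            ["gcc", "-Wall", "-o", binary, src],
            capture_output=True, text=True,
        )
        if build.returncode != 0:
            return 0.0                    # syntax / compile error
        if build.stderr.strip():
            return 0.3                    # compiles, but with warnings

        # Run the binary with a timeout to guard against infinite loops.
        try:
            run = subprocess.run(
                [binary], capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return 0.5                    # compiled clean, but never finished
        if run.returncode != 0:
            return 0.5                    # compiled clean, crashed at runtime
        if run.stdout.strip() == expected_output.strip():
            return 1.0                    # fully correct
        return 0.7                        # runs without crashing, wrong output
```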
## Related Projects

- **malagent**: applies RLVR to security research (EDR evasion with Elastic Security as the verifier)