# Documentation

Complete documentation for the halo-forge RLVR training framework.
## What is halo-forge?
halo-forge is an RLVR (Reinforcement Learning from Verifiable Rewards) framework that uses compiler feedback as reward signals for iterative model refinement.
### The Problem
| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |
### The Solution
A compiler provides a perfect reward signal — unambiguous, deterministic feedback about code correctness that cannot be gamed.
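As a minimal sketch of what such a reward can look like (not halo-forge's actual verifier API; the function name, compiler choice, and flags below are assumptions), a binary compile reward maps the compiler's exit status to a score:

```python
import subprocess
import tempfile
from pathlib import Path

def compile_reward(code: str, compiler: str = "gcc", std: str = "c11") -> float:
    """Hypothetical binary compile reward: 1.0 if the snippet compiles, else 0.0.

    halo-forge's real compile verifiers are richer (graduated rewards,
    multiple toolchains); this only illustrates the verifiable-reward idea.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        src.write_text(code)
        try:
            result = subprocess.run(
                [compiler, f"-std={std}", "-c", str(src), "-o", str(Path(tmp) / "candidate.o")],
                capture_output=True,
                timeout=30,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    # Exit status 0 means the translation unit compiled cleanly.
    return 1.0 if result.returncode == 0 else 0.0
```

Because the signal comes from the toolchain itself, identical code always receives an identical reward, which is what makes it verifiable.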
## Architecture

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Data   │ →  │   SFT    │ →  │   RAFT   │ →  │ Benchmark │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
```
- Data — Gather training examples from public datasets or LLM generation
- SFT — Supervised fine-tuning to establish baseline capability
- RAFT — Iterative verification loop: generate → verify → filter → train (sketched below)
- Benchmark — Evaluate with pass@k metrics
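One RAFT cycle, sketched with hypothetical helper names (`model.generate`, `verify`, `fine_tune`) standing in for the framework's actual components:

```python
def raft_cycle(model, prompts, verify, fine_tune, k=8, threshold=1.0):
    """Illustrative sketch of one generate → verify → filter → train cycle."""
    kept = []
    for prompt in prompts:
        # Generate: sample k candidate completions per prompt.
        candidates = [model.generate(prompt) for _ in range(k)]
        # Verify: score each candidate with the compiler-backed verifier.
        scored = [(cand, verify(cand)) for cand in candidates]
        # Filter: keep only candidates whose reward clears the threshold.
        kept.extend((prompt, cand) for cand, reward in scored if reward >= threshold)
    # Train: supervised fine-tuning on the surviving (prompt, completion) pairs.
    return fine_tune(model, kept)
```

Each subsequent cycle starts from the newly fine-tuned model, which is why compile rate and pass@1 climb across cycles in the results below.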
## Results
Production training on Qwen2.5-Coder-7B with 569 C/C++ prompts:
| Stage | Compile Rate | pass@1 |
|---|---|---|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 (Peak) | 46.7% | 55.3% |
Roughly a 3x improvement in both compile rate (15.2% → 46.7%) and pass@1 (18.7% → 55.3%) over 6 RAFT cycles.
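pass@k is conventionally computed with the unbiased estimator from the HumanEval evaluation: with n samples per prompt, of which c pass verification, pass@k = 1 - C(n-c, k) / C(n, k). A small reference implementation follows; whether halo-forge's benchmark harness uses exactly this estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (of which c are correct) passes verification."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples per prompt, 3 verified correct.
print(pass_at_k(n=8, c=3, k=1))  # 0.375
```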
## Quick Navigation
### Getting Started
- Quick Start — Get running in 30 minutes
- Toolbox Setup — Build the container environment
- Hardware Notes — Strix Halo configuration
### Training Pipeline
- Full Pipeline — Complete training workflow
- Data Generation — Prepare training data
- SFT Training — Supervised fine-tuning
- RAFT Training — Reward-ranked fine-tuning
- Benchmarking — Evaluate with pass@k
### Verifiers
- Verifier Overview — Choose your verification strategy
- Compile Verifiers — GCC, Clang, MinGW, MSVC
- Test Verifiers — pytest, unittest
- Custom Verifiers — Build your own
### Reference
- Configuration — Complete config reference
- CLI Reference — Command-line interface
- Troubleshooting — Common issues
### Background
- Theory & Research — Research foundations
- RAFT vs PPO vs GRPO — Algorithm comparison
- Graduated Rewards — Partial credit system
- Learning Rate Strategies — LR recommendations
### Meta
- Changelog — Version history
- Contributing — How to contribute