# Documentation

Complete documentation for the halo-forge RLVR training framework.

## What is halo-forge?

halo-forge is an RLVR (Reinforcement Learning from Verifiable Rewards) framework that uses compiler feedback as reward signals for iterative model refinement.

### The Problem

| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |

### The Solution

A compiler provides a perfect reward signal — unambiguous, deterministic feedback about code correctness that cannot be gamed.
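
In the simplest case, the reward is just the compiler's exit status. A minimal sketch, assuming `gcc` is on the PATH; the `compile_reward` helper below is hypothetical, illustrating the idea rather than halo-forge's actual API:

```python
import subprocess
import tempfile
from pathlib import Path

def compile_reward(code: str, compiler: str = "gcc") -> float:
    """Return 1.0 if `code` compiles as a C translation unit, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        src.write_text(code)
        try:
            result = subprocess.run(
                [compiler, "-c", str(src), "-o", str(Path(tmp) / "candidate.o")],
                capture_output=True,
                timeout=30,  # guard against pathological inputs
            )
        except subprocess.TimeoutExpired:
            return 0.0
        # Exit status 0 means the compiler accepted the code: a deterministic,
        # binary signal with no human judgment in the loop.
        return 1.0 if result.returncode == 0 else 0.0
```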

## Architecture

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Data   │ →  │   SFT    │ →  │   RAFT   │ →  │ Benchmark │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
```
  1. Data — Gather training examples from public datasets or LLM generation
  2. SFT — Supervised fine-tuning to establish baseline capability
  3. RAFT — Iterative verification loop: generate → verify → filter → train (see the sketch after this list)
  4. Benchmark — Evaluate with pass@k metrics
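
The RAFT stage is the core of the framework. Here is a minimal sketch of one cycle, with `generate`, `verify`, and `finetune` passed in as abstract callables; these are illustrative stand-ins, not halo-forge's actual interfaces:

```python
from typing import Callable, List, Tuple

def raft_cycle(
    generate: Callable[[str], str],
    verify: Callable[[str], float],
    finetune: Callable[[List[Tuple[str, str]]], None],
    prompts: List[str],
    samples_per_prompt: int = 8,
) -> List[Tuple[str, str]]:
    """Run one generate -> verify -> filter -> train RAFT cycle."""
    accepted: List[Tuple[str, str]] = []
    for prompt in prompts:
        # Generate: sample several candidate completions per prompt.
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        # Verify + filter: keep only completions the verifier rewards.
        accepted.extend((prompt, code) for code in candidates if verify(code) > 0)
    # Train: fine-tune on the verified (prompt, completion) pairs so the
    # next cycle samples from a model biased toward code that compiles.
    finetune(accepted)
    return accepted
```

Because only verified completions survive into the fine-tuning set, each cycle trains on a progressively more correct output distribution.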

## Results

Production training on Qwen2.5-Coder-7B with 569 C/C++ prompts:

| Stage | Compile Rate | pass@1 |
|---|---|---|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 (Peak) | 46.7% | 55.3% |

Roughly a 3× improvement in both compile rate and pass@1 over six RAFT cycles.
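
pass@k can be read as the probability that at least one of k sampled completions passes verification. The standard way to compute it is the unbiased estimator of Chen et al. (2021); halo-forge's own benchmark code is not shown here, so this is illustrative:

```python
# Unbiased pass@k estimator (Chen et al., 2021): draw n samples per problem,
# count the c that pass, and estimate over all size-k subsets.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples per problem, c = samples that passed, k <= n."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: with 10 samples per prompt and 3 passing, pass@1 estimates to 0.3.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```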

## Quick Navigation

- Getting Started
- Training Pipeline
- Verifiers
- Reference
- Background
- Meta