Documentation

Complete documentation for the halo-forge RLVR training framework

What is halo-forge?

halo-forge is an RLVR (Reinforcement Learning from Verifiable Rewards) framework that uses compiler feedback as reward signals for iterative model refinement.

The Problem

| Approach | Limitation |
|---|---|
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments |
| Self-evaluation | Models hallucinate correctness; signals can be gamed |

The Approach

A compiler provides deterministic feedback — objective, reproducible results about code correctness.
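
For example, a compile check reduces to a single pass/fail reward. The sketch below (Python, with illustrative names only — this is not halo-forge's actual API) shows the idea, assuming `gcc` is available on the machine:

```python
# Minimal sketch of a compile-based reward: write the candidate program to a
# temp file, invoke the compiler, and return 1.0 on success, 0.0 otherwise.
# Illustrative only; halo-forge's real verifier interface may differ.
import subprocess
import tempfile
from pathlib import Path


def compile_reward(code: str, compiler: str = "gcc") -> float:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        src.write_text(code)
        try:
            result = subprocess.run(
                [compiler, "-c", str(src), "-o", str(Path(tmp) / "candidate.o")],
                capture_output=True,
                timeout=30,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

Because the score comes from the toolchain rather than a judge model, the same candidate always receives the same reward, which is what makes the signal objective, reproducible, and hard to game.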

Architecture

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Data   │ →  │   SFT    │ →  │   RAFT   │ →  │ Benchmark │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
  1. Data — Gather training examples from public datasets or LLM generation
  2. SFT — Supervised fine-tuning to establish baseline capability
  3. RAFT — Iterative verification loop: generate → verify → filter → train (sketched after this list)
  4. Benchmark — Evaluate with pass@k metrics
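
The RAFT stage is the core of the loop. A minimal sketch of one cycle is shown below; the three callables (`generate_fn`, `reward_fn`, `train_fn`) stand in for whatever sampling, verification, and fine-tuning code you plug in and are not part of halo-forge's API:

```python
# One RAFT cycle: generate candidates, verify each with the reward function,
# keep only passing samples, and fine-tune on them. Helper callables are
# placeholders supplied by the caller, not halo-forge's actual interfaces.
from typing import Callable, Iterable


def raft_cycle(
    prompts: Iterable[str],
    generate_fn: Callable[[str, int], list[str]],  # prompt, n -> candidate programs
    reward_fn: Callable[[str], float],             # program -> verifier score
    train_fn: Callable[[list[dict]], None],        # filtered samples -> model update
    samples_per_prompt: int = 8,
    threshold: float = 1.0,
) -> list[dict]:
    kept = []
    for prompt in prompts:
        for code in generate_fn(prompt, samples_per_prompt):       # generate
            if reward_fn(code) >= threshold:                       # verify + filter
                kept.append({"prompt": prompt, "completion": code})
    train_fn(kept)                                                 # train
    return kept
```

Each cycle trains only on samples the verifier accepted, so the model's own outputs become new supervised data once they pass the compiler.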

Example Results

Results from test runs on Qwen2.5-Coder-7B with 569 C/C++ prompts. Your results will vary based on model, dataset, hardware, and configuration.

| Stage | Compile Rate | pass@1 |
|---|---|---|
| SFT Baseline | 15.2% | 18.7% |
| Cycle 1 | 28.4% | 35.2% |
| Cycle 3 | 39.7% | 48.2% |
| Cycle 6 | 46.7% | 55.3% |

Both compile rate and pass@1 improved across the 6 RAFT cycles in our testing.
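
pass@1 above is the k = 1 case of pass@k. If halo-forge follows the usual convention, the metric is computed with the standard unbiased estimator: sample n completions per prompt and count the c that pass, sketched here:

```python
# Standard unbiased pass@k estimator: probability that at least one of k
# samples passes, given that c of the n drawn samples passed. This is the
# common formula, not necessarily halo-forge's exact implementation.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```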

Quick Navigation

Getting Started

Training Pipeline

Verifiers

Reference

Background

Meta