Documentation

Complete documentation for the halo forge RLVR training framework.

What is halo forge?

halo forge is an RLVR (Reinforcement Learning from Verifiable Rewards) framework that uses compiler feedback as reward signals for iterative model refinement.

The Problem

| Approach | Limitation |
| --- | --- |
| SFT only | Distribution mismatch — model outputs differ from training data |
| RLHF | Expensive human labeling, inconsistent judgments |
| Self-evaluation | Models hallucinate correctness, signals can be gamed |

The Approach

A compiler provides deterministic feedback — objective, reproducible results about code correctness.
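The core idea can be sketched as a binary reward: a candidate program that passes the compiler's check earns 1.0, anything else earns 0.0. The snippet below illustrates this with Python's built-in `compile()` as the "compiler"; halo forge's actual verifier interface is not shown in this document, so treat `compile_reward` as a hypothetical helper, not the framework's API.

```python
def compile_reward(source: str) -> float:
    """Verifiable reward from compiler feedback (illustrative sketch).

    Deterministic and reproducible: the same source always yields
    the same reward, unlike human or self-evaluated judgments.
    """
    try:
        # Syntax/compile check only -- the candidate is never executed.
        compile(source, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0
```

For compiled languages the same shape applies, with the `try`/`except` replaced by invoking the toolchain (e.g. a syntax-only compile) and checking its exit code.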

Architecture

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Data   │ ─► │   SFT    │ ─► │   RAFT   │ ─► │ Benchmark │
└──────────┘    └──────────┘    └──────────┘    └───────────┘
  1. Data — Gather training examples from public datasets or LLM generation
  2. SFT — Supervised fine-tuning to establish baseline capability
  3. RAFT — Iterative verification loop: generate → verify → filter → train
  4. Benchmark — Evaluate with pass@k metrics
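Step 3's generate → verify → filter → train loop can be sketched as a single cycle. The function names (`generate`, `verify`, `train`) are placeholders for the model-sampling, compiler-check, and fine-tuning stages; they illustrate the RAFT control flow, not halo forge's actual interfaces.

```python
def raft_cycle(prompts, generate, verify, train, k=8, threshold=1.0):
    """One RAFT cycle (illustrative sketch, hypothetical callables).

    For each prompt, sample k candidates, keep those whose verifier
    reward meets the threshold, then fine-tune on the survivors.
    """
    kept = []
    for prompt in prompts:
        for candidate in generate(prompt, k):
            if verify(candidate) >= threshold:
                kept.append((prompt, candidate))
    if kept:
        # Fine-tune on the filtered (prompt, completion) pairs.
        train(kept)
    return kept
```

Running this repeatedly gives the iterative refinement described above: each cycle's model produces the next cycle's candidates.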

What to Expect

RAFT training typically shows:

| Cycle | What Happens |
| --- | --- |
| 1-2 | Largest gains as model learns basic patterns |
| 3-4 | Continued improvement at slower rate |
| 5-6 | Diminishing returns; monitor for plateau |
| 7+ | May see degradation; consider stopping earlier |

Results vary significantly based on model, dataset, hardware, and domain. Run benchmarks to measure improvement on your specific use case.
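pass@k is commonly computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c that pass, and estimate the probability that at least one of k draws would pass. Whether halo forge uses exactly this estimator is an assumption; the sketch below shows the standard formulation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: samples that passed verification
    k: budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Comparing pass@k before and after RAFT cycles, on a held-out benchmark, is how plateau or degradation in the table above shows up in practice.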

Quick Navigation

Getting Started

Training Pipeline

Verifiers

Reference

Background

Experimental

Features under active development and testing:

Meta