# RAFT Training

Reward-Ranked Fine-Tuning with compiler verification.

RAFT (Reward-Ranked Fine-Tuning) is the core of halo-forge’s RLVR approach.

## The Algorithm
```
for cycle in range(num_cycles):
    1. Generate N samples per prompt
    2. Verify all samples with compiler
    3. Filter to samples with reward >= threshold
    4. Fine-tune on filtered samples
    5. Repeat with updated model
```
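Expressed as code, one cycle looks roughly like the sketch below. The function names (`generate_samples`, `fine_tune`, and the `verifier` interface) are illustrative placeholders, not the halo-forge API:

```python
# Illustrative sketch of the RAFT loop. Function names are placeholders,
# not the actual halo-forge API.
def raft(model, prompts, verifier, num_cycles=5, n=8, threshold=0.5):
    for cycle in range(num_cycles):
        # 1. Generate N samples per prompt
        samples = [s for p in prompts for s in generate_samples(model, p, n)]
        # 2. Verify all samples with the compiler (reward in [0.0, 1.0])
        rewards = [verifier.verify(s) for s in samples]
        # 3. Filter to samples with reward >= threshold
        kept = [s for s, r in zip(samples, rewards) if r >= threshold]
        # 4. Fine-tune on the filtered samples
        model = fine_tune(model, kept)
        # 5. The next cycle generates with the updated model
    return model
```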
## Basic Usage

```bash
halo-forge raft train \
  --checkpoint models/sft/final_model \
  --prompts data/prompts.jsonl \
  --verifier gcc \
  --cycles 5 \
  --output models/raft
```
## Configuration

```yaml
# configs/raft.yaml
sft_checkpoint: models/sft/final_model
output_dir: models/raft
prompts: data/prompts.jsonl

raft:
  num_cycles: 5
  samples_per_prompt: 8
  reward_threshold: 0.5
  keep_top_percent: 0.5

generation:
  max_new_tokens: 1024
  temperature: 0.7
  top_p: 0.95
  batch_size: 4

training:
  epochs: 1
  batch_size: 2
  gradient_accumulation_steps: 16
  learning_rate: 5e-5

verifier:
  type: gcc
  run_after_compile: false
```
## Key Parameters

### `samples_per_prompt`
How many completions to generate for each prompt.
| Value | Exploration | Compute |
|---|---|---|
| 4 | Low | Fast |
| 8 | Medium | Balanced |
| 16 | High | Slow |
### `reward_threshold`
Minimum reward to keep a sample.
| Value | Effect |
|---|---|
| 0.3 | Keep samples with warnings |
| 0.5 | Keep clean compiles |
| 0.7 | Keep samples that run |
| 1.0 | Only correct outputs |
### `temperature`
Controls generation diversity.
| Value | Effect |
|---|---|
| 0.3 | Conservative, similar outputs |
| 0.7 | Balanced |
| 1.0 | Diverse, more exploration |
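These knobs correspond to standard sampling arguments. As a generic illustration (using the Hugging Face `transformers` API, not halo-forge internals), the generation config above maps to a call like:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic illustration of the sampling settings; not halo-forge's own code.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")

inputs = tok("Write a C function that reverses a string.", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=1024,     # generation.max_new_tokens
    do_sample=True,          # sampling must be on for temperature/top_p to apply
    temperature=0.7,         # generation.temperature
    top_p=0.95,              # generation.top_p
    num_return_sequences=8,  # samples_per_prompt
)
```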
## Sample Filtering Strategy

RAFT generates many samples and filters them before training. The filtering strategy affects what the model learns.

### How Filtering Works
1. Generate N samples per prompt (e.g., 374 prompts × 8 samples = 2,992 samples)
2. Verify all samples → each gets a reward score (0.0 to 1.0)
3. Filter by `reward_threshold` → keep samples with reward >= threshold
4. Sort by reward (highest first)
5. Keep top `keep_top_percent` of filtered samples
6. Train on the final filtered set (steps 3-5 are sketched in code below)
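A minimal sketch of steps 3-5, assuming each sample is a dict carrying a `reward` field (illustrative, not the halo-forge implementation):

```python
# Minimal sketch of threshold filtering plus top-percent selection.
# Assumes each sample is a dict with a "reward" key; illustrative only.
def filter_samples(samples, reward_threshold=0.5, keep_top_percent=0.5):
    # Step 3: drop everything below the reward threshold
    passing = [s for s in samples if s["reward"] >= reward_threshold]
    # Step 4: sort best-first
    passing.sort(key=lambda s: s["reward"], reverse=True)
    # Step 5: keep the top fraction of what passed
    keep_n = int(len(passing) * keep_top_percent)
    return passing[:keep_n]
```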
### `keep_top_percent`

After filtering by threshold, what percentage of passing samples to keep.

```bash
# CLI usage
halo-forge raft train --keep-percent 0.5 ...  # Keep top 50%
```
| Value | Effect | Use Case |
|---|---|---|
| 0.2 | Very selective (top 20%) | Large datasets, selective filtering |
| 0.5 | Balanced (top 50%) | Default, works well generally |
| 0.8 | Inclusive (top 80%) | Small datasets, need more data |
| 1.0 | Keep all passing | Very small datasets |
### Filtering Examples

**Example 1: Default settings (50% threshold, 50% keep)**

```
Generated:              2,992 samples
Pass threshold (≥0.5):  1,200 samples (40% compile rate)
Keep top 50%:             600 samples → training set
```

**Example 2: Selective filtering (50% threshold, 20% keep)**

```
Generated:              2,992 samples
Pass threshold (≥0.5):  1,200 samples
Keep top 20%:             240 samples → training on top performers
```

**Example 3: Inclusive filtering (30% threshold, 100% keep)**

```
Generated:              2,992 samples
Pass threshold (≥0.3):  1,800 samples (including warnings)
Keep top 100%:          1,800 samples → maximum training data
```
### Choosing Your Strategy

**Small dataset (< 500 prompts):**

```bash
--keep-percent 0.8 --reward-threshold 0.5
```

Keep more data, since you have fewer prompts to begin with.

**Large dataset (> 1000 prompts):**

```bash
--keep-percent 0.3 --reward-threshold 0.5
```

Be selective: train only on high-quality samples.

**Low initial compile rate (< 20%):**

```bash
--keep-percent 1.0 --reward-threshold 0.3
```

Keep everything that compiles (even with warnings) to maximize training signal.

**High initial compile rate (> 50%):**

```bash
--keep-percent 0.5 --reward-threshold 0.7
```

Be selective and train only on samples that actually execute.
## CLI Reference

```bash
halo-forge raft train \
  --model Qwen/Qwen2.5-Coder-7B \
  --prompts data/prompts.jsonl \
  --verifier mbpp \
  --cycles 5 \
  --keep-percent 0.5 \
  --reward-threshold 0.5 \
  --output models/production
```

Here `--keep-percent 0.5` keeps the top 50% of passing samples, and `--reward-threshold 0.5` sets the minimum reward a sample needs to pass.
### Configuration File

```yaml
# configs/raft.yaml
raft:
  samples_per_prompt: 8
  reward_threshold: 0.5   # Minimum reward to keep a sample
  keep_top_percent: 0.5   # Keep top 50% above threshold
```
## Cycle-by-Cycle Output

```
models/raft/
├── cycle_1/
│   ├── samples.jsonl    # All generated samples
│   ├── kept.jsonl       # Filtered samples used for training
│   ├── checkpoint/      # Model after training
│   └── stats.json       # Verification statistics
├── cycle_2/
│   └── ...
├── cycle_3_final/       # Cycle 3 checkpoint
└── training_log.json
```
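To inspect results programmatically, you can walk the per-cycle `stats.json` files. The layout of the stats themselves is not documented here, so this snippet simply prints whatever each file contains:

```python
import json
from pathlib import Path

# Print each cycle's verification statistics. The exact keys inside
# stats.json are not specified here, so we just dump the contents.
for stats_path in sorted(Path("models/raft").glob("cycle_*/stats.json")):
    stats = json.loads(stats_path.read_text())
    print(stats_path.parent.name, stats)
```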
## Monitoring Progress

### Progress Display

halo-forge provides real-time progress during generation:

```
Generating batch 13/47 ━━━━━━━━━━ 13/47 • 28% • 1:13:10 • 0.15 it/s
```

- `13/47`: Current batch / total batches
- `28%`: Percentage complete
- `1:13:10`: Time elapsed
- `0.15 it/s`: Iterations per second (batches processed)
### Compile Rate Patterns

Watch for these patterns:

**Healthy training:**

```
Cycle 1: 28.4% compile rate (kept 182/640 samples)
Cycle 2: 35.1% compile rate (kept 224/640 samples)
Cycle 3: 39.7% compile rate (kept 254/640 samples)
```

**Plateauing:**

```
Cycle 4: 40.2% compile rate
Cycle 5: 40.5% compile rate
Cycle 6: 39.8% compile rate   # Consider stopping
```

**Degradation:**

```
Cycle 7: 35.2% compile rate   # Stop and use cycle 6
```
### When to Stop
- Diminishing returns: < 2% improvement per cycle
- Degradation: Performance drops
- Observation: In our testing, 5-6 cycles often worked well
If you see degradation, consider learning rate decay.
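If you automate runs, these rules reduce to a small check over the per-cycle compile rates (an illustrative sketch, not a halo-forge feature):

```python
# Illustrative stopping check based on per-cycle compile rates.
def should_stop(compile_rates, min_improvement=0.02):
    if len(compile_rates) < 2:
        return False
    prev, curr = compile_rates[-2], compile_rates[-1]
    if curr < prev:
        return True  # degradation: roll back to the previous checkpoint
    return (curr - prev) < min_improvement  # diminishing returns

rates = [0.284, 0.351, 0.397, 0.402]
print(should_stop(rates))  # True: the last cycle gained only 0.5 points
```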
## Graduated Rewards
halo-forge uses graduated rewards for better gradient flow:
| Outcome | Reward |
|---|---|
| Syntax error | 0.0 |
| Compiles with warnings | 0.3 |
| Compiles clean | 0.5 |
| Runs without crash | 0.7 |
| Correct output | 1.0 |
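A sketch of how verification outcomes might map onto this ladder (argument names are illustrative, not the verifier's real interface):

```python
# Illustrative mapping from verification outcomes to the reward ladder above.
def graduated_reward(compiled: bool, warnings: bool, ran: bool, correct: bool) -> float:
    if not compiled:
        return 0.0  # syntax / compile error
    if warnings:
        return 0.3  # compiles, but with warnings
    if not ran:
        return 0.5  # clean compile, but never ran successfully
    if not correct:
        return 0.7  # runs without crashing, output is wrong
    return 1.0      # correct output
```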
## Caching & Automatic Resume
RAFT automatically caches progress at each stage, allowing seamless resume after crashes or interruptions.
### Cache Files
For each cycle, these files are saved to your output directory:
```
models/raft/
├── cycle_1_samples.jsonl    # All generated completions
├── cycle_1_verified.jsonl   # Verification results with rewards
├── cycle_1_final/           # Trained model checkpoint
│   ├── adapter_config.json
│   └── adapter_model.safetensors
├── cycle_2_samples.jsonl
├── cycle_2_verified.jsonl
├── cycle_2_final/
└── ...
```
### Automatic Resume Behavior

When you re-run the same command, RAFT automatically detects existing cache files:

| If this exists… | RAFT will… |
|---|---|
| `cycle_N_final/` | Skip the entire cycle, load the checkpoint |
| `cycle_N_samples.jsonl` | Skip generation, load cached samples |
| `cycle_N_verified.jsonl` | Skip verification, load cached results |
| Nothing | Start from scratch |
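Conceptually, the per-cycle resume check follows the cache layout above. In the sketch below, `load_checkpoint`, `load_jsonl`, `generate_samples`, `verify`, and `train` are placeholders for the corresponding stages, not halo-forge functions:

```python
from pathlib import Path

# Conceptual resume logic for one cycle; helper functions are placeholders.
def run_cycle(out_dir: Path, n: int):
    final = out_dir / f"cycle_{n}_final"
    samples_file = out_dir / f"cycle_{n}_samples.jsonl"
    verified_file = out_dir / f"cycle_{n}_verified.jsonl"

    if final.exists():
        return load_checkpoint(final)  # skip the entire cycle

    samples = load_jsonl(samples_file) if samples_file.exists() else generate_samples(n)
    verified = load_jsonl(verified_file) if verified_file.exists() else verify(samples)
    return train(samples, verified, output=final)
```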
### Resume Example

```bash
# Original run crashes during cycle 2 verification
halo-forge raft train --model Qwen/Qwen2.5-Coder-7B --cycles 5 --output models/raft

# State after crash:
# - cycle_1_final/ exists (complete)
# - cycle_2_samples.jsonl exists (generation done)
# - cycle_2_verified.jsonl missing (crashed during verification)

# Resume with the SAME command:
halo-forge raft train --model Qwen/Qwen2.5-Coder-7B --cycles 5 --output models/raft

# RAFT will:
# 1. Skip cycle 1 entirely (loads the cycle_1_final checkpoint)
# 2. Load cycle_2_samples.jsonl (skipping 8+ hours of generation)
# 3. Rerun verification from the start
# 4. Continue normally
```
### What Gets Logged

```
Cycle 1 already complete, skipping...
Loading cached samples...
Loaded 2992 samples from cache
Verifying 2992 samples...
```
### Recovery Time Savings
| Crash Point | Time Saved on Resume |
|---|---|
| During generation batch 50/100 | 0 (generation restarts) |
| After generation, during verification | ~4-8 hours (full generation) |
| After verification, during training | ~8-10 hours (generation + verification) |
| After training completes | Entire cycle skipped |
### Manual Cache Management

Clear the cache to restart a cycle:

```bash
rm models/raft/cycle_3_*   # Remove cycle 3 cache files
```

Force a fresh start:

```bash
rm -rf models/raft/cycle_*   # Remove all cache files
```
### Known Limitation

Currently, mid-generation resume is not supported. If a crash occurs during generation (before all batches complete), that cycle’s generation restarts from scratch. Samples are only cached once all generation completes.

Tip: For very long runs, consider running inside `screen` or `tmux` to prevent terminal disconnection issues.
## Memory Management

RAFT uses chunked verification to prevent OOM:

```python
import gc

# Verify in chunks of 200 to keep peak memory bounded
chunk_size = 200
results = []
for i in range(0, len(samples), chunk_size):
    chunk = samples[i:i + chunk_size]
    results.extend(verifier.verify_batch(chunk))
    gc.collect()  # Force garbage collection between chunks
```
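Smaller chunks lower peak memory at the cost of a little overhead per chunk; if verification still runs out of memory, reducing `chunk_size` is usually the first knob to try.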
## Advanced: Different Verifiers per Cycle

```yaml
cycles:
  - verifier: gcc
    reward_threshold: 0.3
  - verifier: gcc
    reward_threshold: 0.5
  - verifier: gcc
    run_after_compile: true
    reward_threshold: 0.7
```
This creates a curriculum: lenient criteria in early cycles, stricter criteria in later ones.
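If you drive cycles from your own script, the same curriculum is just an indexed schedule; a hypothetical sketch:

```python
# Hypothetical per-cycle schedule mirroring the YAML above: the Nth list
# entry configures cycle N.
schedule = [
    {"verifier": "gcc", "reward_threshold": 0.3},
    {"verifier": "gcc", "reward_threshold": 0.5},
    {"verifier": "gcc", "reward_threshold": 0.7, "run_after_compile": True},
]
for cycle, cfg in enumerate(schedule, start=1):
    print(f"cycle {cycle}: verifier={cfg['verifier']}, "
          f"min reward={cfg['reward_threshold']}")
```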
## Verifier Choice

| Verifier | Use Case |
|---|---|
| `gcc` | Linux C/C++ |
| `mingw` | Windows C/C++ (cross-compile) |
| `clang` | Alternative to GCC |
| `humaneval` | Python with HumanEval tests |
| `mbpp` | Python with MBPP tests |
| `custom` | Your own verifier |
See Verifiers for details.
## Tips
- Start with 5 cycles — More isn’t always better
- Watch for degradation — Stop if cycle N+1 is worse than N
- Use 8 samples/prompt — Good balance of diversity and compute
- Temperature 0.7 — Enough diversity without garbage
- Monitor GPU memory — RAFT is memory-intensive