How to Train
Complete guide to training code generation models with halo-forge
This guide walks you through training a code generation model end to end using RAFT. Start with the quick start for immediate results, then explore the advanced sections for optimization.
TL;DR - Quick Start (10 minutes)
Already have the toolbox built? Run training immediately:
# 1. Enter the toolbox
toolbox enter halo-forge
# 2. Install halo-forge
cd ~/projects/halo-forge && pip install -e .
# 3. Run smoke test
halo-forge test --level smoke
# 4. Start RAFT training (quick validation)
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 2 \
--output models/quick_test
That’s it. Training will begin and produce checkpoints as it progresses.
Prerequisites Checklist
Before training, ensure you have:
Hardware
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 24GB VRAM | 48GB+ (Strix Halo) |
| RAM | 32GB | 64GB+ |
| Storage | 50GB SSD | 200GB NVMe |
| Network | Stable connection | Fast for model downloads |
Software
| Platform | Requirements |
|---|---|
| Fedora | Fedora 42+, podman, toolbox |
| Ubuntu | Ubuntu 22.04+, Docker |
| Kernel | 6.16+ (supports gfx1151 without extra kernel parameters) |
Part 1: Setup
Option A: Fedora with podman toolbox
# Clone repository
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox
# Build toolbox
./build.sh --no-cache
# Create and enter
toolbox create halo-forge --image localhost/halo-forge:latest
toolbox enter halo-forge
# Install package
cd ~/projects/halo-forge
pip install -e .
Option B: Ubuntu with Docker (Experimental)
Note: Ubuntu/Docker support is experimental. Fedora toolbox is recommended for production.
# Clone repository
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox
# Build Docker image
./build-ubuntu.sh --no-cache
# (If GPU not visible) Add udev rules
sudo tee /etc/udev/rules.d/99-amd-kfd.rules >/dev/null <<'EOF'
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
# Run container
docker run -it --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined \
-v ~/projects:/workspace \
-v ~/.cache/huggingface:/root/.cache/huggingface \
halo-forge:ubuntu
# Inside container
cd /workspace/halo-forge
pip install -e .
Verify Setup
# Quick validation (5 seconds, no GPU)
halo-forge test --level smoke
# Standard validation (2-3 minutes, loads model)
halo-forge test --level standard
# Full validation (5 minutes, includes training step)
halo-forge test --level full
Expected output:
============================================================
halo-forge Standard Test
Model: Qwen/Qwen2.5-Coder-0.5B
============================================================
[OK] Import modules (0.0s)
[OK] Compiler available (0.0s)
[OK] GPU available (0.0s)
[OK] Model loading (1.2s)
[OK] Code generation (21.6s)
[OK] Code verification (0.3s)
============================================================
Test Results: 6/6 passed
============================================================
Part 2: Data Preparation
Option A: Use Built-in Sample Data (Quick Start)
halo-forge includes ready-to-use sample datasets for immediate testing. No download required:
Python Datasets (RLVR):
| Dataset | File | Examples | Use For |
|---|---|---|---|
| MBPP Training | data/rlvr/mbpp_train_prompts.jsonl | 374 | RAFT training |
| MBPP Full | data/rlvr/mbpp_train_full.jsonl | 374 | SFT training |
| MBPP Validation | data/rlvr/mbpp_validation.jsonl | 50 | Benchmarking |
| HumanEval | data/rlvr/humaneval_full.jsonl | 164 | Evaluation |
C++ Datasets (Competitive Programming):
| Dataset | File | Examples | Use For |
|---|---|---|---|
| CodeForces C++ | data/samples/codeforces_cpp_500.jsonl | 500 | Raw prompt/response pairs |
| CodeForces SFT | data/samples/codeforces_cpp_500_sft.jsonl | 500 | SFT training |
| CodeForces Prompts | data/samples/codeforces_cpp_500_prompts.jsonl | 500 | RAFT training |
Windows Systems Programming (Curriculum Learning):
| Dataset | File | Examples | Use For |
|---|---|---|---|
| Full RLVR | datasets/windows_curriculum/windows_systems_full_rlvr.jsonl | 361 | RAFT with MinGW/MSVC |
| Full SFT | datasets/windows_curriculum/windows_systems_full_sft.jsonl | 361 | SFT training |
| Tier Order | datasets/windows_curriculum/curriculum_order_full.json | - | Curriculum scheduling |
This dataset covers Windows API programming across 4 difficulty tiers:
- Tier 1 (84): Foundations - basic APIs, file I/O, registry
- Tier 2 (128): Core APIs - processes, threads, memory, IPC
- Tier 3 (72): Intermediate - PE parsing, security, services
- Tier 4 (77): Advanced - native APIs, internals, evasion
Quick test - Windows with MinGW (no Windows machine needed):
# Install MinGW cross-compiler
sudo dnf install mingw64-gcc-c++ # Fedora
# or: sudo apt install mingw-w64 # Ubuntu
# Benchmark with MinGW
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline.json
# RAFT training with MinGW
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--cycles 3 \
--output models/windows_raft
Note: MinGW can only verify compilation, not execution. For full verification (compile + run + output check), use MSVC with a Windows build server. See docs/WINDOWS_SETUP.md for setup.
Quick test - Python with MBPP:
# Start RAFT training immediately
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 2 \
--output models/quick_test_python
Quick test - C++ with CodeForces:
# SFT on CodeForces C++
halo-forge sft train \
--data data/samples/codeforces_cpp_500_sft.jsonl \
--model Qwen/Qwen2.5-Coder-0.5B \
--output models/quick_sft_cpp \
--epochs 1
# Then RAFT with GCC verification
halo-forge raft train \
--checkpoint models/quick_sft_cpp/final_model \
--prompts data/samples/codeforces_cpp_500_prompts.jsonl \
--verifier gcc \
--cycles 2 \
--output models/quick_raft_cpp
Validate Your Data
Before training, validate your dataset format:
# Check format and get statistics
halo-forge data validate data/samples/codeforces_cpp_500_sft.jsonl
# With preview of examples
halo-forge data validate data/my_dataset.jsonl --preview
Expected output:
============================================================
DATASET VALIDATION REPORT
============================================================
Status: ✓ VALID
Format: sft
Examples:
Total: 500
Valid: 500
Invalid: 0
...
Option B: Download Public Datasets
# List available datasets
halo-forge data prepare --list
# Download CodeForces C++ examples
halo-forge data prepare \
--dataset codeforces_cpp \
--output data/codeforces.jsonl
# Download MBPP Python examples
halo-forge data prepare \
--dataset mbpp \
--output data/mbpp.jsonl
Option C: Generate with LLM
# List available topics
halo-forge data generate --list
# Generate with DeepSeek (requires API key)
export DEEPSEEK_API_KEY=your_key
halo-forge data generate \
--topic python_algorithms \
--backend deepseek \
--output data/generated.jsonl
# Generate with local Ollama (free)
halo-forge data generate \
--topic rust_basics \
--backend ollama \
--model codellama:13b \
--output data/rust.jsonl
Create Prompts File
For RAFT training, you need a JSONL file with prompts:
# Extract prompts from training data
cat data/train.jsonl | python3 -c "
import json, sys
for line in sys.stdin:
d = json.loads(line)
prompt = d.get('prompt', d.get('text', ''))[:500]
if prompt:
print(json.dumps({'prompt': prompt}))
" > data/prompts.jsonl
Or create manually:
{"prompt": "Write a Python function to calculate factorial"}
{"prompt": "Implement binary search in Python"}
{"prompt": "Write a function to check if a string is palindrome"}
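If you build the prompts file by hand, a quick sanity check that every line is valid JSON with a non-empty prompt field takes only a few lines of plain Python (no halo-forge needed; the path below matches the example above, adjust it to your file):
import json

with open("data/prompts.jsonl") as f:
    for i, line in enumerate(f, 1):
        d = json.loads(line)  # raises if a line is not valid JSON
        assert d.get("prompt"), f"line {i}: missing or empty 'prompt'"
print("prompts file looks OK")
You can also run halo-forge data validate on the file, as shown in Part 2.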
Part 3: SFT Training (Optional but Recommended)
SFT (Supervised Fine-Tuning) establishes a baseline before RAFT. It is optional if you start from a pre-trained coder model, but highly recommended for domain-specific training: it teaches the model your specific code style, patterns, and requirements before RAFT refinement.
Example: Complete SFT Run with CodeForces
Here’s a complete example using CodeForces C++ data — a real competitive programming dataset:
# Step 1: Download CodeForces C++ dataset (~4000 examples)
halo-forge data prepare \
--dataset codeforces_cpp \
--output data/codeforces_cpp.jsonl
# Step 2: Run SFT training
halo-forge sft train \
--data data/codeforces_cpp.jsonl \
--model Qwen/Qwen2.5-Coder-7B \
--output models/sft_codeforces \
--epochs 2
# Step 3: Extract prompts for RAFT
cat data/codeforces_cpp.jsonl | python3 -c "
import json, sys
for line in sys.stdin:
d = json.loads(line)
if 'prompt' in d:
print(json.dumps({'prompt': d['prompt'][:2000]}))
" > data/codeforces_prompts.jsonl
# Step 4: Run RAFT with GCC verification
halo-forge raft train \
--checkpoint models/sft_codeforces/final_model \
--prompts data/codeforces_prompts.jsonl \
--verifier gcc \
--cycles 5 \
--output models/raft_codeforces
Available Public Datasets
| Dataset | Command | Language | Examples | Description |
|---|---|---|---|---|
| codeforces_cpp | --dataset codeforces_cpp | C++ | ~4000 | Competitive programming |
| codeforces_python | --dataset codeforces_python | Python | ~1000 | Competitive programming |
| codeforces_rust | --dataset codeforces_rust | Rust | ~500 | Competitive programming |
| mbpp | --dataset mbpp | Python | ~500 | Basic programming |
| humaneval | --dataset humaneval | Python | 164 | Evaluation benchmark |
Generate Custom Data with LLM
For domain-specific training, generate examples using LLMs:
# Available topics
halo-forge data generate --list
# Generate Rust async examples (requires DEEPSEEK_API_KEY)
export DEEPSEEK_API_KEY=your_key
halo-forge data generate \
--topic rust_async \
--backend deepseek \
--output data/rust_async.jsonl
# Generate C++ algorithms
halo-forge data generate \
--topic cpp_algorithms \
--backend deepseek \
--output data/cpp_algo.jsonl
# Generate with local Ollama (free, no API key)
halo-forge data generate \
--topic python_testing \
--backend ollama \
--model codellama:13b \
--output data/python_tests.jsonl
| Topic | Language | What It Generates |
|---|---|---|
| rust_async | Rust | Async/await with tokio |
| python_testing | Python | pytest examples |
| cpp_algorithms | C++ | Algorithm implementations |
| go_concurrency | Go | Goroutines, channels |
Basic SFT
halo-forge sft train \
--data data/train.jsonl \
--output models/sft \
--epochs 3
SFT with Configuration
Create configs/sft.yaml:
model:
name: Qwen/Qwen2.5-Coder-7B
trust_remote_code: true
data:
train_file: data/train.jsonl
max_seq_length: 2048
lora:
r: 16
alpha: 32
dropout: 0.05
training:
output_dir: models/sft
num_train_epochs: 3
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 2e-4
bf16: true
gradient_checkpointing: true
# Critical for Strix Halo
dataloader_num_workers: 0
dataloader_pin_memory: false
halo-forge sft train --config configs/sft.yaml
Part 4: RAFT Training
RAFT (Reward-rAnked Fine-Tuning) improves the model through iterative verification.
Understanding the RAFT Cycle
┌─────────────────────────────────────────────────────┐
│ RAFT TRAINING CYCLE │
├─────────────────────────────────────────────────────┤
│ │
│ GENERATE ──► VERIFY ──► FILTER ──► TRAIN │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ 8 samples Compile Keep top Fine-tune │
│ per prompt + Test by reward on winners │
│ │
│ ◄─────────── REPEAT 5-6 TIMES ────────────────► │
└─────────────────────────────────────────────────────┘
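The same cycle in code form — a minimal Python sketch where generate, verify, and fine_tune are placeholder callables standing in for the real components, not halo-forge APIs:
def raft_loop(generate, verify, fine_tune, prompts,
              cycles=5, samples_per_prompt=8, reward_threshold=0.5):
    """Sketch of one RAFT run: generate -> verify -> filter -> train."""
    for cycle in range(cycles):
        # GENERATE: several candidate completions per prompt
        candidates = [(p, c) for p in prompts
                      for c in generate(p, samples_per_prompt)]
        # VERIFY: score each candidate (e.g. compile + run tests)
        scored = [(p, c, verify(c)) for p, c in candidates]
        # FILTER: keep only candidates at or above the reward threshold
        kept = [(p, c) for p, c, r in scored if r >= reward_threshold]
        # TRAIN: fine-tune on the winners, then repeat with the updated model
        fine_tune(kept)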
Basic RAFT Training
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-7B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 5 \
--output models/raft
RAFT with Custom Checkpoint
# Start from your SFT model
halo-forge raft train \
--checkpoint models/sft/final_model \
--prompts data/prompts.jsonl \
--verifier gcc \
--cycles 5 \
--output models/raft
Choosing a Verifier
| Verifier | Language | Target | Compile | Run | Requires |
|---|---|---|---|---|---|
| gcc | C/C++ | Linux ELF | Yes | Yes | gcc/g++ |
| clang | C/C++ | Linux ELF | Yes | Yes | clang/clang++ |
| mingw | C/C++ | Windows PE | Yes | No | mingw-w64 |
| msvc | C/C++ | Windows PE | Yes | Yes | Windows build server |
| rust | Rust | Native | Yes | Yes | cargo |
| go | Go | Native | Yes | Yes | go |
| humaneval | Python | N/A | N/A | Yes | (built-in) |
| mbpp | Python | N/A | N/A | Yes | (built-in) |
All compilation verifiers support binary_cache_dir to save compiled binaries for later analysis.
Monitoring Progress
Watch the training output:
RAFT CYCLE 1/5
==============
Generating samples... 374 prompts × 8 samples
Verifying 2992 samples...
Passed: 1023 (34.2%)
Failed: 1969
Filtering samples...
Kept: 512 samples (top 50% above threshold)
Training on filtered samples...
Loss: 0.856 → 0.342
Saving checkpoint to models/raft/cycle_1_final/
Key Metrics to Watch:
- Pass rate: Higher is better - indicates model improvement
- Loss decrease: Should trend downward across cycles
- Kept samples: More samples = more training signal
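If you capture the run's output in a log file (for example by piping the training command through tee), the per-cycle pass rates can be extracted with a short script. This is just a convenience sketch: the regex assumes the "Passed: N (X%)" format shown above, so adjust it if your output differs.
import re

pass_rates = []
with open("raft.log") as f:
    for line in f:
        m = re.search(r"Passed:\s*\d+\s*\(([\d.]+)%\)", line)
        if m:
            pass_rates.append(float(m.group(1)))

for cycle, rate in enumerate(pass_rates, 1):
    print(f"Cycle {cycle}: {rate:.1f}% pass rate")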
TensorBoard Monitoring
Training automatically logs to TensorBoard. View training curves in real-time:
# In a separate terminal (inside toolbox)
tensorboard --logdir models/raft --port 6006
# If remote, forward the port
ssh -L 6006:localhost:6006 user@your-host
Open http://localhost:6006 in your browser to see:
- Loss curves — Training loss per step
- Learning rate — LR schedule over time
- GPU metrics — Memory and utilization (if available)
TensorBoard logs are saved to:
- SFT: models/sft/logs/
- RAFT: models/raft/cycle_N/logs/
When to Stop
Monitor pass rate across cycles:
Cycle 1: 34.2% pass rate
Cycle 2: 42.1% pass rate (+7.9%)
Cycle 3: 48.5% pass rate (+6.4%)
Cycle 4: 51.2% pass rate (+2.7%)
Cycle 5: 52.1% pass rate (+0.9%) ← Diminishing returns
Cycle 6: 51.8% pass rate (-0.3%) ← Stop here
General guidance:
- Stop when improvement < 2% per cycle
- Stop if pass rate decreases
- In our testing, 5-6 cycles often worked well
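That rule of thumb is easy to encode. The helper below is not a halo-forge feature, just a sketch that flags the first cycle whose gain falls below 2 percentage points or goes negative:
def should_stop(pass_rates, min_gain=2.0):
    """pass_rates: per-cycle pass rates in percent, oldest first."""
    if len(pass_rates) < 2:
        return False
    return pass_rates[-1] - pass_rates[-2] < min_gain

# Using the example run above:
print(should_stop([34.2, 42.1, 48.5, 51.2]))        # False (cycle 4 gained +2.7)
print(should_stop([34.2, 42.1, 48.5, 51.2, 52.1]))  # True  (cycle 5 gained only +0.9)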
Part 5: Benchmarking
Run Benchmark
halo-forge benchmark run \
--model models/raft/cycle_5_final \
--prompts data/rlvr/mbpp_validation.jsonl \
--verifier mbpp \
--samples 10 \
--k 1,5,10 \
--output results/benchmark.json
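For reference, pass@k here is the standard unbiased estimator from the HumanEval paper: with n samples generated per prompt and c of them passing verification, pass@k = 1 - C(n-c, k) / C(n, k), averaged over prompts. The function below is just that formula for illustration, not halo-forge's internal code; the reported values land in the results JSON under pass_at_k.
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated, c passed, evaluated at k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per prompt, 3 passed the verifier:
print(pass_at_k(10, 3, 1))   # ~0.30
print(pass_at_k(10, 3, 5))   # ~0.92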
Compare Models
# Benchmark baseline
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-7B \
--prompts data/rlvr/mbpp_validation.jsonl \
--output results/baseline.json
# Benchmark RAFT model
halo-forge benchmark run \
--model models/raft/cycle_5_final \
--prompts data/rlvr/mbpp_validation.jsonl \
--output results/raft.json
# Compare
python3 -c "
import json
for name in ['baseline', 'raft']:
with open(f'results/{name}.json') as f:
data = json.load(f)
print(f'{name}: pass@1={data[\"pass_at_k\"][\"1\"]:.1%}')
"
Full Benchmark Suite
# Run full benchmark with before/after comparison
halo-forge benchmark full \
--model Qwen/Qwen2.5-Coder-7B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 5 \
--output results/full_benchmark
Part 6: Advanced Topics
Filtering Strategies
Control which samples are used for training:
# Keep top 50% of samples above 0.5 reward (default)
halo-forge raft train --keep-percent 0.5 --reward-threshold 0.5 ...
# Selective: top 20% only (large datasets)
halo-forge raft train --keep-percent 0.2 --reward-threshold 0.5 ...
# Inclusive: keep all passing (small datasets)
halo-forge raft train --keep-percent 1.0 --reward-threshold 0.3 ...
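Conceptually, filtering is a two-step cut: first drop samples below the reward threshold, then keep only the top fraction of the survivors ranked by reward. A minimal sketch of that logic (not the halo-forge implementation):
def filter_samples(scored, reward_threshold=0.5, keep_percent=0.5):
    """scored: list of (prompt, completion, reward) tuples."""
    # Step 1: drop everything below the reward threshold
    passing = [s for s in scored if s[2] >= reward_threshold]
    if not passing:
        return []
    # Step 2: keep the top keep_percent of the survivors, ranked by reward
    passing.sort(key=lambda s: s[2], reverse=True)
    n_keep = max(1, int(len(passing) * keep_percent))
    return passing[:n_keep]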
Curriculum Learning
Increase difficulty over cycles:
# configs/curriculum.yaml
curriculum_strategy: progressive
cycles:
- verifier: gcc
reward_threshold: 0.3
- verifier: gcc
reward_threshold: 0.5
- verifier: gcc
run_after_compile: true
reward_threshold: 0.7
Custom Verifier
from halo_forge.rlvr.verifiers import Verifier, VerifyResult

class MyVerifier(Verifier):
    def verify(self, code: str) -> VerifyResult:
        # Your verification logic
        success = your_check(code)
        return VerifyResult(
            success=success,
            reward=1.0 if success else 0.0,
            details="Custom verification"
        )
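As a concrete (if deliberately simple) illustration of the same interface, here is a verifier that only checks whether the generated Python parses; the binary 0/1 reward scheme is just an example:
import ast
from halo_forge.rlvr.verifiers import Verifier, VerifyResult

class ParseCheckVerifier(Verifier):
    """Reward 1.0 if the generated code is syntactically valid Python."""
    def verify(self, code: str) -> VerifyResult:
        try:
            ast.parse(code)
            return VerifyResult(success=True, reward=1.0,
                                details="Parsed successfully")
        except SyntaxError as exc:
            return VerifyResult(success=False, reward=0.0,
                                details=f"SyntaxError: {exc}")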
Hyperparameter Tuning
| Parameter | Default | Tuning Notes |
|---|---|---|
| samples_per_prompt | 8 | More = better diversity, slower |
| temperature | 0.7 | Higher = more diverse, lower quality |
| reward_threshold | 0.5 | Higher = stricter filtering |
| keep_top_percent | 0.5 | Lower = more selective |
| learning_rate | 5e-5 | Lower if unstable |
Memory Optimization (Strix Halo)
For unified memory systems:
training:
batch_size: 2
gradient_accumulation: 16
bf16: true # NOT 4-bit (slower on Strix Halo)
gradient_checkpointing: true
# Critical for unified memory
dataloader_num_workers: 0
dataloader_pin_memory: false
Troubleshooting
Low Pass Rate
Symptoms: <20% pass rate, many syntax errors
Solutions:
- Check prompt quality - are they asking for complete code?
- Lower temperature for more consistent output
- Add few-shot examples to prompts
- Run SFT first to establish baseline
Training Loss Increasing
Symptoms: Loss goes up after cycle 4-5
Solutions:
- Stop training - you’ve peaked
- Lower learning rate
- Increase reward_threshold for stricter filtering
- Try learning rate decay
GPU Hang
Symptoms: Training freezes, GPU unresponsive
Solutions:
- Ensure dataloader_num_workers: 0
- Ensure dataloader_pin_memory: false
- Add export HSA_ENABLE_SDMA=0
Out of Memory
Symptoms: CUDA/ROCm OOM errors
Solutions:
- Reduce batch_size
- Enable gradient_checkpointing
- Use a smaller model
- Reduce max_seq_length
Automatic Resume
RAFT automatically caches progress. If a run crashes:
# Just re-run the same command
halo-forge raft train --cycles 5 --output models/raft
# Output:
# Cycle 1 already complete, skipping...
# Cycle 2 already complete, skipping...
# Loading cached samples... (resumes cycle 3)
See Troubleshooting for more solutions.
Next Steps
- RAFT Details — Deep dive into RAFT algorithm
- Verifiers — All verifier options
- Configuration — Complete config reference
- Theory & Research — Research foundations