How to Train
Complete guide to training code generation models with halo-forge
This guide walks you through training a code generation model end to end using RAFT. Start with the quick start for immediate results, then explore the advanced sections for optimization.
TL;DR - Quick Start (10 minutes)
Already have the toolbox built? Run training immediately:
# 1. Enter the toolbox
toolbox enter halo-forge
# 2. Install halo-forge
cd ~/projects/halo-forge && pip install -e .
# 3. Run smoke test
halo-forge test --level smoke
# 4. Start RAFT training (quick validation)
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 2 \
--output models/quick_test
That’s it. Training will begin and produce checkpoints as it progresses.
Prerequisites Checklist
Before training, ensure you have:
Hardware
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 24GB VRAM | 48GB+ (Strix Halo) |
| RAM | 32GB | 64GB+ |
| Storage | 50GB SSD | 200GB NVMe |
| Network | Stable connection | Fast for model downloads |
Software
| Platform | Requirements |
|---|---|
| Fedora | Fedora 42+, podman, toolbox |
| Ubuntu | Ubuntu 22.04+, Docker |
| Kernel | 6.16+ (supports gfx1151 without extra kernel parameters) |
Part 1: Setup
Option A: Fedora with podman toolbox
# Clone repository
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox
# Build toolbox
./build.sh --no-cache
# Create and enter
toolbox create halo-forge --image localhost/halo-forge:latest
toolbox enter halo-forge
# Install package
cd ~/projects/halo-forge
pip install -e .
Option B: Ubuntu with Docker (Experimental)
Note: Ubuntu/Docker support is experimental. Fedora toolbox is recommended for production.
# Clone repository
git clone https://github.com/professor-moody/halo-forge.git
cd halo-forge/toolbox
# Build Docker image
./build-ubuntu.sh --no-cache
# (If GPU not visible) Add udev rules
sudo tee /etc/udev/rules.d/99-amd-kfd.rules >/dev/null <<'EOF'
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
# Run container
docker run -it --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined \
-v ~/projects:/workspace \
-v ~/.cache/huggingface:/root/.cache/huggingface \
halo-forge:ubuntu
# Inside container
cd /workspace/halo-forge
pip install -e .
Verify Setup
# Quick validation (5 seconds, no GPU)
halo-forge test --level smoke
# Standard validation (2-3 minutes, loads model)
halo-forge test --level standard
# Full validation (5 minutes, includes training step)
halo-forge test --level full
Expected output:
============================================================
halo-forge Standard Test
Model: Qwen/Qwen2.5-Coder-0.5B
============================================================
[OK] Import modules (0.0s)
[OK] Compiler available (0.0s)
[OK] GPU available (0.0s)
[OK] Model loading (1.2s)
[OK] Code generation (21.6s)
[OK] Code verification (0.3s)
============================================================
Test Results: 6/6 passed
============================================================
Part 2: Data Preparation
Option A: Use Built-in Sample Data (Quick Start)
halo-forge includes ready-to-use sample datasets for immediate testing. No download required:
Python Datasets (RLVR):
| Dataset | File | Examples | Use For |
|---|---|---|---|
| MBPP Training | data/rlvr/mbpp_train_prompts.jsonl | 374 | RAFT training |
| MBPP Full | data/rlvr/mbpp_train_full.jsonl | 374 | SFT training |
| MBPP Validation | data/rlvr/mbpp_validation.jsonl | 50 | Benchmarking |
| HumanEval | data/rlvr/humaneval_full.jsonl | 164 | Evaluation |
C++ Datasets (Competitive Programming):
| Dataset | File | Examples | Use For |
|---|---|---|---|
| CodeForces C++ | data/samples/codeforces_cpp_500.jsonl | 500 | Raw prompt/response pairs |
| CodeForces SFT | data/samples/codeforces_cpp_500_sft.jsonl | 500 | SFT training |
| CodeForces Prompts | data/samples/codeforces_cpp_500_prompts.jsonl | 500 | RAFT training |
Windows Systems Programming (Curriculum Learning):
| Dataset | File | Examples | Use For |
|---|---|---|---|
| Full RLVR | datasets/windows_curriculum/windows_systems_full_rlvr.jsonl | 361 | RAFT with MinGW/MSVC |
| Full SFT | datasets/windows_curriculum/windows_systems_full_sft.jsonl | 361 | SFT training |
| Tier Order | datasets/windows_curriculum/curriculum_order_full.json | - | Curriculum scheduling |
This dataset covers Windows API programming across 4 difficulty tiers:
- Tier 1 (84): Foundations - basic APIs, file I/O, registry
- Tier 2 (128): Core APIs - processes, threads, memory, IPC
- Tier 3 (72): Intermediate - PE parsing, security, services
- Tier 4 (77): Advanced - native APIs, internals, evasion
Quick test - Windows with MinGW (no Windows machine needed):
# Install MinGW cross-compiler
sudo dnf install mingw64-gcc-c++ # Fedora
# or: sudo apt install mingw-w64 # Ubuntu
# Benchmark with MinGW
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline.json
# RAFT training with MinGW
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--cycles 3 \
--output models/windows_raft
Note: MinGW can only verify compilation, not execution. For full verification (compile + run + output check), use MSVC with a Windows build server. See docs/WINDOWS_SETUP.md for setup.
Quick test - Python with MBPP:
# Start RAFT training immediately
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 2 \
--output models/quick_test_python
Quick test - C++ with CodeForces:
# SFT on CodeForces C++
halo-forge sft train \
--data data/samples/codeforces_cpp_500_sft.jsonl \
--model Qwen/Qwen2.5-Coder-0.5B \
--output models/quick_sft_cpp \
--epochs 1
# Then RAFT with GCC verification
halo-forge raft train \
--checkpoint models/quick_sft_cpp/final_model \
--prompts data/samples/codeforces_cpp_500_prompts.jsonl \
--verifier gcc \
--cycles 2 \
--output models/quick_raft_cpp
Validate Your Data
Before training, validate your dataset format:
# Check format and get statistics
halo-forge data validate data/samples/codeforces_cpp_500_sft.jsonl
# With preview of examples
halo-forge data validate data/my_dataset.jsonl --preview
Expected output:
============================================================
DATASET VALIDATION REPORT
============================================================
Status: ✓ VALID
Format: sft
Examples:
Total: 500
Valid: 500
Invalid: 0
...
Option B: Download Public Datasets
# List available datasets
halo-forge data prepare --list
# Download CodeForces C++ examples
halo-forge data prepare \
--dataset codeforces_cpp \
--output data/codeforces.jsonl
# Download MBPP Python examples
halo-forge data prepare \
--dataset mbpp \
--output data/mbpp.jsonl
Option C: Generate with LLM
# List available topics
halo-forge data generate --list
# Generate with DeepSeek (requires API key)
export DEEPSEEK_API_KEY=your_key
halo-forge data generate \
--topic python_algorithms \
--backend deepseek \
--output data/generated.jsonl
# Generate with local Ollama (free)
halo-forge data generate \
--topic rust_basics \
--backend ollama \
--model codellama:13b \
--output data/rust.jsonl
Create Prompts File
For RAFT training, you need a JSONL file with prompts:
# Extract prompts from training data
cat data/train.jsonl | python3 -c "
import json, sys
for line in sys.stdin:
d = json.loads(line)
prompt = d.get('prompt', d.get('text', ''))[:500]
if prompt:
print(json.dumps({'prompt': prompt}))
" > data/prompts.jsonl
Or create manually:
{"prompt": "Write a Python function to calculate factorial"}
{"prompt": "Implement binary search in Python"}
{"prompt": "Write a function to check if a string is palindrome"}
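If you build the prompts file by hand, a quick sanity check that every line is valid JSON with a non-empty prompt field takes only a few lines of plain Python (no halo-forge needed; the path below matches the example above, adjust it to your file):
import json

with open("data/prompts.jsonl") as f:
    for i, line in enumerate(f, 1):
        d = json.loads(line)  # raises if a line is not valid JSON
        assert d.get("prompt"), f"line {i}: missing or empty 'prompt'"
print("prompts file looks OK")
You can also run halo-forge data validate on the file, as shown in Part 2.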
Part 3: SFT Training (Optional but Recommended)
SFT (Supervised Fine-Tuning) establishes a baseline before RAFT. It is optional if you start from a pre-trained coder model, but highly recommended for domain-specific training: it teaches the model your specific code style, patterns, and requirements before RAFT refinement.
Example: Complete SFT Run with CodeForces
Here’s a complete example using CodeForces C++ data — a real competitive programming dataset:
# Step 1: Download CodeForces C++ dataset (~4000 examples)
halo-forge data prepare \
--dataset codeforces_cpp \
--output data/codeforces_cpp.jsonl
# Step 2: Run SFT training
halo-forge sft train \
--data data/codeforces_cpp.jsonl \
--model Qwen/Qwen2.5-Coder-7B \
--output models/sft_codeforces \
--epochs 2
# Step 3: Extract prompts for RAFT
cat data/codeforces_cpp.jsonl | python3 -c "
import json, sys
for line in sys.stdin:
d = json.loads(line)
if 'prompt' in d:
print(json.dumps({'prompt': d['prompt'][:2000]}))
" > data/codeforces_prompts.jsonl
# Step 4: Run RAFT with GCC verification
halo-forge raft train \
--checkpoint models/sft_codeforces/final_model \
--prompts data/codeforces_prompts.jsonl \
--verifier gcc \
--cycles 5 \
--output models/raft_codeforces
Available Public Datasets
| Dataset | Command | Language | Examples | Description |
|---|---|---|---|---|
| codeforces_cpp | --dataset codeforces_cpp | C++ | ~4000 | Competitive programming |
| codeforces_python | --dataset codeforces_python | Python | ~1000 | Competitive programming |
| codeforces_rust | --dataset codeforces_rust | Rust | ~500 | Competitive programming |
| mbpp | --dataset mbpp | Python | ~500 | Basic programming |
| humaneval | --dataset humaneval | Python | 164 | Evaluation benchmark |
Generate Custom Data with LLM
For domain-specific training, generate examples using LLMs:
# Available topics
halo-forge data generate --list
# Generate Rust async examples (requires DEEPSEEK_API_KEY)
export DEEPSEEK_API_KEY=your_key
halo-forge data generate \
--topic rust_async \
--backend deepseek \
--output data/rust_async.jsonl
# Generate C++ algorithms
halo-forge data generate \
--topic cpp_algorithms \
--backend deepseek \
--output data/cpp_algo.jsonl
# Generate with local Ollama (free, no API key)
halo-forge data generate \
--topic python_testing \
--backend ollama \
--model codellama:13b \
--output data/python_tests.jsonl
| Topic | Language | What It Generates |
|---|---|---|
| rust_async | Rust | Async/await with tokio |
| python_testing | Python | pytest examples |
| cpp_algorithms | C++ | Algorithm implementations |
| go_concurrency | Go | Goroutines, channels |
Basic SFT
halo-forge sft train \
--data data/train.jsonl \
--output models/sft \
--epochs 3
SFT with Configuration
Create configs/sft.yaml:
model:
name: Qwen/Qwen2.5-Coder-7B
trust_remote_code: true
data:
train_file: data/train.jsonl
max_seq_length: 2048
lora:
r: 16
alpha: 32
dropout: 0.05
training:
output_dir: models/sft
num_train_epochs: 3
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 2e-4
bf16: true
gradient_checkpointing: true
# Critical for Strix Halo
dataloader_num_workers: 0
dataloader_pin_memory: false
halo-forge sft train --config configs/sft.yaml
Part 4: RAFT Training
RAFT (Reward-rAnked Fine-Tuning) improves the model through iterative verification.
Understanding the RAFT Cycle
┌─────────────────────────────────────────────────────┐
│ RAFT TRAINING CYCLE │
├─────────────────────────────────────────────────────┤
│ │
│ GENERATE ──► VERIFY ──► FILTER ──► TRAIN │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ 8 samples Compile Keep top Fine-tune │
│ per prompt + Test by reward on winners │
│ │
│ ◄─────────── REPEAT 5-6 TIMES ────────────────► │
└─────────────────────────────────────────────────────┘
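The same cycle in code form — a minimal Python sketch where generate, verify, and fine_tune are placeholder callables standing in for the real components, not halo-forge APIs:
def raft_loop(generate, verify, fine_tune, prompts,
              cycles=5, samples_per_prompt=8, reward_threshold=0.5):
    """Sketch of one RAFT run: generate -> verify -> filter -> train."""
    for cycle in range(cycles):
        # GENERATE: several candidate completions per prompt
        candidates = [(p, c) for p in prompts
                      for c in generate(p, samples_per_prompt)]
        # VERIFY: score each candidate (e.g. compile + run tests)
        scored = [(p, c, verify(c)) for p, c in candidates]
        # FILTER: keep only candidates at or above the reward threshold
        kept = [(p, c) for p, c, r in scored if r >= reward_threshold]
        # TRAIN: fine-tune on the winners, then repeat with the updated model
        fine_tune(kept)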
Basic RAFT Training
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-7B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 5 \
--output models/raft
RAFT with Custom Checkpoint
# Start from your SFT model
halo-forge raft train \
--checkpoint models/sft/final_model \
--prompts data/prompts.jsonl \
--verifier gcc \
--cycles 5 \
--output models/raft
Choosing a Verifier
| Verifier | Language | Target | Compile | Run | Requires |
|---|---|---|---|---|---|
| gcc | C/C++ | Linux ELF | Yes | Yes | gcc/g++ |
| clang | C/C++ | Linux ELF | Yes | Yes | clang/clang++ |
| mingw | C/C++ | Windows PE | Yes | No | mingw-w64 |
| msvc | C/C++ | Windows PE | Yes | Yes | Windows build server |
| rust | Rust | Native | Yes | Yes | cargo |
| go | Go | Native | Yes | Yes | go |
| humaneval | Python | N/A | N/A | Yes | (built-in) |
| mbpp | Python | N/A | N/A | Yes | (built-in) |
All compilation verifiers support binary_cache_dir to save compiled binaries for later analysis.
Monitoring Progress
Watch the training output:
RAFT CYCLE 1/5
==============
Generating samples... 374 prompts × 8 samples
Verifying 2992 samples...
Passed: 1023 (34.2%)
Failed: 1969
Filtering samples...
Kept: 512 samples (top 50% above threshold)
Training on filtered samples...
Loss: 0.856 → 0.342
Saving checkpoint to models/raft/cycle_1_final/
Key Metrics to Watch:
- Pass rate: Higher is better - indicates model improvement
- Loss decrease: Should trend downward across cycles
- Kept samples: More samples = more training signal
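If you capture the run's output in a log file (for example by piping the training command through tee), the per-cycle pass rates can be extracted with a short script. This is just a convenience sketch: the regex assumes the "Passed: N (X%)" format shown above, so adjust it if your output differs.
import re

pass_rates = []
with open("raft.log") as f:
    for line in f:
        m = re.search(r"Passed:\s*\d+\s*\(([\d.]+)%\)", line)
        if m:
            pass_rates.append(float(m.group(1)))

for cycle, rate in enumerate(pass_rates, 1):
    print(f"Cycle {cycle}: {rate:.1f}% pass rate")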
TensorBoard Monitoring
Training automatically logs to TensorBoard. View training curves in real-time:
# In a separate terminal (inside toolbox)
tensorboard --logdir models/raft --port 6006
# If remote, forward the port
ssh -L 6006:localhost:6006 user@your-host
Open http://localhost:6006 in your browser to see:
- Loss curves — Training loss per step
- Learning rate — LR schedule over time
- GPU metrics — Memory and utilization (if available)
TensorBoard logs are saved to:
- SFT: models/sft/logs/
- RAFT: models/raft/cycle_N/logs/
When to Stop
Monitor pass rate across cycles:
Cycle 1: 34.2% pass rate
Cycle 2: 42.1% pass rate (+7.9%)
Cycle 3: 48.5% pass rate (+6.4%)
Cycle 4: 51.2% pass rate (+2.7%)
Cycle 5: 52.1% pass rate (+0.9%) ← Diminishing returns
Cycle 6: 51.8% pass rate (-0.3%) ← Stop here
General guidance:
- Stop when improvement < 2% per cycle
- Stop if pass rate decreases
- In our testing, 5-6 cycles often worked well
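That rule of thumb is easy to encode. The helper below is not a halo-forge feature, just a sketch that flags the first cycle whose gain falls below 2 percentage points or goes negative:
def should_stop(pass_rates, min_gain=2.0):
    """pass_rates: per-cycle pass rates in percent, oldest first."""
    if len(pass_rates) < 2:
        return False
    return pass_rates[-1] - pass_rates[-2] < min_gain

# Using the example run above:
print(should_stop([34.2, 42.1, 48.5, 51.2]))        # False (cycle 4 gained +2.7)
print(should_stop([34.2, 42.1, 48.5, 51.2, 52.1]))  # True  (cycle 5 gained only +0.9)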
Part 5: Benchmarking
Run Benchmark
halo-forge benchmark run \
--model models/raft/cycle_5_final \
--prompts data/rlvr/mbpp_validation.jsonl \
--verifier mbpp \
--samples 10 \
--k 1,5,10 \
--output results/benchmark.json
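For reference, pass@k here is the standard unbiased estimator from the HumanEval paper: with n samples generated per prompt and c of them passing verification, pass@k = 1 - C(n-c, k) / C(n, k), averaged over prompts. The function below is just that formula for illustration, not halo-forge's internal code; the reported values land in the results JSON under pass_at_k.
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated, c passed, evaluated at k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per prompt, 3 passed the verifier:
print(pass_at_k(10, 3, 1))   # ~0.30
print(pass_at_k(10, 3, 5))   # ~0.92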
Compare Models
# Benchmark baseline
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-7B \
--prompts data/rlvr/mbpp_validation.jsonl \
--output results/baseline.json
# Benchmark RAFT model
halo-forge benchmark run \
--model models/raft/cycle_5_final \
--prompts data/rlvr/mbpp_validation.jsonl \
--output results/raft.json
# Compare
python3 -c "
import json
for name in ['baseline', 'raft']:
with open(f'results/{name}.json') as f:
data = json.load(f)
print(f'{name}: pass@1={data[\"pass_at_k\"][\"1\"]:.1%}')
"
Full Benchmark Suite
# Run full benchmark with before/after comparison
halo-forge benchmark full \
--model Qwen/Qwen2.5-Coder-7B \
--prompts data/rlvr/mbpp_train_prompts.jsonl \
--verifier mbpp \
--cycles 5 \
--output results/full_benchmark
Part 6: Advanced Topics
Filtering Strategies
Control which samples are used for training:
# Keep top 50% of samples above 0.5 reward (default)
halo-forge raft train --keep-percent 0.5 --reward-threshold 0.5 ...
# Selective: top 20% only (large datasets)
halo-forge raft train --keep-percent 0.2 --reward-threshold 0.5 ...
# Inclusive: keep all passing (small datasets)
halo-forge raft train --keep-percent 1.0 --reward-threshold 0.3 ...
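Conceptually, filtering is a two-step cut: first drop samples below the reward threshold, then keep only the top fraction of the survivors ranked by reward. A minimal sketch of that logic (not the halo-forge implementation):
def filter_samples(scored, reward_threshold=0.5, keep_percent=0.5):
    """scored: list of (prompt, completion, reward) tuples."""
    # Step 1: drop everything below the reward threshold
    passing = [s for s in scored if s[2] >= reward_threshold]
    if not passing:
        return []
    # Step 2: keep the top keep_percent of the survivors, ranked by reward
    passing.sort(key=lambda s: s[2], reverse=True)
    n_keep = max(1, int(len(passing) * keep_percent))
    return passing[:n_keep]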
Curriculum Learning
Increase difficulty over cycles:
# configs/curriculum.yaml
curriculum_strategy: progressive
cycles:
- verifier: gcc
reward_threshold: 0.3
- verifier: gcc
reward_threshold: 0.5
- verifier: gcc
run_after_compile: true
reward_threshold: 0.7
Custom Verifier
from halo_forge.rlvr.verifiers import Verifier, VerifyResult

class MyVerifier(Verifier):
    def verify(self, code: str) -> VerifyResult:
        # Your verification logic
        success = your_check(code)
        return VerifyResult(
            success=success,
            reward=1.0 if success else 0.0,
            details="Custom verification"
        )
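As a concrete (if deliberately simple) illustration of the same interface, here is a verifier that only checks whether the generated Python parses; the binary 0/1 reward scheme is just an example:
import ast
from halo_forge.rlvr.verifiers import Verifier, VerifyResult

class ParseCheckVerifier(Verifier):
    """Reward 1.0 if the generated code is syntactically valid Python."""
    def verify(self, code: str) -> VerifyResult:
        try:
            ast.parse(code)
            return VerifyResult(success=True, reward=1.0,
                                details="Parsed successfully")
        except SyntaxError as exc:
            return VerifyResult(success=False, reward=0.0,
                                details=f"SyntaxError: {exc}")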
Hyperparameter Tuning
| Parameter | Default | Tuning Notes |
|---|---|---|
| samples_per_prompt | 8 | More = better diversity, slower |
| temperature | 0.7 | Higher = more diverse, lower quality |
| reward_threshold | 0.5 | Higher = stricter filtering |
| keep_top_percent | 0.5 | Lower = more selective |
| learning_rate | 5e-5 | Lower if unstable |
Memory Optimization (Strix Halo)
For unified memory systems:
training:
batch_size: 2
gradient_accumulation: 16
bf16: true # NOT 4-bit (slower on Strix Halo)
gradient_checkpointing: true
# Critical for unified memory
dataloader_num_workers: 0
dataloader_pin_memory: false
Troubleshooting
Low Pass Rate
Symptoms: <20% pass rate, many syntax errors
Solutions:
- Check prompt quality - are they asking for complete code?
- Lower temperature for more consistent output
- Add few-shot examples to prompts
- Run SFT first to establish baseline
Training Loss Increasing
Symptoms: Loss goes up after cycle 4-5
Solutions:
- Stop training - you’ve peaked
- Lower learning rate
- Increase reward_threshold for stricter filtering
- Try learning rate decay
GPU Hang
Symptoms: Training freezes, GPU unresponsive
Solutions:
- Ensure dataloader_num_workers: 0
- Ensure dataloader_pin_memory: false
- Add export HSA_ENABLE_SDMA=0
Out of Memory
Symptoms: CUDA/ROCm OOM errors
Solutions:
- Reduce batch_size
- Enable gradient_checkpointing
- Use a smaller model
- Reduce max_seq_length
Automatic Resume
RAFT automatically caches progress. If a run crashes:
# Just re-run the same command
halo-forge raft train --cycles 5 --output models/raft
# Output:
# Cycle 1 already complete, skipping...
# Cycle 2 already complete, skipping...
# Loading cached samples... (resumes cycle 3)
See Troubleshooting for more solutions.
Next Steps
- RAFT Details — Deep dive into RAFT algorithm
- Verifiers — All verifier options
- Configuration — Complete config reference
- Theory & Research — Research foundations