Production Training Runs
Step-by-step commands for training all model sizes on the Windows Systems Programming dataset
This guide provides copy-paste commands for production training on every supported model size (0.5B to 7B) using the Windows Systems Programming dataset. Start with the 0.5B model to validate your setup, then scale up.
Quick Reference
| Model | SFT Time | RAFT Time | Total |
|---|---|---|---|
| Qwen2.5-Coder-0.5B | ~30 min | ~1 hour | ~1.5 hours |
| Qwen2.5-Coder-1.5B | ~1 hour | ~2 hours | ~3 hours |
| Qwen2.5-Coder-3B | ~1.5 hours | ~3 hours | ~4.5 hours |
| Qwen2.5-Coder-7B | ~2.5 hours | ~5 hours | ~7.5 hours |
Dataset: Windows Curriculum (361 problems, MinGW/MSVC compatible)
Default Verifier: MinGW (no Windows machine required)
Pre-Flight Checklist
Before starting, verify your environment is ready.
Fedora (Toolbox)
# 1. Check toolbox exists
toolbox list | grep halo-forge
# 2. Check dataset exists
ls -la ~/projects/halo-forge/datasets/windows_curriculum/windows_systems_full_*.jsonl
# 3. Check disk space (need ~50GB free)
df -h ~
# 4. Check GPU
rocm-smi
# 5. Check MinGW is installed
x86_64-w64-mingw32-g++ --version
Ubuntu/Docker
# 1. Check image exists
docker images | grep halo-forge
# or: podman images | grep halo-forge
# 2-5. Same checks as Fedora (run from host before entering container)
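The individual checks above can be bundled into one helper that reports every missing tool at once. This is a minimal sketch, not part of halo-forge: the function name `missing_cmds` is hypothetical, and it only confirms that each command is on PATH, not that the GPU, toolbox, or dataset is healthy.

```shell
# Hypothetical helper (not part of halo-forge): print each command
# from the argument list that is not found on PATH, one per line.
missing_cmds() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Example: tools this guide relies on in the Fedora toolbox setup.
# An empty result means every command was found.
# missing_cmds toolbox rocm-smi x86_64-w64-mingw32-g++ screen jq
```

Running it before a long training session is cheaper than discovering a missing compiler partway through a RAFT cycle.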
Verifier Options
The Windows dataset can be verified with two different compilers:
| Verifier | Platform | Compiles | Executes | Requires |
|---|---|---|---|---|
| mingw | Linux | Yes | No | mingw-w64 package |
| msvc | Remote | Yes | Optional | Windows build server |
MinGW (Recommended for Getting Started)
Use MinGW for quick compile-only verification on Linux. No Windows machine required.
# Install MinGW (Fedora)
sudo dnf install mingw64-gcc-c++
# Install MinGW (Ubuntu)
sudo apt install mingw-w64
# Quick benchmark with MinGW
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline_mingw.json
Limitation: MinGW can only verify that code compiles. It cannot run the executables.
MSVC (Full Verification)
Use MSVC for full compile + run + output verification. Requires a Windows build server.
See the Windows Build Server guide for server configuration.
| Scenario | Recommended Verifier |
|---|---|
| Getting started / no Windows available | mingw (default) |
| Debugging compile issues | mingw (faster iteration) |
| Full output verification | msvc (with run_after_compile=True) |
| Production training (advanced) | msvc |
Recommendation: Start with MinGW. It’s simpler and doesn’t require Windows setup.
Initial Setup
Run these commands once to prepare your environment.
Fedora (Toolbox)
# Enter toolbox
toolbox enter halo-forge
# Navigate to project
cd ~/projects/halo-forge
# Install halo-forge
pip install -e .
# Create results directory
mkdir -p results/windows
Ubuntu/Docker
# From the halo-forge directory on your host:
cd /path/to/halo-forge
# Run container with GPU access (Docker)
docker run -it --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined \
-v "$(pwd)":/workspace/halo-forge \
halo-forge:ubuntu
# Or with podman (rootless, recommended on Fedora host):
podman run -it --userns=keep-id \
--device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined \
-v "$(pwd)":/workspace/halo-forge:Z \
halo-forge:ubuntu
# Inside container:
cd /workspace/halo-forge
pip install -e .
mkdir -p results/windows
Once inside the container, all halo-forge commands work identically on both platforms.
Training Pattern
Each model follows the same four-step pattern:
1. Baseline - Measure the untrained model’s performance
2. SFT - Supervised fine-tuning on solutions (optional but recommended)
3. RAFT - Reinforcement learning with compile verification
4. Benchmark - Measure the trained model’s performance
All commands below use --verifier mingw by default. Replace with --verifier msvc if you have a Windows build server configured.
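Because only the model size changes between sections, the pattern can be sketched as a dry-run helper that prints the four commands for a given size so you can review them before running anything. The function name `print_pipeline` is hypothetical. The flags and paths are copied from the commands in this guide. Note that the 7B SFT step in this guide uses `--config configs/production_7b.yaml` instead of `--model`, so the printed SFT line would need adjusting for that size.

```shell
# Hypothetical dry-run helper: prints (does not execute) the four
# commands of the training pattern for one model size, e.g. "0.5B".
print_pipeline() {
    size="$1"                                                # e.g. 0.5B, 1.5B, 3B
    tag=$(printf '%s' "$size" | tr '[:upper:]' '[:lower:]')  # e.g. 0.5b
    model="Qwen/Qwen2.5-Coder-${size}"
    data="datasets/windows_curriculum"
    prompts="${data}/windows_systems_full_rlvr.jsonl"

    # 1. Baseline benchmark
    echo "halo-forge benchmark run --model ${model} --prompts ${prompts} --verifier mingw --samples 10 --output results/windows/baseline_${tag}.json"
    # 2. SFT
    echo "halo-forge sft train --data ${data}/windows_systems_full_sft.jsonl --model ${model} --output models/windows_sft_${tag} --epochs 2"
    # 3. RAFT, starting from the SFT checkpoint
    echo "halo-forge raft train --prompts ${prompts} --verifier mingw --model ${model} --checkpoint models/windows_sft_${tag}/final_model --cycles 6 --output models/windows_raft_${tag}"
    # 4. Final benchmark on the last RAFT cycle
    echo "halo-forge benchmark run --model models/windows_raft_${tag}/cycle_6_final --prompts ${prompts} --verifier mingw --samples 10 --k 1,5,10 --output results/windows/trained_${tag}.json"
}

print_pipeline 0.5B
```

Printing instead of executing keeps the helper safe to run anywhere and makes it easy to paste each line into its own screen session.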
0.5B Model (Start Here)
The smallest model - use this to validate your setup before scaling up.
Baseline Benchmark
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-0.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline_0.5b.json
SFT Training (~30 min)
screen -S sft_0.5b
halo-forge sft train \
--data datasets/windows_curriculum/windows_systems_full_sft.jsonl \
--model Qwen/Qwen2.5-Coder-0.5B \
--output models/windows_sft_0.5b \
--epochs 2
# Detach: Ctrl+A, D
RAFT Training (~1 hour)
screen -S raft_0.5b
halo-forge raft train \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--model Qwen/Qwen2.5-Coder-0.5B \
--checkpoint models/windows_sft_0.5b/final_model \
--cycles 6 \
--output models/windows_raft_0.5b
# Detach: Ctrl+A, D
Final Benchmark
halo-forge benchmark run \
--model models/windows_raft_0.5b/cycle_6_final \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--k 1,5,10 \
--output results/windows/trained_0.5b.json
Compare Results
echo "=== 0.5B Results ==="
echo "Baseline:" && jq '.pass_at_k' results/windows/baseline_0.5b.json
echo "Trained:" && jq '.pass_at_k' results/windows/trained_0.5b.json
1.5B Model
Baseline Benchmark
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-1.5B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline_1.5b.json
SFT Training (~1 hour)
screen -S sft_1.5b
halo-forge sft train \
--data datasets/windows_curriculum/windows_systems_full_sft.jsonl \
--model Qwen/Qwen2.5-Coder-1.5B \
--output models/windows_sft_1.5b \
--epochs 2
# Detach: Ctrl+A, D
RAFT Training (~2 hours)
screen -S raft_1.5b
halo-forge raft train \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--model Qwen/Qwen2.5-Coder-1.5B \
--checkpoint models/windows_sft_1.5b/final_model \
--cycles 6 \
--output models/windows_raft_1.5b
# Detach: Ctrl+A, D
Final Benchmark
halo-forge benchmark run \
--model models/windows_raft_1.5b/cycle_6_final \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--k 1,5,10 \
--output results/windows/trained_1.5b.json
Compare Results
echo "=== 1.5B Results ==="
echo "Baseline:" && jq '.pass_at_k' results/windows/baseline_1.5b.json
echo "Trained:" && jq '.pass_at_k' results/windows/trained_1.5b.json
3B Model
Baseline Benchmark
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-3B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline_3b.json
SFT Training (~1.5 hours)
screen -S sft_3b
halo-forge sft train \
--data datasets/windows_curriculum/windows_systems_full_sft.jsonl \
--model Qwen/Qwen2.5-Coder-3B \
--output models/windows_sft_3b \
--epochs 2
# Detach: Ctrl+A, D
RAFT Training (~3 hours)
screen -S raft_3b
halo-forge raft train \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--model Qwen/Qwen2.5-Coder-3B \
--checkpoint models/windows_sft_3b/final_model \
--cycles 6 \
--output models/windows_raft_3b
# Detach: Ctrl+A, D
Final Benchmark
halo-forge benchmark run \
--model models/windows_raft_3b/cycle_6_final \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--k 1,5,10 \
--output results/windows/trained_3b.json
Compare Results
echo "=== 3B Results ==="
echo "Baseline:" && jq '.pass_at_k' results/windows/baseline_3b.json
echo "Trained:" && jq '.pass_at_k' results/windows/trained_3b.json
7B Model
The largest supported model. Requires gradient checkpointing for memory efficiency.
Baseline Benchmark
halo-forge benchmark run \
--model Qwen/Qwen2.5-Coder-7B \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--output results/windows/baseline_7b.json
SFT Training (~2.5 hours)
screen -S sft_7b
# For 7B, use config file to enable gradient checkpointing
halo-forge sft train \
--config configs/production_7b.yaml \
--data datasets/windows_curriculum/windows_systems_full_sft.jsonl \
--output models/windows_sft_7b \
--epochs 2
# Detach: Ctrl+A, D
RAFT Training (~5 hours)
screen -S raft_7b
halo-forge raft train \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--model Qwen/Qwen2.5-Coder-7B \
--checkpoint models/windows_sft_7b/final_model \
--cycles 6 \
--output models/windows_raft_7b
# Detach: Ctrl+A, D
Final Benchmark
halo-forge benchmark run \
--model models/windows_raft_7b/cycle_6_final \
--prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
--verifier mingw \
--samples 10 \
--k 1,5,10 \
--output results/windows/trained_7b.json
Compare Results
echo "=== 7B Results ==="
echo "Baseline:" && jq '.pass_at_k' results/windows/baseline_7b.json
echo "Trained:" && jq '.pass_at_k' results/windows/trained_7b.json
Summary Report
After training all models, generate a summary report:
echo "============================================" > results/windows/summary.txt
echo "WINDOWS SYSTEMS PROGRAMMING TRAINING RESULTS" >> results/windows/summary.txt
echo "Date: $(date)" >> results/windows/summary.txt
echo "Dataset: 361 problems (100% compile rate)" >> results/windows/summary.txt
echo "============================================" >> results/windows/summary.txt
for size in 0.5b 1.5b 3b 7b; do
    echo "" >> results/windows/summary.txt
    echo "--- Qwen2.5-Coder-${size} ---" >> results/windows/summary.txt
    if [ -f "results/windows/baseline_${size}.json" ]; then
        echo "Baseline:" >> results/windows/summary.txt
        jq -r '" pass@1: \(.pass_at_k["1"])"' "results/windows/baseline_${size}.json" >> results/windows/summary.txt
    fi
    if [ -f "results/windows/trained_${size}.json" ]; then
        echo "Trained:" >> results/windows/summary.txt
        jq -r '" pass@1: \(.pass_at_k["1"])"' "results/windows/trained_${size}.json" >> results/windows/summary.txt
    fi
done
cat results/windows/summary.txt
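The summary loop skips any model whose result files are absent, so a partial run produces a shorter report without any warning. As a rough sketch, the helper below lists which expected files are missing. The name `missing_results` is hypothetical, and the file names follow the `baseline_<size>.json` / `trained_<size>.json` convention used in this guide.

```shell
# Hypothetical helper: list expected result files that do not exist.
# $1 is the results directory (defaults to results/windows).
missing_results() {
    dir="${1:-results/windows}"
    for size in 0.5b 1.5b 3b 7b; do
        for kind in baseline trained; do
            f="${dir}/${kind}_${size}.json"
            # Print the path only when the file is absent.
            [ -f "$f" ] || printf '%s\n' "$f"
        done
    done
}

# Example: run before generating the report.
# missing_results results/windows
```

An empty result means all eight benchmark files are present and the report will cover every model size.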
Troubleshooting
| Problem | Solution |
|---|---|
| GPU hang | export HSA_ENABLE_SDMA=0 before training |
| OOM error | Use config file with gradient_checkpointing: true or reduce batch size |
| MinGW not found | sudo dnf install mingw64-gcc-c++ or sudo apt install mingw-w64 |
| MSVC verifier timeout | Check Windows server connectivity |
| Training crash | Re-run same command (auto-resume from checkpoint) |
| Slow training | Check GPU usage with radeontop |
Useful Commands
# Check screen sessions
screen -ls
# Reattach to session
screen -r sft_0.5b
# Monitor GPU
radeontop
# Check disk usage
du -sh models/windows_*
# Kill stuck training
pkill -f "halo-forge"
# View training logs
tail -f models/windows_raft_0.5b/training.log
Results Tracking Template
Use this template to track your results:
| Model | Baseline pass@1 | After SFT | After RAFT | Improvement |
|---|---|---|---|---|
| 0.5B | | | | |
| 1.5B | | | | |
| 3B | | | | |
| 7B | | | | |
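The Improvement column is the trained score minus the baseline score. A minimal sketch of that arithmetic with awk follows. The name `pass_improvement` is hypothetical, and it assumes both values are plain numbers on the same scale, as printed by the Compare Results commands. This guide does not state whether those are fractions or percentages.

```shell
# Hypothetical helper: signed difference between two pass@1 values.
# Usage: pass_improvement BASELINE TRAINED
pass_improvement() {
    awk -v base="$1" -v trained="$2" \
        'BEGIN { printf "%+.3f\n", trained - base }'
}

# Example with made-up numbers (not real results):
# pass_improvement 0.25 0.5
```

The explicit sign makes regressions stand out when a trained model scores below its baseline.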
Verifier Configuration
MinGW (Default)
MinGW works out of the box with just the --verifier mingw flag. No config file required.
MSVC (Advanced)
For full compile + run + output verification, create configs/raft_windows_msvc.yaml:
# RAFT with MSVC Verifier for Windows training
base_model: Qwen/Qwen2.5-Coder-0.5B # Override with --model
num_cycles: 6
output_dir: models/raft
verifier:
  type: msvc
  host: YOUR_WINDOWS_IP
  user: YOUR_USERNAME
  ssh_key: ~/.ssh/win
Then use --config configs/raft_windows_msvc.yaml with your RAFT commands.
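The config above contains the placeholder values `YOUR_WINDOWS_IP` and `YOUR_USERNAME`, which must be replaced before the MSVC verifier can connect. As a quick sanity check, the sketch below tests whether a config file still contains them. The function name `config_has_placeholders` is hypothetical and not part of halo-forge.

```shell
# Hypothetical helper: succeed (exit 0) if the config file still
# contains the template placeholders from this guide.
config_has_placeholders() {
    grep -Eq 'YOUR_WINDOWS_IP|YOUR_USERNAME' "$1"
}

# Example:
# if config_has_placeholders configs/raft_windows_msvc.yaml; then
#     echo "Edit host and user in the config before training" >&2
# fi
```

This catches only the unedited template. It does not confirm that the host is reachable or that the SSH key is accepted.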
Complete Verifier Reference
Compilation Verifiers
| Verifier | Language | Target | Compile | Run | Binary Cache | Cross-Compile | Requires |
|---|---|---|---|---|---|---|---|
| gcc | C/C++ | Linux ELF | Yes | Yes | Yes | - | gcc/g++ |
| clang | C/C++ | Linux ELF | Yes | Yes | Yes | - | clang/clang++ |
| mingw | C/C++ | Windows PE | Yes | No | Yes | - | mingw-w64 |
| msvc | C/C++ | Windows PE | Yes | Yes | Yes | - | Windows build server |
| rust | Rust | Native/Windows | Yes | Yes | Yes | x86_64-pc-windows-gnu | cargo, rustup |
| go | Go | Native/Windows | Yes | Yes | Yes | GOOS=windows | go |
| dotnet | C# | Windows PE | Yes | No | Yes | win-x64 | dotnet SDK |
| powershell | PS1 | Script | Syntax | No | Yes | - | pwsh or Windows |
Test Verifiers
| Verifier | Language | Description | Requires |
|---|---|---|---|
| pytest | Python | Run pytest tests | pytest |
| unittest | Python | Run unittest tests | (built-in) |
| humaneval | Python | HumanEval benchmark | (built-in) |
| mbpp | Python | MBPP benchmark | (built-in) |
Verifier Quick Reference
# C/C++ on Linux
halo-forge benchmark run --verifier gcc ...
# Windows PE (cross-compile on Linux, no execution)
halo-forge benchmark run --verifier mingw ...
# Windows PE (compile + run on Windows server)
halo-forge benchmark run --verifier msvc ...
# Rust (native)
halo-forge benchmark run --verifier rust ...
# Go (native)
halo-forge benchmark run --verifier go ...
# C#/.NET (cross-compile to Windows)
halo-forge benchmark run --verifier dotnet ...
# PowerShell (syntax check)
halo-forge benchmark run --verifier powershell ...
# Python (MBPP benchmark)
halo-forge benchmark run --verifier mbpp ...
Graduated Rewards
All verifiers return graduated rewards for partial credit:
| Stage | Reward | Description |
|---|---|---|
| Compile fail | 0.0 | Does not compile |
| Compile with warnings | 0.3 | Compiles with warnings |
| Compile clean | 0.5 | Compiles without warnings |
| Runs without crash | 0.7 | Executes successfully |
| Correct output | 1.0 | Output matches expected |
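The table can be read as a lookup from the furthest stage a sample reached to its reward. The sketch below only restates the table as a shell function. The stage identifiers (`compile_fail`, `runs`, and so on) and the name `reward_for_stage` are hypothetical labels chosen for illustration, not names used by halo-forge.

```shell
# Hypothetical lookup restating the graduated reward table.
# Prints the reward for a stage label; fails on unknown labels.
reward_for_stage() {
    case "$1" in
        compile_fail)     echo 0.0 ;;  # does not compile
        compile_warnings) echo 0.3 ;;  # compiles with warnings
        compile_clean)    echo 0.5 ;;  # compiles without warnings
        runs)             echo 0.7 ;;  # executes without crashing
        correct_output)   echo 1.0 ;;  # output matches expected
        *) echo "unknown stage: $1" >&2; return 1 ;;
    esac
}

# Example:
# reward_for_stage compile_clean
```

Since this guide states that MinGW cannot run the executables it builds, a compile-only verifier presumably never reaches the last two stages, which would cap its reward at 0.5.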