Full Pipeline

Complete guide to training a code generation model

Overview

Data Generation

Public datasets (MBPP, CodeForces) or LLM-generated examples

↓

SFT Training

LoRA fine-tuning to establish baseline capability (~15-25% compile rate)

↓

RAFT Training

Generate → Verify → Filter → Train → Repeat

5-6 cycles, ~2 hours each

↓

Benchmark

pass@k evaluation (~45-55% compile rate after RAFT)

Step 1: Data Generation

Option A: Public Datasets

# List available datasets
halo-forge data prepare --list

# Download CodeForces C++ examples
halo-forge data prepare --dataset codeforces_cpp --output data/train.jsonl

Option B: LLM Generation

export DEEPSEEK_API_KEY=your_key_here

# Generate Rust async examples
halo-forge data generate \
  --topic rust_async \
  --backend deepseek \
  --output data/rust.jsonl

Data Format

{
  "text": "<|im_start|>system\nYou are an expert programmer.<|im_end|>\n<|im_start|>user\nWrite a function to...<|im_end|>\n<|im_start|>assistant\n```cpp\n#include...\n```<|im_end|>"
}

Step 2: SFT Training

Supervised fine-tuning establishes baseline capability:

# Using HuggingFace dataset (recommended)
halo-forge sft train \
  --dataset codealpaca \
  --model Qwen/Qwen2.5-Coder-7B \
  --output models/sft \
  --epochs 3

# Or using local data
halo-forge sft train \
  --data data/train.jsonl \
  --output models/sft \
  --epochs 3

Why SFT First?

Stage	Compile Rate
Base Qwen 7B	~5%
After SFT	~15-25%
After RAFT	~45-55%

RAFT filters model outputs. Without SFT, there’s nothing useful to filter.

Step 3: RAFT Training

Iterative verification loop:

halo-forge raft train \
  --checkpoint models/sft/final_model \
  --prompts data/prompts.jsonl \
  --verifier gcc \
  --cycles 5 \
  --samples-per-prompt 8 \
  --temperature 0.7 \
  --output models/raft

Or use a preset config:

halo-forge raft train \
  --config configs/raft_conservative.yaml \
  --prompts data/prompts.jsonl \
  --output models/raft

RAFT Parameters

Parameter	Default	Description
`--cycles`	5	Number of RAFT iterations
`--samples-per-prompt`	8	Samples to generate per prompt
`--temperature`	0.7	Generation diversity
`--max-new-tokens`	1024	Max tokens per completion
`--reward-threshold`	0.5	Minimum reward to keep
`--keep-percent`	0.5	Top % of samples above threshold
`--min-samples`	-	Auto-adjust threshold if too few pass

Cycle Dynamics

Cycle 1: Generate → Verify → Filter (keep 40%) → Train
Cycle 2: Generate → Verify → Filter (keep 50%) → Train
Cycle 3: Generate → Verify → Filter (keep 55%) → Train
...

Each cycle improves the model’s ability to generate code that passes verification.

Step 4: Benchmark

Evaluate the trained model:

halo-forge benchmark run \
  --model models/raft/cycle_5_final \
  --prompts data/test.jsonl \
  --verifier gcc \
  --samples 10 \
  --k 1,5,10

pass@k Metrics

pass@1: Probability first sample is correct
pass@5: Probability at least 1 of 5 samples is correct
pass@10: Probability at least 1 of 10 samples is correct

Complete Example: Code Domain

# 1. SFT Training (using HuggingFace dataset)
halo-forge sft train \
  --dataset codealpaca \
  --model Qwen/Qwen2.5-Coder-3B \
  --output models/code_sft \
  --epochs 2

# 2. RAFT Training (HumanEval uses Python, so use humaneval verifier)
halo-forge raft train \
  --model models/code_sft \
  --prompts data/rlvr/humaneval_prompts.jsonl \
  --verifier humaneval \
  --cycles 5 \
  --output models/code_raft

# 3. Benchmark
halo-forge benchmark run \
  --model models/code_raft \
  --prompts data/rlvr/humaneval_prompts.jsonl \
  --verifier humaneval \
  --samples 10

Full Pipeline: All Domains

Reasoning (Math)

# SFT → RAFT → Benchmark
halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct --output models/reasoning_sft
halo-forge reasoning train --model models/reasoning_sft --dataset gsm8k --cycles 5 --output models/reasoning_raft --seed 42
halo-forge reasoning benchmark --model models/reasoning_raft --dataset gsm8k

Audio (ASR)

# SFT → RAFT → Benchmark
halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small --output models/audio_sft
halo-forge audio train --model models/audio_sft --dataset librispeech --task asr --cycles 3 --output models/audio_raft --seed 42
halo-forge audio benchmark --model models/audio_raft --dataset librispeech --task asr

VLM (Vision-Language)

# SFT → RAFT → Benchmark
halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct --output models/vlm_sft
halo-forge vlm train --model models/vlm_sft --dataset textvqa --cycles 3 --output models/vlm_raft --seed 42
halo-forge vlm benchmark --model models/vlm_raft --dataset textvqa

Agentic (Tool Calling)

# SFT → RAFT → Benchmark
halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct --output models/agentic_sft
halo-forge agentic train --model models/agentic_sft --dataset xlam --cycles 3 --output models/agentic_raft --seed 42
halo-forge agentic benchmark --model models/agentic_raft --dataset xlam

UI Operations Notes

When launched from the web UI:

Each run writes <output_dir>/launch_context.json for durable rerun/clone actions.
Monitor and Results can rerun jobs even after UI restart if launch context exists.
Resume Latest is available only for cycle-based training jobs (raft, vlm, audio, reasoning, agentic) and uses checkpoint metadata from the run output.
Benchmark jobs support Rerun and Clone to Form, but not checkpoint resume.

Timeline

Phase	Duration	Notes
Data prep	5-10 min	Depends on dataset size
SFT	1-2 hours	3 epochs, 7B model
RAFT (5 cycles)	8-12 hours	~2 hours per cycle
Benchmark	30-60 min	Depends on samples

Total: ~12-16 hours for complete pipeline.