# Full Pipeline

Complete guide to training a code generation model.

## Overview

1. **Data Generation**: public datasets (MBPP, CodeForces) or LLM-generated examples
2. **SFT Training**: LoRA fine-tuning to establish baseline capability (~15-25% compile rate)
3. **RAFT Training**: Generate → Verify → Filter → Train → Repeat (5-6 cycles, ~2 hours each)
4. **Benchmark**: pass@k evaluation (~45-55% compile rate after RAFT)

## Step 1: Data Generation

### Option A: Public Datasets

```bash
# List available datasets
halo-forge data prepare --list

# Download CodeForces C++ examples
halo-forge data prepare --dataset codeforces_cpp --output data/train.jsonl
```

### Option B: LLM Generation

```bash
export DEEPSEEK_API_KEY=your_key_here

# Generate Rust async examples
halo-forge data generate \
  --topic rust_async \
  --backend deepseek \
  --output data/rust.jsonl
```

### Data Format

Each training example is one JSON object per line (JSONL), with the full chat template in a single `text` field:

````json
{
  "text": "<|im_start|>system\nYou are an expert programmer.<|im_end|>\n<|im_start|>user\nWrite a function to...<|im_end|>\n<|im_start|>assistant\n```cpp\n#include...\n```<|im_end|>"
}
````
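Records in this layout can be assembled programmatically. The sketch below builds one ChatML-formatted JSONL line matching the `text` field shown above; the `make_record` helper is illustrative, not part of halo-forge.

```python
import json

def make_record(user_prompt: str, assistant_reply: str,
                system: str = "You are an expert programmer.") -> str:
    """Build one JSONL training line in the ChatML layout shown above.
    Illustrative helper; not a halo-forge API."""
    text = (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant_reply}<|im_end|>"
    )
    return json.dumps({"text": text})

# One line per example, appended to data/train.jsonl:
line = make_record("Write a function to add two ints.",
                   "int add(int a, int b) { return a + b; }")
```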

## Step 2: SFT Training

Supervised fine-tuning establishes baseline capability:

```bash
# Using a HuggingFace dataset (recommended)
halo-forge sft train \
  --dataset codealpaca \
  --model Qwen/Qwen2.5-Coder-7B \
  --output models/sft \
  --epochs 3

# Or using local data
halo-forge sft train \
  --data data/train.jsonl \
  --output models/sft \
  --epochs 3
```

### Why SFT First?

| Stage | Compile Rate |
|-------|--------------|
| Base Qwen 7B | ~5% |
| After SFT | ~15-25% |
| After RAFT | ~45-55% |

RAFT filters model outputs. Without SFT, there’s nothing useful to filter.
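The arithmetic behind this: with s independent samples per prompt and per-sample compile rate p, the chance a prompt yields at least one usable sample is 1 - (1 - p)^s. A quick sketch using the illustrative rates from the table above:

```python
def prompt_yield(p: float, s: int = 8) -> float:
    """Probability that at least one of s samples compiles,
    assuming per-sample compile rate p and independent samples."""
    return 1 - (1 - p) ** s

base = prompt_yield(0.05)  # base model, ~5% compile rate: ~1/3 of prompts
sft = prompt_yield(0.20)   # after SFT, ~20% compile rate: most prompts
# With 8 samples per prompt, the base model leaves the majority of
# prompts with zero passing samples (no training signal), while the
# SFT model produces something filterable for most of them.
```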

## Step 3: RAFT Training

Iterative verification loop:

```bash
halo-forge raft train \
  --checkpoint models/sft/final_model \
  --prompts data/prompts.jsonl \
  --verifier gcc \
  --cycles 5 \
  --samples-per-prompt 8 \
  --temperature 0.7 \
  --output models/raft
```

Or use a preset config:

```bash
halo-forge raft train \
  --config configs/raft_conservative.yaml \
  --prompts data/prompts.jsonl \
  --output models/raft
```

### RAFT Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--cycles` | 5 | Number of RAFT iterations |
| `--samples-per-prompt` | 8 | Samples to generate per prompt |
| `--temperature` | 0.7 | Generation diversity |
| `--max-new-tokens` | 1024 | Max tokens per completion |
| `--reward-threshold` | 0.5 | Minimum reward to keep a sample |
| `--keep-percent` | 0.5 | Top fraction of samples above the threshold |
| `--min-samples` | - | Auto-adjust threshold if too few pass |
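One plausible way the three filter parameters interact is sketched below. This is an illustration of the semantics described in the table, not halo-forge's actual implementation:

```python
def filter_samples(rewards, threshold=0.5, keep_percent=0.5, min_samples=4):
    """Sketch of the filter stage: keep the top keep_percent of samples
    whose reward clears the threshold; if fewer than min_samples clear
    it, fall back to the best min_samples regardless of threshold.
    Illustrative only; not the halo-forge implementation."""
    # Rank sample indices by reward, best first
    ranked = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    passing = [i for i in ranked if rewards[i] >= threshold]
    if len(passing) < min_samples:
        passing = ranked[:min_samples]  # auto-adjust: take the best N anyway
    keep = max(1, int(len(passing) * keep_percent))
    return passing[:keep]
```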

### Cycle Dynamics

```text
Cycle 1: Generate → Verify → Filter (keep 40%) → Train
Cycle 2: Generate → Verify → Filter (keep 50%) → Train
Cycle 3: Generate → Verify → Filter (keep 55%) → Train
...
```

Each cycle improves the model’s ability to generate code that passes verification.
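The overall loop structure can be sketched as follows. The `generate`, `verify`, and `train` callables are placeholders for the model sampler, the verifier (e.g. gcc), and the fine-tuning step; none of these names are halo-forge APIs:

```python
def raft_loop(generate, verify, train, prompts, cycles=5, keep_percent=0.5):
    """Skeleton of the Generate → Verify → Filter → Train loop.
    generate(prompt) -> list of samples; verify(sample) -> reward;
    train(samples) fine-tunes on the survivors. Placeholder names."""
    kept_counts = []
    for cycle in range(cycles):
        # Generate: sample completions for every prompt
        samples = [s for p in prompts for s in generate(p)]
        # Verify: score each sample with the verifier
        scored = [(s, verify(s)) for s in samples]
        # Filter: keep the top keep_percent by reward
        scored.sort(key=lambda t: t[1], reverse=True)
        kept = [s for s, _ in scored[: max(1, int(len(scored) * keep_percent))]]
        # Train: fine-tune on the surviving samples
        train(kept)
        kept_counts.append(len(kept))
    return kept_counts
```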

## Step 4: Benchmark

Evaluate the trained model:

```bash
halo-forge benchmark run \
  --model models/raft/cycle_5_final \
  --prompts data/test.jsonl \
  --verifier gcc \
  --samples 10 \
  --k 1,5,10
```

### pass@k Metrics

- **pass@1**: probability that a single sample is correct
- **pass@5**: probability that at least 1 of 5 samples is correct
- **pass@10**: probability that at least 1 of 10 samples is correct
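These are typically computed with the standard unbiased pass@k estimator from the HumanEval methodology: draw n samples per problem, count c correct, then pass@k = 1 - C(n-c, k) / C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n samples drawn, c of them correct, k the budget."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-benchmark pass@k is the mean of this quantity over all problems.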

## Complete Example: Code Domain

```bash
# 1. SFT training (using a HuggingFace dataset)
halo-forge sft train \
  --dataset codealpaca \
  --model Qwen/Qwen2.5-Coder-3B \
  --output models/code_sft \
  --epochs 2

# 2. RAFT training (HumanEval is Python, so use the humaneval verifier)
halo-forge raft train \
  --model models/code_sft \
  --prompts data/rlvr/humaneval_prompts.jsonl \
  --verifier humaneval \
  --cycles 5 \
  --output models/code_raft

# 3. Benchmark
halo-forge benchmark run \
  --model models/code_raft \
  --prompts data/rlvr/humaneval_prompts.jsonl \
  --verifier humaneval \
  --samples 10
```

## Full Pipeline: All Domains

### Reasoning (Math)

```bash
# SFT → RAFT → Benchmark
halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct --output models/reasoning_sft
halo-forge reasoning train --model models/reasoning_sft --dataset gsm8k --cycles 5 --output models/reasoning_raft --seed 42
halo-forge reasoning benchmark --model models/reasoning_raft --dataset gsm8k
```

### Audio (ASR)

```bash
# SFT → RAFT → Benchmark
halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small --output models/audio_sft
halo-forge audio train --model models/audio_sft --dataset librispeech --task asr --cycles 3 --output models/audio_raft --seed 42
halo-forge audio benchmark --model models/audio_raft --dataset librispeech --task asr
```

### VLM (Vision-Language)

```bash
# SFT → RAFT → Benchmark
halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct --output models/vlm_sft
halo-forge vlm train --model models/vlm_sft --dataset textvqa --cycles 3 --output models/vlm_raft --seed 42
halo-forge vlm benchmark --model models/vlm_raft --dataset textvqa
```

### Agentic (Tool Calling)

```bash
# SFT → RAFT → Benchmark
halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct --output models/agentic_sft
halo-forge agentic train --model models/agentic_sft --dataset xlam --cycles 3 --output models/agentic_raft --seed 42
halo-forge agentic benchmark --model models/agentic_raft --dataset xlam
```

## UI Operations Notes

When launched from the web UI:

- Each run writes `<output_dir>/launch_context.json` for durable rerun/clone actions.
- Monitor and Results can rerun jobs even after a UI restart if the launch context exists.
- Resume Latest is available only for cycle-based training jobs (raft, vlm, audio, reasoning, agentic) and uses checkpoint metadata from the run output.
- Benchmark jobs support Rerun and Clone to Form, but not checkpoint resume.

## Timeline

| Phase | Duration | Notes |
|-------|----------|-------|
| Data prep | 5-10 min | Depends on dataset size |
| SFT | 1-2 hours | 3 epochs, 7B model |
| RAFT (5 cycles) | 8-12 hours | ~2 hours per cycle |
| Benchmark | 30-60 min | Depends on sample count |

**Total:** ~12-16 hours for the complete pipeline.