Full Pipeline

Complete guide to training a code generation model

Overview

1. Data Generation
   Public datasets (MBPP, CodeForces) or LLM-generated examples
2. SFT Training
   LoRA fine-tuning to establish baseline capability (~15-25% compile rate)
3. RAFT Training
   Generate → Verify → Filter → Train → Repeat
   5-6 cycles, ~2 hours each
4. Benchmark
   pass@k evaluation (~45-55% compile rate after RAFT)

Step 1: Data Generation

Option A: Public Datasets

# List available datasets
halo-forge data prepare --list

# Download CodeForces C++ examples
halo-forge data prepare --dataset codeforces_cpp --output data/train.jsonl

Option B: LLM Generation

export DEEPSEEK_API_KEY=your_key_here

# Generate Rust async examples
halo-forge data generate \
  --topic rust_async \
  --backend deepseek \
  --output data/rust.jsonl

Data Format

{
  "text": "<|im_start|>system\nYou are an expert programmer.<|im_end|>\n<|im_start|>user\nWrite a function to...<|im_end|>\n<|im_start|>assistant\n```cpp\n#include...\n```<|im_end|>"
}
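
A record in this format can be assembled programmatically. The following is a minimal sketch, assuming the chat template is exactly the one shown above; the helper names `make_record` and `to_jsonl_line` are hypothetical and not part of halo-forge.

```python
import json

# System prompt copied from the example record above.
SYSTEM_PROMPT = "You are an expert programmer."


def make_record(user_prompt: str, code: str, lang: str = "cpp") -> dict:
    """Wrap one prompt/solution pair in the chat template shown above.

    Hypothetical helper for illustration only.
    """
    fence = "`" * 3  # the code fence that wraps the assistant's answer
    text = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{fence}{lang}\n{code}\n{fence}<|im_end|>"
    )
    return {"text": text}


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (newlines are escaped)."""
    return json.dumps(record, ensure_ascii=False)


record = make_record(
    "Write a function to add two integers.",
    "int add(int a, int b) { return a + b; }",
)
print(to_jsonl_line(record))
```

Each line of data/train.jsonl would then hold one such serialized record.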

Step 2: SFT Training

Supervised fine-tuning establishes baseline capability:

halo-forge sft train \
  --data data/train.jsonl \
  --output models/sft \
  --epochs 3

Why SFT First?

Stage           Compile Rate
Base Qwen 7B    ~5%
After SFT       ~15-25%
After RAFT      ~45-55%

RAFT trains only on model outputs that pass verification. At the base model's ~5% compile rate, too few samples survive the filter to train on, so SFT comes first.
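
A back-of-the-envelope calculation illustrates the gap. This sketch assumes every sample compiles independently at the stage's average compile rate (a simplification, since real rates vary by prompt) and uses 8 samples per prompt, the RAFT default. The function name is hypothetical.

```python
def p_at_least_one(compile_rate: float, samples: int = 8) -> float:
    """Chance that at least one of `samples` independent generations compiles."""
    return 1.0 - (1.0 - compile_rate) ** samples


# Compile rates taken from the table above; 0.20 is the midpoint of 15-25%.
for stage, rate in [("Base Qwen 7B", 0.05), ("After SFT", 0.20)]:
    share = p_at_least_one(rate)
    print(f"{stage}: {share:.0%} of prompts yield at least one compiling sample")
```

Under these assumptions, roughly a third of prompts produce any compiling sample from the base model, against more than 80% after SFT.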

Step 3: RAFT Training

Iterative verification loop:

halo-forge raft train \
  --checkpoint models/sft/final_model \
  --prompts data/prompts.jsonl \
  --verifier gcc \
  --cycles 5 \
  --samples-per-prompt 8 \
  --output models/raft

RAFT Parameters

Parameter            Default   Description
cycles               5         Number of RAFT iterations
samples-per-prompt   8         Samples to generate per prompt
reward-threshold     0.5       Minimum reward for a sample to be kept
keep-top-percent     0.5       Fraction of above-threshold samples to keep (0.5 = top 50%)
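
One plausible reading of how the two filtering parameters combine is sketched below. It is an illustration, not halo-forge's actual implementation: the function name, the inclusive threshold comparison, and the round-up rule are all assumptions.

```python
import math


def filter_samples(samples, reward_threshold=0.5, keep_top_percent=0.5):
    """Keep the best-scoring share of the samples that clear the threshold.

    `samples` is a list of (completion, reward) pairs.
    Assumptions: the threshold is inclusive, and the kept count rounds up.
    """
    # Stage 1: drop everything below the reward threshold.
    passing = [s for s in samples if s[1] >= reward_threshold]
    # Stage 2: rank the survivors by reward, best first.
    passing.sort(key=lambda s: s[1], reverse=True)
    # Stage 3: keep only the top fraction of the survivors.
    keep = math.ceil(len(passing) * keep_top_percent)
    return passing[:keep]


scored = [("a", 1.0), ("b", 0.2), ("c", 0.6), ("d", 0.5), ("e", 0.9)]
print(filter_samples(scored))
```

With the defaults, four of the five toy samples clear the threshold and the top half of those (two samples) are kept for training.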

Cycle Dynamics

Cycle 1: Generate → Verify → Filter (keep 40%) → Train
Cycle 2: Generate → Verify → Filter (keep 50%) → Train
Cycle 3: Generate → Verify → Filter (keep 55%) → Train
...

As the model improves, a larger share of its samples passes verification, so the kept fraction rises from cycle to cycle (40% → 50% → 55% in the illustration above).
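
The loop itself can be sketched in a few lines. This is a schematic, not halo-forge's code: `generate`, `verify`, and `train` are hypothetical callables supplied by the caller, and the top-percent filtering step is omitted for brevity.

```python
def raft_loop(model, prompts, generate, verify, train,
              cycles=5, samples_per_prompt=8, reward_threshold=0.5):
    """Run Generate → Verify → Filter → Train for a fixed number of cycles.

    Returns the final model and the fraction of samples kept in each cycle.
    """
    keep_rates = []
    for _ in range(cycles):
        kept, total = [], 0
        for prompt in prompts:
            for _ in range(samples_per_prompt):
                completion = generate(model, prompt)        # Generate
                total += 1
                if verify(completion) >= reward_threshold:  # Verify + Filter
                    kept.append((prompt, completion))
        keep_rates.append(len(kept) / total if total else 0.0)
        model = train(model, kept)                          # Train on survivors
    return model, keep_rates


# Toy run with stand-in callables: the "model" here is just a counter.
final_model, rates = raft_loop(
    model=0,
    prompts=["p1", "p2"],
    generate=lambda m, p: f"{p}-v{m}",
    verify=lambda completion: 1.0,
    train=lambda m, kept: m + 1,
    cycles=3,
    samples_per_prompt=2,
)
print(final_model, rates)
```

Passing the three stages in as callables keeps the control flow visible; in a real run they would wrap model sampling, the gcc verifier, and a fine-tuning step.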

Step 4: Benchmark

Evaluate the trained model:

halo-forge benchmark run \
  --model models/raft/cycle_5_final \
  --prompts data/test.jsonl \
  --verifier gcc \
  --samples 10 \
  --k 1,5,10

pass@k Metrics

  • pass@1: Probability that a single sample passes verification
  • pass@5: Probability that at least 1 of 5 samples passes verification
  • pass@10: Probability that at least 1 of 10 samples passes verification
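
For a single prompt with n generated samples, of which c pass, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), averaged over all prompts. The sketch below implements that standard formula; whether halo-forge computes its metrics exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one prompt.

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one prompt, 3 of which passed the verifier.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

Generating more samples than the largest k (for example `--samples 20` with `--k 1,5,10`) would tighten the estimate for the larger values of k.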

Complete Example

# 1. Prepare data
halo-forge data prepare --dataset codeforces_cpp --output data/train.jsonl

# 2. Extract prompts for RAFT
head -200 data/train.jsonl | jq -c '{prompt: .text | split("<|im_start|>user\n")[1] | split("<|im_end|>")[0]}' > data/prompts.jsonl

# 3. Run SFT
halo-forge sft train \
  --data data/train.jsonl \
  --output models/sft \
  --epochs 3

# 4. Run RAFT
halo-forge raft train \
  --checkpoint models/sft/final_model \
  --prompts data/prompts.jsonl \
  --verifier gcc \
  --cycles 5 \
  --output models/raft

# 5. Benchmark (data/test.jsonl must already exist; the steps above do not create it)
halo-forge benchmark run \
  --model models/raft/cycle_5_final \
  --prompts data/test.jsonl \
  --verifier gcc
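
The jq one-liner in step 2 pulls the user turn out of each training record. The same extraction can be sketched in Python, which is easier to extend with extra checks. The function names are hypothetical, and skipping records without a user turn is a choice made for this sketch.

```python
import json
from itertools import islice

USER_TAG = "<|im_start|>user\n"
END_TAG = "<|im_end|>"


def extract_prompt(text: str) -> str:
    """Return the user turn from one chat-formatted training example."""
    return text.split(USER_TAG)[1].split(END_TAG)[0]


def extract_prompts(jsonl_lines, limit=200):
    """Mirror `head -200 | jq ...`: read the first `limit` lines only."""
    prompts = []
    for line in islice(jsonl_lines, limit):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        text = json.loads(line)["text"]
        if USER_TAG not in text:
            continue  # skip records that have no user turn
        prompts.append({"prompt": extract_prompt(text)})
    return prompts


example = json.dumps({
    "text": "<|im_start|>system\nYou are an expert programmer.<|im_end|>\n"
            "<|im_start|>user\nWrite a function to reverse a string.<|im_end|>\n"
            "<|im_start|>assistant\n...<|im_end|>"
})
print(extract_prompts([example]))
```

To produce data/prompts.jsonl, pass an open file handle for data/train.jsonl and write each returned dict with `json.dumps`, one per line.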

Timeline

Phase             Duration     Notes
Data prep         5-10 min     Depends on dataset size
SFT               1-2 hours    3 epochs, 7B model
RAFT (5 cycles)   8-12 hours   ~2 hours per cycle
Benchmark         30-60 min    Depends on samples

Total: ~10-15 hours for the complete pipeline (the sum of the phases above).