SFT Training

Supervised fine-tuning (SFT) establishes baseline capability before RAFT.

Basic Usage

halo-forge sft train \
  --data data/train.jsonl \
  --output models/sft \
  --model Qwen/Qwen2.5-Coder-7B \
  --epochs 3

Configuration

# configs/sft.yaml
model:
  name: Qwen/Qwen2.5-Coder-7B
  
data:
  train_file: data/train.jsonl
  max_seq_length: 2048

training:
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 16
  learning_rate: 2e-5
  warmup_ratio: 0.03
  bf16: true
  gradient_checkpointing: true
  dataloader_num_workers: 0      # Required for Strix Halo
  dataloader_pin_memory: false   # Required for Strix Halo

lora:
  r: 64
  alpha: 128
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

output:
  dir: models/sft
  save_steps: 500
  logging_steps: 10
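
If halo-forge's trainer is built on Hugging Face transformers (an assumption; the backend isn't stated here), the training: block maps onto TrainingArguments roughly like this:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/sft",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    dataloader_num_workers=0,      # single-process data loading (Strix Halo)
    dataloader_pin_memory=False,   # no pinned host memory (Strix Halo)
    save_steps=500,
    logging_steps=10,
)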

LoRA Settings

Parameter        Value        Notes
r                64           Rank; higher = more capacity
alpha            128          Scaling; typically 2x rank
dropout          0.05         Regularization
target_modules   all linear   Full coverage of attention and MLP layers
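
For reference, the same settings expressed as a PEFT LoraConfig (assuming halo-forge uses PEFT, which the adapter_config.json and adapter_model.safetensors outputs suggest):

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                     # rank of the low-rank update matrices
    lora_alpha=128,           # scaling; effective scale = alpha / r = 2.0
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)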

Learning Rate

  • Start with 2e-5 for 7B models
  • Use 5e-5 for smaller models (0.5B-3B)
  • Warmup improves stability early in training (see the scheduler sketch below)
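
A minimal, runnable sketch of a warmup-then-linear-decay schedule using the stock transformers helper (the backend and scheduler type are assumptions; halo-forge's internals aren't shown here):

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)               # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

total_steps = 1000                          # epochs x steps per epoch
warmup_steps = int(0.03 * total_steps)      # warmup_ratio: 0.03

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
# LR ramps from 0 to 2e-5 over the first 3% of steps, then decays linearly.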

Batch Size

Effective batch size = per_device_train_batch_size × gradient_accumulation_steps (× number of devices; Strix Halo is a single GPU)

For 7B on Strix Halo:

  • per_device_train_batch_size: 2
  • gradient_accumulation_steps: 16
  • Effective: 32
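
Gradient accumulation is standard PyTorch; this runnable sketch (with a toy model, not halo-forge's trainer) shows why the two numbers multiply:

import torch

model = torch.nn.Linear(4, 1)                  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

accum_steps = 16                               # gradient_accumulation_steps
micro_batches = [(torch.randn(2, 4), torch.randn(2, 1))  # per-device batch of 2
                 for _ in range(accum_steps)]

optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / accum_steps  # average over the virtual batch
    loss.backward()                            # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one update per 2 x 16 = 32 samples
        optimizer.zero_grad()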

Early Stopping

Watch the loss curve. If validation loss stops decreasing (this requires a validation split to evaluate against), enable early stopping:

training:
  early_stopping: true
  early_stopping_patience: 3
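
In Hugging Face terms (again, assuming that backend), this corresponds to the stock EarlyStoppingCallback, which requires periodic evaluation and best-model tracking:

from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="models/sft",
    eval_strategy="steps",              # early stopping needs periodic eval
    eval_steps=500,
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
# pass callbacks=callbacks when constructing the Trainer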

Output Structure

models/sft/
├── checkpoint-500/
├── checkpoint-1000/
├── final_model/
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── tokenizer/
└── training_log.json
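
final_model/ is a LoRA adapter, so it loads onto the base model with standard PEFT calls. A sketch (the tokenizer/ subfolder is assumed to contain standard tokenizer files):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")
model = PeftModel.from_pretrained(base, "models/sft/final_model")
tokenizer = AutoTokenizer.from_pretrained("models/sft/final_model/tokenizer")

merged = model.merge_and_unload()   # optional: bake the adapter into the weights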

Resuming Training

halo-forge sft train \
  --config configs/sft.yaml \
  --resume models/sft/checkpoint-1000
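
Resuming typically restores optimizer state and the learning-rate schedule along with the model weights, so training continues where it stopped rather than restarting the schedule.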

Why SFT First?

RAFT works by filtering model outputs. If the base model can’t produce any valid code, there’s nothing to filter.
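
A minimal sketch of the filtering idea (illustrative only; the verifier here byte-compiles Python, while halo-forge's actual reward and API may differ):

import py_compile
import tempfile

def compiles(code: str) -> bool:
    """Reward check: does the generated code byte-compile?"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False

samples = ["print('ok')", "def broken(:"]    # stand-ins for model outputs
kept = [s for s in samples if compiles(s)]   # fine-tune on the survivors
print(kept)                                  # ["print('ok')"]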

Stage          Compile Rate
Base Qwen 7B   ~5%
After SFT      ~15-25%
After RAFT     ~45-55%

SFT creates the foundation that RAFT refines.