SFT Training

Supervised fine-tuning to establish baseline capability

Supervised fine-tuning (SFT) establishes the baseline capabilities that RAFT later refines.

Basic Usage

# List available SFT datasets
halo-forge sft datasets

# Train with HuggingFace dataset
halo-forge sft train \
  --dataset codealpaca \
  --model Qwen/Qwen2.5-Coder-7B \
  --output models/sft \
  --epochs 3

Using Local Data

halo-forge sft train \
  --data data/train.jsonl \
  --output models/sft \
  --model Qwen/Qwen2.5-Coder-7B \
  --epochs 3
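The local file must be JSON Lines: one JSON object per line. halo-forge's exact schema isn't shown here, so the `prompt`/`completion` field names below are an assumption for illustration; check the project's data docs for the real keys.

```python
import json

# Hypothetical schema: the field names are an assumption, not confirmed
# by the halo-forge documentation.
records = [
    {"prompt": "Write a Python function that reverses a string.",
     "completion": "def reverse(s):\n    return s[::-1]\n"},
    {"prompt": "Write a function that checks whether a number is even.",
     "completion": "def is_even(n):\n    return n % 2 == 0\n"},
]

# JSONL: one serialized object per line, newline-terminated.
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```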

Available SFT Datasets

| Domain    | Short Name             | HuggingFace ID                       | Size |
|-----------|------------------------|--------------------------------------|------|
| Code      | codealpaca             | sahil2801/CodeAlpaca-20k             | 20K  |
| Code      | code_instructions_122k | TokenBender/code_instructions_122k   | 122K |
| Reasoning | metamath               | meta-math/MetaMathQA                 | 395K |
| Reasoning | gsm8k_sft              | gsm8k                                | 8.5K |
| VLM       | llava                  | liuhaotian/LLaVA-Instruct-150K       | 150K |
| Audio     | librispeech_sft        | librispeech_asr                      | 100h |
| Agentic   | xlam_sft               | Salesforce/xlam-function-calling-60k | 60K  |
| Agentic   | glaive_sft             | glaiveai/glaive-function-calling-v2  | 113K |

Domain-Specific SFT

Each domain has its own SFT command with optimized defaults:

# VLM
halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct

# Audio
halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small

# Reasoning
halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct

# Agentic (Tool Calling)
halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct

Configuration

# configs/sft.yaml
model:
  name: Qwen/Qwen2.5-Coder-7B
  
data:
  train_file: data/train.jsonl
  max_seq_length: 2048

training:
  num_train_epochs: 3
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 16
  learning_rate: 2e-5
  warmup_ratio: 0.03
  bf16: true
  gradient_checkpointing: true
  dataloader_num_workers: 0      # Required for Strix Halo
  dataloader_pin_memory: false   # Required for Strix Halo

lora:
  r: 64
  alpha: 128
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

output:
  dir: models/sft
  save_steps: 500
  logging_steps: 10

LoRA Settings

| Parameter      | Value      | Notes                        |
|----------------|------------|------------------------------|
| r              | 64         | Rank; higher = more capacity |
| alpha          | 128        | Scaling; typically 2× rank   |
| dropout        | 0.05       | Regularization               |
| target_modules | all linear | Full coverage                |
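As a quick sanity check on these settings, the LoRA scaling factor and per-layer parameter cost can be computed directly. The 4096 hidden size below is an illustrative assumption, not a value taken from halo-forge:

```python
r = 64
alpha = 128

# LoRA scales its low-rank update by alpha / r, so alpha = 2r gives a factor of 2.
scaling = alpha / r
print(scaling)  # 2.0

# A target linear layer of shape (d_in, d_out) gains r * (d_in + d_out)
# trainable parameters. d_in = d_out = 4096 is an assumed example hidden size.
d_in = d_out = 4096
added_params = r * (d_in + d_out)
print(added_params)  # 524288
```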

Learning Rate

  • Start with 2e-5 for 7B models
  • Use 5e-5 for smaller models (0.5B-3B)
  • A short warmup (warmup_ratio: 0.03 above) stabilizes early training

Batch Size

Effective batch size = per_device_batch_size × gradient_accumulation_steps

For 7B on Strix Halo:

  • per_device_train_batch_size: 2
  • gradient_accumulation_steps: 16
  • Effective: 32
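The arithmetic above, plus the resulting optimizer steps per epoch, can be sketched directly. The 20K dataset size is an assumption chosen to match codealpaca:

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 16

# One optimizer update happens per accumulation cycle.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 32

# Steps per epoch for an assumed 20K-example dataset (codealpaca's size).
dataset_size = 20_000
steps_per_epoch = dataset_size // effective_batch_size
print(steps_per_epoch)  # 625
```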

Early Stopping

Watch the loss curve. If validation loss stops decreasing, enable early stopping:

training:
  early_stopping: true
  early_stopping_patience: 3

Output Structure

models/sft/
├── checkpoint-500/
├── checkpoint-1000/
├── final_model/
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── tokenizer/
└── training_log.json

Resuming Training

halo-forge sft train \
  --config configs/sft.yaml \
  --resume models/sft/checkpoint-1000

Why SFT First?

RAFT works by filtering model outputs. If the base model can’t produce any valid code, there’s nothing to filter.

| Stage        | Compile Rate |
|--------------|--------------|
| Base Qwen 7B | ~5%          |
| After SFT    | ~15-25%      |
| After RAFT   | ~45-55%      |

SFT creates the foundation that RAFT refines.
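The filter-then-train idea behind RAFT can be sketched in a few lines of Python, using the built-in compile() as a stand-in for halo-forge's real verifier (an assumption for illustration; the actual pipeline runs compilers and tests):

```python
# Candidate completions sampled from the model: one valid, one broken.
candidates = [
    "def add(a, b):\n    return a + b\n",   # compiles
    "def broken(:\n    return\n",            # syntax error
]

def passes_verifier(src: str) -> bool:
    """Stand-in verifier: does the candidate parse as Python?"""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

# RAFT-style filtering: only verified samples feed the next fine-tuning round.
kept = [c for c in candidates if passes_verifier(c)]
print(len(kept))  # 1
```

A base model that passes the verifier ~5% of the time yields almost no training signal per batch of samples, which is why SFT comes first.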