# Vision-Language Model Training

> ⚠️ **Experimental Feature**: VLM training is under active development. APIs may change.
Train vision-language models (VLMs) using RLVR with perception-aware verification.
## Overview
Phase 3 of halo-forge extends the RAFT training framework to support vision-language models. This enables training models like Qwen-VL and LLaVA on visual question answering tasks with graduated reward signals.
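At a high level, each RAFT cycle samples several completions per image/prompt pair, scores them with the multi-stage verifier described below, keeps only the highest-reward completions, and fine-tunes on that filtered set. The sketch below is a conceptual illustration of that loop, not the halo-forge API; `generate`, `verify`, and `finetune` are caller-supplied placeholders.

```python
# Conceptual RAFT cycle for VLM training (illustrative, not the halo-forge API).
# `generate`, `verify`, and `finetune` are caller-supplied callables.
def raft_cycle(model, dataset, generate, verify, finetune,
               samples_per_prompt=4, keep_fraction=0.25):
    kept = []
    for image, prompt, reference in dataset:
        # Sample several candidate completions for the same image/prompt pair.
        candidates = generate(model, image, prompt, n=samples_per_prompt)
        # Score each candidate with the combined multi-stage reward (0.0-1.0).
        scored = sorted(
            ((verify(image, prompt, c, reference), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # Keep only the top-scoring completions for this prompt.
        n_keep = max(1, int(len(scored) * keep_fraction))
        kept.extend((image, prompt, c) for _, c in scored[:n_keep])
    # Fine-tune on the filtered, high-reward completions and return the new model.
    return finetune(model, kept)
```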
## Architecture

```
                    VLM RAFT Pipeline

  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐
  │  Image   │───▶│   VLM    │───▶│      Completion      │
  │  Prompt  │    │  Model   │    └──────────┬───────────┘
  └──────────┘    └──────────┘               ▼
  ┌────────────────────────────────────────────────┐
  │            Multi-Stage Verification            │
  ├──────────────┬──────────────┬──────────────────┤
  │  Perception  │  Reasoning   │      Output      │
  │    (0.3)     │    (0.4)     │      (0.3)       │
  │              │              │                  │
  │ • Objects    │ • Structure  │ • Exact match    │
  │ • OCR        │ • Consistency│ • Fuzzy match    │
  │ • Spatial    │ • Grounding  │ • Semantic sim   │
  │ • Counting   │              │                  │
  └──────────────┴──────────────┴──────────────────┘
                          │
                          ▼
                  Combined Reward
                    (0.0 - 1.0)
```
## Quick Start

### Train on TextVQA

```bash
halo-forge vlm train \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --dataset textvqa \
  --cycles 6 \
  --output models/vlm_raft
```
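Each training cycle writes a checkpoint under the `--output` directory (for example, `models/vlm_raft/cycle_6` after the sixth cycle), which is the path the benchmark command below loads.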
### Benchmark a Model

```bash
halo-forge vlm benchmark \
  --model models/vlm_raft/cycle_6 \
  --dataset docvqa \
  --limit 100
```
### List Available Datasets

```bash
halo-forge vlm datasets
```
## Supported Models
| Model | Adapter | Notes |
|---|---|---|
| `Qwen/Qwen2-VL-2B-Instruct` | `qwen_vl` | Lightweight |
| `Qwen/Qwen2-VL-7B-Instruct` | `qwen_vl` | Recommended |
| `Qwen/Qwen2-VL-72B-Instruct` | `qwen_vl` | Requires 128 GB+ |
| `llava-hf/llava-1.5-7b-hf` | `llava` | Good baseline |
| `llava-hf/llava-v1.6-34b-hf` | `llava` | High quality |
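The table above pairs each model family with an adapter name. If you need to resolve the adapter from a model ID programmatically, a minimal prefix-based lookup might look like this (a hypothetical helper, not part of halo-forge):

```python
# Illustrative prefix-based lookup of the adapter names from the table above.
# `adapter_for` is a hypothetical helper, not a halo-forge function.
ADAPTER_BY_PREFIX = {
    "Qwen/Qwen2-VL": "qwen_vl",
    "llava-hf/": "llava",
}


def adapter_for(model_name: str) -> str:
    for prefix, adapter in ADAPTER_BY_PREFIX.items():
        if model_name.startswith(prefix):
            return adapter
    raise ValueError(f"Unsupported VLM: {model_name}")
```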
## Supported Datasets
| Dataset | Task | Size |
|---|---|---|
| TextVQA | Text reading in images | 45K train |
| DocVQA | Document understanding | 50K train |
| ChartQA | Chart interpretation | 28K train |
| RealWorldQA | Real-world reasoning | 700 test |
| MathVista | Math with visuals | 6K+ test |
## Key Features

### Perception Verification

The `PerceptionChecker` validates visual claims made in the completion (a minimal sketch follows the list):
- Object Detection: Uses YOLOv8 to verify claimed objects exist
- OCR Verification: Uses EasyOCR to verify text extraction
- Spatial Reasoning: Validates “left of”, “above”, etc.
- Counting: Verifies object counts
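The following sketch shows what the object and OCR sub-checks can look like, using the YOLOv8 (`ultralytics`) and `easyocr` packages named above. The functions and scoring scheme are illustrative assumptions, not the actual `PerceptionChecker`; spatial and counting checks would compare bounding-box positions and detection counts in the same spirit.

```python
# Illustrative perception sub-checks (not the actual PerceptionChecker).
from ultralytics import YOLO  # YOLOv8 object detection
import easyocr                # OCR

detector = YOLO("yolov8n.pt")
reader = easyocr.Reader(["en"])


def object_score(image_path: str, claimed_objects: list[str]) -> float:
    """Fraction of objects claimed in the completion that YOLOv8 detects."""
    result = detector(image_path)[0]
    detected = {result.names[int(cls)] for cls in result.boxes.cls}
    if not claimed_objects:
        return 1.0
    return sum(obj.lower() in detected for obj in claimed_objects) / len(claimed_objects)


def ocr_score(image_path: str, claimed_text: list[str]) -> float:
    """Fraction of text snippets claimed in the completion that EasyOCR finds."""
    found = " ".join(reader.readtext(image_path, detail=0)).lower()
    if not claimed_text:
        return 1.0
    return sum(snippet.lower() in found for snippet in claimed_text) / len(claimed_text)
```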
### Reasoning Verification

The `ReasoningChecker` validates chain-of-thought quality (sketched below):
- Structure: Proper reasoning steps
- Consistency: No contradictions
- Grounding: References to visual evidence
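The sketch below illustrates the flavor of these checks with simple text heuristics; the patterns and scoring are assumptions for illustration, not the actual `ReasoningChecker` logic.

```python
import re

# Illustrative chain-of-thought heuristics (not the actual ReasoningChecker).
STEP_RE = re.compile(r"(?:^|\n)\s*(?:step\s*\d+|first|then|next|finally|\d+\.)", re.I)
GROUNDING_RE = re.compile(r"\b(?:in the image|in the chart|in the document|shown|visible)\b", re.I)
CONTRADICTION_RE = re.compile(r"\b(?:wait,? no|that(?:'s| is) wrong|contradicts)\b", re.I)


def reasoning_score(completion: str) -> float:
    """Average of three 0/1 checks: step structure, consistency, visual grounding."""
    structure = 1.0 if len(STEP_RE.findall(completion)) >= 2 else 0.0
    consistency = 0.0 if CONTRADICTION_RE.search(completion) else 1.0
    grounding = 1.0 if GROUNDING_RE.search(completion) else 0.0
    return (structure + consistency + grounding) / 3
```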
### Output Verification

The `OutputChecker` validates the final answer against the reference (see the sketch after the list):
- Exact Match: Direct comparison
- Fuzzy Match: Similarity-based
- Semantic: Embedding similarity (optional)
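Exact and fuzzy matching can be sketched with the standard library alone; the threshold and scoring below are illustrative assumptions rather than the actual `OutputChecker`, and the optional semantic branch would plug in an embedding model instead of returning zero.

```python
from difflib import SequenceMatcher


# Illustrative answer scoring (not the actual OutputChecker).
def output_score(answer: str, reference: str, fuzzy_threshold: float = 0.8) -> float:
    """Exact match earns full credit; close fuzzy matches earn partial credit."""
    a, r = answer.strip().lower(), reference.strip().lower()
    if a == r:
        return 1.0  # exact match
    ratio = SequenceMatcher(None, a, r).ratio()
    if ratio >= fuzzy_threshold:
        return ratio  # fuzzy match: partial credit proportional to similarity
    return 0.0  # a semantic-similarity fallback could go here
```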
## Configuration

```python
from halo_forge.vlm.trainer import VLMRAFTConfig, VLMRAFTTrainer

config = VLMRAFTConfig(
    model_name="Qwen/Qwen2-VL-7B-Instruct",
    num_cycles=6,
    samples_per_prompt=4,
    # Verification weights
    perception_weight=0.3,
    reasoning_weight=0.4,
    output_weight=0.3,
    # Training
    learning_rate=5e-5,
    lr_decay_per_cycle=0.85,
    temperature=0.7,
)

trainer = VLMRAFTTrainer(config)
trainer.train("textvqa")
```
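The three verification weights mirror the stage weights in the architecture diagram and sum to 1.0, which keeps the combined reward in the 0.0-1.0 range. One simple way to combine per-stage scores consistent with those weights, shown here as an illustrative sketch rather than the halo-forge implementation:

```python
def combined_reward(perception: float, reasoning: float, output: float,
                    weights=(0.3, 0.4, 0.3)) -> float:
    """Weighted sum of the three stage scores; stays in [0, 1] if weights sum to 1."""
    w_p, w_r, w_o = weights
    return w_p * perception + w_r * reasoning + w_o * output


# Strong perception, weak reasoning, partially correct answer:
print(round(combined_reward(0.9, 0.4, 0.5), 2))  # 0.58
```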
## Memory Requirements
| Model Size | Training Memory | Inference Memory |
|---|---|---|
| 2B VLM | ~40 GB | ~8 GB |
| 7B VLM | ~75 GB | ~20 GB |
| 72B VLM | ~128 GB+ | ~50 GB |
## Next Steps
- Perception Verification - How perception checking works
- Dataset Loaders - Preparing VLM datasets
- CLI Reference - All VLM commands