⚠️ Experimental Feature: VLM training is under active development. APIs may change.

Vision-Language Model Training

Train vision-language models (VLMs) using RLVR (reinforcement learning with verifiable rewards) with perception-aware verification.

Overview

Phase 3 of halo-forge extends the RAFT training framework to support vision-language models. This enables training models like Qwen-VL and LLaVA on visual question answering tasks with graduated reward signals.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     VLM RAFT Pipeline                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────────┐  │
│  │  Image   │───▶│   VLM    │───▶│     Completion       │  │
│  │  Prompt  │    │  Model   │    │                      │  │
│  └──────────┘    └──────────┘    └──────────┬───────────┘  │
│                                              │               │
│                                              ▼               │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Multi-Stage Verification                 │  │
│  ├──────────────┬──────────────┬────────────────────────┤  │
│  │  Perception  │  Reasoning   │       Output           │  │
│  │   (0.3)      │   (0.4)      │       (0.3)            │  │
│  │              │              │                        │  │
│  │ • Objects    │ • Structure  │ • Exact match          │  │
│  │ • OCR        │ • Consistency│ • Fuzzy match          │  │
│  │ • Spatial    │ • Grounding  │ • Semantic sim         │  │
│  │ • Counting   │              │                        │  │
│  └──────────────┴──────────────┴────────────────────────┘  │
│                          │                                  │
│                          ▼                                  │
│                   Combined Reward                           │
│                      (0.0 - 1.0)                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
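
The three stage scores are combined as a weighted sum, using the weights shown in the diagram (0.3 / 0.4 / 0.3). Below is a minimal sketch of that combination, assuming each stage returns a score in the 0.0-1.0 range; the function and argument names are illustrative, not halo-forge's actual API:

# Illustrative sketch only: names are assumptions, not halo-forge's API.
# The weights mirror the diagram above.
def combined_reward(
    perception: float,   # 0.0-1.0 score from perception checks
    reasoning: float,    # 0.0-1.0 score from chain-of-thought checks
    output: float,       # 0.0-1.0 score from answer matching
    weights: tuple[float, float, float] = (0.3, 0.4, 0.3),
) -> float:
    """Weighted sum; the weights sum to 1.0, so the reward stays in 0.0-1.0."""
    w_p, w_r, w_o = weights
    return w_p * perception + w_r * reasoning + w_o * output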

Quick Start

Train on TextVQA

halo-forge vlm train \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --dataset textvqa \
    --cycles 6 \
    --output models/vlm_raft

Benchmark a Model

halo-forge vlm benchmark \
    --model models/vlm_raft/cycle_6 \
    --dataset docvqa \
    --limit 100

List Available Datasets

halo-forge vlm datasets

Supported Models

Model                        Adapter   Notes
Qwen/Qwen2-VL-2B-Instruct    qwen_vl   Lightweight
Qwen/Qwen2-VL-7B-Instruct    qwen_vl   Recommended
Qwen/Qwen2-VL-72B-Instruct   qwen_vl   Requires 128 GB+
llava-hf/llava-1.5-7b-hf     llava     Good baseline
llava-hf/llava-v1.6-34b-hf   llava     High quality

Supported Datasets

Dataset       Task                     Size
TextVQA       Text reading in images   45K train
DocVQA        Document understanding   50K train
ChartQA       Chart interpretation     28K train
RealWorldQA   Real-world reasoning     700 test
MathVista     Math with visuals        6K+ test

Key Features

Perception Verification

The PerceptionChecker validates visual claims:

  • Object Detection: Uses YOLOv8 to verify claimed objects exist
  • OCR Verification: Uses EasyOCR to verify text extraction
  • Spatial Reasoning: Validates “left of”, “above”, etc.
  • Counting: Verifies object counts
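
As a concrete illustration of the OCR check, the sketch below uses EasyOCR to test whether a piece of text claimed in a completion actually appears in the image. This is an assumption about how such a check could work, not the PerceptionChecker's actual implementation:

# Hypothetical OCR verification sketch (not PerceptionChecker internals).
# Requires: pip install easyocr
import easyocr

def ocr_claim_supported(image_path: str, claimed_text: str) -> float:
    """Score 1.0 if the claimed text shows up in the OCR output, else 0.0."""
    reader = easyocr.Reader(["en"])                    # load English OCR models
    fragments = reader.readtext(image_path, detail=0)  # detail=0 -> plain strings
    haystack = " ".join(fragments).lower()
    return 1.0 if claimed_text.strip().lower() in haystack else 0.0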

Reasoning Verification

The ReasoningChecker validates chain-of-thought quality:

  • Structure: Proper reasoning steps
  • Consistency: No contradictions
  • Grounding: References to visual evidence
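
A rough sketch of what such heuristics could look like follows; these specific rules are assumptions for illustration, not the ReasoningChecker's actual logic:

# Hypothetical reasoning heuristics (not the ReasoningChecker's actual rules).
import re

GROUNDING_PHRASES = ("in the image", "shown", "visible", "the chart", "the picture")

def reasoning_score(completion: str) -> float:
    """Average of three crude 0/1 checks: structure, consistency, grounding."""
    text = completion.lower()
    words = set(re.findall(r"[a-z]+", text))
    has_structure = bool(re.search(r"\b(first|then|next|therefore|step)\b", text))
    is_consistent = not {"yes", "no"} <= words   # both answers present -> contradiction
    is_grounded = any(phrase in text for phrase in GROUNDING_PHRASES)
    return (has_structure + is_consistent + is_grounded) / 3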

Output Verification

The OutputChecker validates final answers:

  • Exact Match: Direct comparison
  • Fuzzy Match: Similarity-based
  • Semantic: Embedding similarity (optional)
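
For the exact and fuzzy checks, here is a minimal sketch using Python's standard difflib; the 0.8 threshold is an assumption, not a documented halo-forge default:

# Hypothetical answer-matching sketch; the threshold value is an assumption.
from difflib import SequenceMatcher

def output_score(prediction: str, reference: str, threshold: float = 0.8) -> float:
    """1.0 for an exact match; otherwise the fuzzy ratio if it clears the threshold."""
    pred = prediction.strip().lower()
    ref = reference.strip().lower()
    if pred == ref:                                   # exact match
        return 1.0
    ratio = SequenceMatcher(None, pred, ref).ratio()  # fuzzy similarity in 0.0-1.0
    return ratio if ratio >= threshold else 0.0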

Configuration

from halo_forge.vlm.trainer import VLMRAFTConfig, VLMRAFTTrainer

config = VLMRAFTConfig(
    model_name="Qwen/Qwen2-VL-7B-Instruct",
    num_cycles=6,
    samples_per_prompt=4,
    
    # Verification weights
    perception_weight=0.3,
    reasoning_weight=0.4,
    output_weight=0.3,
    
    # Training
    learning_rate=5e-5,
    lr_decay_per_cycle=0.85,
    temperature=0.7,
)

trainer = VLMRAFTTrainer(config)
trainer.train("textvqa")
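
Note that the three verification weights match the stage weights in the architecture diagram and sum to 1.0, which keeps a weighted combined reward within the 0.0-1.0 range.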

Memory Requirements

Model Size   Training Memory   Inference Memory
2B VLM       ~40 GB            ~8 GB
7B VLM       ~75 GB            ~20 GB
72B VLM      128 GB+           ~50 GB

Next Steps