VLM Datasets

Guide to obtaining and using datasets for Vision-Language Model training.

Available Datasets

List available VLM datasets:

halo-forge vlm datasets
| Dataset | HuggingFace Path | Task | Size |
|---------|------------------|------|------|
| textvqa | textvqa | Text reading in images | 45K train |
| docvqa | lmms-lab/DocVQA | Document understanding | 50K train |
| chartqa | HuggingFaceM4/ChartQA | Chart interpretation | 28K train |
| realworldqa | lmms-lab/RealWorldQA | Real-world reasoning | 700 test |
| mathvista | AI4Math/MathVista | Mathematical reasoning | 6K+ test |

Using Built-in Loaders

Load from CLI

# Benchmark on TextVQA
halo-forge vlm benchmark \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --dataset textvqa \
  --limit 100

# Train on DocVQA
halo-forge vlm train \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --dataset docvqa \
  --cycles 4 \
  --output models/vlm_raft

Load Programmatically

from halo_forge.vlm.data import load_vlm_dataset, list_vlm_datasets

# List available datasets
print(list_vlm_datasets())
# ['textvqa', 'docvqa', 'chartqa', 'realworldqa', 'mathvista']

# Load TextVQA
dataset = load_vlm_dataset("textvqa", split="train", limit=1000)

# Iterate samples
for sample in dataset:
    print(f"Prompt: {sample.prompt}")
    print(f"Ground truth: {sample.ground_truth}")
    image = sample.load_image()  # PIL Image

VLMSample Format

Each sample contains:

@dataclass
class VLMSample:
    prompt: str           # Question/instruction
    image: str | Image    # Path, URL, or PIL Image
    ground_truth: str     # Expected answer
    metadata: Dict        # Additional info

Example Sample

sample = VLMSample(
    prompt="What text is shown on the sign?",
    image="path/to/image.jpg",
    ground_truth="STOP",
    metadata={"source": "textvqa", "difficulty": "easy"}
)

Dataset Loaders

TextVQA

Text reading in natural images (signs, labels, documents).

from halo_forge.vlm.data.loaders import TextVQALoader

loader = TextVQALoader(split="train", limit=500)
samples = loader.load()

# Sample prompt: "What does the sign say?"
# Sample answer: "EXIT"

DocVQA

Document understanding and information extraction.

from halo_forge.vlm.data.loaders import DocVQALoader

loader = DocVQALoader(split="train", limit=500)
samples = loader.load()

# Sample prompt: "What is the total amount due?"
# Sample answer: "$1,234.56"

ChartQA

Chart and graph interpretation.

from halo_forge.vlm.data.loaders import ChartQALoader

loader = ChartQALoader(split="train", limit=500)
samples = loader.load()

# Sample prompt: "What is the value for Q3?"
# Sample answer: "150"

RealWorldQA

Real-world reasoning from images.

from halo_forge.vlm.data.loaders import RealWorldQALoader

loader = RealWorldQALoader(limit=200)
samples = loader.load()

# Sample prompt: "How many people are in the image?"
# Sample answer: "3"

MathVista

Mathematical reasoning with visual context.

from halo_forge.vlm.data.loaders import MathVistaLoader

loader = MathVistaLoader(split="test", limit=100)
samples = loader.load()

# Sample prompt: "What is the area of the shaded region?"
# Sample answer: "25 square units"

Export Formats

Export to RLVR Format

dataset = load_vlm_dataset("textvqa", limit=1000)

# Export to JSONL
dataset.to_rlvr_format("vlm_rlvr.jsonl")

Output format:

{
  "prompt": "What text is shown on the sign?",
  "image": "/path/to/image.jpg",
  "ground_truth": "STOP",
  "metadata": {"source": "textvqa"}
}
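
The exported file is plain JSONL, so it can be inspected or spot-checked with the standard library alone; a quick sketch:

import json

# Read the exported RLVR file back and check a few records.
with open("vlm_rlvr.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records))
print(records[0]["prompt"], "->", records[0]["ground_truth"])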

Export to SFT Format

dataset = load_vlm_dataset("docvqa", limit=1000)

# Export with Qwen template
dataset.to_sft_format("vlm_sft.jsonl", template="qwen")

Output format:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<image>\nWhat is the total amount?"},
    {"role": "assistant", "content": "The total amount is $1,234.56"}
  ],
  "images": ["/path/to/image.jpg"]
}
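
If you need the same structure outside of to_sft_format (for example in a custom conversion script), it can be built by hand; a minimal sketch mirroring the output above (to_sft_record is a hypothetical helper, and the system prompt and <image> token are taken from the example output):

import json

def to_sft_record(prompt, image_path, answer,
                  system="You are a helpful assistant."):
    # Build one messages-format record matching the output shown above.
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": f"<image>\n{prompt}"},
            {"role": "assistant", "content": answer},
        ],
        "images": [image_path],
    }

record = to_sft_record("What is the total amount?", "/path/to/image.jpg",
                       "The total amount is $1,234.56")
print(json.dumps(record, indent=2))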

Creating Custom VLM Datasets

From Local Images

from halo_forge.vlm.data import VLMSample, VLMDataset
import json

class CustomVLMDataset(VLMDataset):
    @property
    def name(self) -> str:
        return "custom"
    
    def load(self):
        # Load from annotation file
        with open("annotations.json") as f:
            data = json.load(f)
        
        self.samples = [
            VLMSample(
                prompt=item["question"],
                image=f"images/{item['image_id']}.jpg",
                ground_truth=item["answer"],
                metadata={"id": item["id"]}
            )
            for item in data
        ]
        return self.samples

# Use
dataset = CustomVLMDataset()
dataset.load()
dataset.to_rlvr_format("custom_vlm.jsonl")

From HuggingFace

from datasets import load_dataset
from halo_forge.vlm.data import VLMSample
import json

# Load any VQA dataset
hf_dataset = load_dataset("flaviagiammarino/vqa-rad", split="train")

# Convert to VLMSample format
samples = []
for item in hf_dataset:
    samples.append({
        "prompt": item["question"],
        "image": item["image"],  # PIL Image
        "ground_truth": item["answer"],
        "metadata": {"source": "vqa-rad"}
    })

# Save
with open("vqa_rad.jsonl", "w") as f:
    for s in samples:
        # Note: images need to be saved separately
        f.write(json.dumps({
            "prompt": s["prompt"],
            "ground_truth": s["ground_truth"],
            "metadata": s["metadata"]
        }) + "\n")
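
As the comment above notes, the PIL images must be written to disk separately so the JSONL can reference real paths. One way to do that, building on the hf_dataset and json objects from the previous block (the output directory name is an assumption):

from pathlib import Path

out_dir = Path("vqa_rad_images")
out_dir.mkdir(exist_ok=True)

with open("vqa_rad.jsonl", "w") as f:
    for i, item in enumerate(hf_dataset):
        # Save each PIL image and record its path alongside the prompt.
        image_path = out_dir / f"{i:06d}.jpg"
        item["image"].convert("RGB").save(image_path)
        f.write(json.dumps({
            "prompt": item["question"],
            "image": str(image_path),
            "ground_truth": item["answer"],
            "metadata": {"source": "vqa-rad"},
        }) + "\n")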

HuggingFace Sources

| Dataset | HuggingFace Path | Description |
|---------|------------------|-------------|
| TextVQA | textvqa | Text reading in images |
| DocVQA | lmms-lab/DocVQA | Document QA |
| ChartQA | HuggingFaceM4/ChartQA | Chart understanding |
| VQA v2 | HuggingFaceM4/VQAv2 | General visual QA |
| OK-VQA | Multimodal-Fatima/OK-VQA_train | Knowledge-based VQA |
| GQA | lmms-lab/GQA | Compositional reasoning |
| AI2D | lmms-lab/ai2d | Science diagrams |
| InfoVQA | lmms-lab/infographicvqa | Infographic QA |

Loading Directly

from datasets import load_dataset

# Load VQA v2
vqa = load_dataset("HuggingFaceM4/VQAv2", split="train")

# Access samples
for item in vqa:
    question = item["question"]
    image = item["image"]  # PIL Image
    answers = item["answers"]

Image Preprocessing

VLMPreprocessor

from halo_forge.vlm.data import VLMPreprocessor

processor = VLMPreprocessor(
    target_size=(224, 224),
    normalize=True
)

# Process image
result = processor.process("image.jpg")
pixel_values = result.pixel_values  # Tensor

Custom Preprocessing

from PIL import Image
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

image = Image.open("image.jpg")
tensor = transform(image)
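
To run several images through a model at once, the transformed tensors can be stacked into a batch; a short sketch reusing the transform defined above (file names are placeholders):

import torch
from PIL import Image

# Stack a few preprocessed images into a single (N, C, H, W) batch tensor.
paths = ["image1.jpg", "image2.jpg", "image3.jpg"]
batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])
print(batch.shape)  # torch.Size([3, 3, 224, 224])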

Memory Considerations

VLM image datasets are large on disk and memory-intensive to load:

| Dataset | Images | Typical Disk Usage |
|---------|--------|--------------------|
| TextVQA | 45K | ~20 GB |
| DocVQA | 50K | ~30 GB |
| ChartQA | 28K | ~15 GB |

Tips

  1. Use the limit parameter for testing
  2. Stream images instead of loading all at once (see the sketch after this list)
  3. Use smaller image sizes for development
  4. Clear cache: rm -rf ~/.cache/halo_forge/vlm
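
Tips 1-3 can be combined with the datasets library's streaming mode so only the samples you actually touch are downloaded; a rough sketch using the HuggingFaceM4/ChartQA source from the table above (column names vary by dataset):

from itertools import islice
from datasets import load_dataset

# Stream instead of downloading the full image set up front.
stream = load_dataset("HuggingFaceM4/ChartQA", split="train", streaming=True)

for item in islice(stream, 100):      # cap the sample count instead of a full pass
    image = item["image"]             # PIL Image (column names vary by dataset)
    image.thumbnail((448, 448))       # shrink images during development
    # ... run preprocessing / evaluation here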

Next Steps