Testing Guide

VLM Module Testing Guide

Step-by-step guide to testing the halo forge Vision-Language Model training module.

Prerequisites

Required Dependencies

pip install torch transformers accelerate pillow

Verification Dependencies

Dependency              Required For                            Install
ultralytics             Object detection (PerceptionChecker)    pip install ultralytics
easyocr                 OCR verification                        pip install easyocr
sentence-transformers   Semantic similarity                     pip install sentence-transformers

Hardware Requirements

  • Minimum: 32 GB RAM, 16 GB VRAM (for 2B VLM)
  • Recommended: 64 GB RAM, 48 GB VRAM (for 7B VLM)
  • ROCm Support: AMD Radeon RX 7900 series or higher
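
A quick way to confirm which GPU PyTorch sees and how much VRAM it has (a generic check, not part of halo-forge; the torch.cuda API behaves the same under ROCm builds):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU visible to PyTorch (CUDA/ROCm)")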

Unit Tests

Run the full VLM test suite:

# From project root
cd /path/to/halo-forge

# Run all VLM tests
pytest tests/test_vlm.py tests/test_vlm_data.py -v

# Run specific test files
pytest tests/test_vlm.py -v        # Verifier tests
pytest tests/test_vlm_data.py -v   # Data loader tests

# Run specific test classes
pytest tests/test_vlm.py::TestPerceptionChecker -v
pytest tests/test_vlm.py::TestVisionVerifier -v
pytest tests/test_vlm_data.py::TestTextVQALoader -v

Expected Test Coverage

test_vlm.py:

Test Class              Tests   Description
TestDetection           3       Detection dataclass
TestPerceptionChecker   4       Object detection, OCR
TestPerceptionResult    2       Perception results
TestReasoningChecker    3       Chain-of-thought validation
TestReasoningStep       2       Reasoning step dataclass
TestOutputChecker       3       Answer verification
TestVisionVerifier      4       Multi-stage verification
TestIntegration         2       End-to-end workflows

test_vlm_data.py:

Test Class              Tests   Description
TestVLMSample           2       Sample dataclass
TestVLMDataset          3       Abstract dataset
TestTextVQALoader       3       TextVQA dataset
TestDocVQALoader        3       DocVQA dataset
TestChartQALoader       3       ChartQA dataset
TestRealWorldQALoader   3       RealWorldQA dataset
TestVLMPreprocessor     3       Image preprocessing
TestIntegration         2       End-to-end data loading

Dry Run Testing

Test commands without running actual training:

# Test CLI parsing and config validation
halo-forge vlm train \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --dataset textvqa \
  --cycles 3 \
  --output /tmp/vlm-test \
  --dry-run

# Expected output:
# Dry run mode enabled. Configuration validated successfully.
# Model: Qwen/Qwen2-VL-2B-Instruct
# Dataset: textvqa
# Cycles: 3
# Output: /tmp/vlm-test

Manual Testing Steps

1. Test Dataset Loaders

# List available VLM datasets
halo-forge vlm datasets

# Expected output:
# Available VLM Datasets:
#   - textvqa (45K samples)
#   - docvqa (50K samples)
#   - chartqa (28K samples)
#   - realworldqa (700 samples)
#   - mathvista (6K+ samples)

Test loading individual datasets in Python:

from halo_forge.vlm.data.loaders import (
    TextVQALoader, DocVQALoader, ChartQALoader, RealWorldQALoader
)

# Test TextVQA
loader = TextVQALoader(split="train", limit=10)
samples = loader.load()
print(f"Loaded {len(samples)} TextVQA samples")
print(f"First sample: {samples[0]}")

# Test DocVQA
loader = DocVQALoader(split="train", limit=10)
samples = loader.load()
print(f"Loaded {len(samples)} DocVQA samples")

2. Test Perception Checker

from halo_forge.vlm.verifiers.perception import PerceptionChecker
from PIL import Image

# Initialize checker (requires ultralytics, easyocr)
checker = PerceptionChecker()

# Test with a sample image
image = Image.open("test_image.jpg")
response = "I can see a red car and a stop sign."

result = checker.check(image, response)
print(f"Perception score: {result.score:.2f}")
print(f"Detected objects: {result.detected_objects}")
print(f"OCR text: {result.ocr_text}")

3. Test Reasoning Checker

from halo_forge.vlm.verifiers.reasoning import ReasoningChecker

checker = ReasoningChecker()

response = """
Let me analyze this step by step:
1. First, I see a document with a table
2. The table has columns for date and amount
3. Looking at the total row, the amount is $1,234
Therefore, the total is $1,234.
"""

result = checker.check(response)
print(f"Reasoning score: {result.score:.2f}")
print(f"Steps found: {len(result.steps)}")
print(f"Is coherent: {result.is_coherent}")

4. Test Output Checker

from halo_forge.vlm.verifiers.output import OutputChecker

checker = OutputChecker()

# Exact match
result = checker.check(
    prediction="$1,234",
    ground_truth="$1,234"
)
print(f"Exact match score: {result.score:.2f}")

# Fuzzy match
result = checker.check(
    prediction="The answer is 1234 dollars",
    ground_truth="$1,234"
)
print(f"Fuzzy match score: {result.score:.2f}")

5. Test Full VisionVerifier

from halo_forge.vlm.verifiers.base import VisionVerifier, VisionVerifyConfig
from PIL import Image

config = VisionVerifyConfig(
    perception_weight=0.3,
    reasoning_weight=0.4,
    output_weight=0.3
)
verifier = VisionVerifier(config)

image = Image.open("test_chart.png")
response = """
Looking at the chart, I can see three bars representing sales data.
The blue bar (Q1) is at 100, red bar (Q2) at 150, green bar (Q3) at 200.
The total sales across all quarters is 450.
"""
ground_truth = "450"

result = verifier.verify(image, response, ground_truth)
print(f"Overall score: {result.overall_score:.2f}")
print(f"Perception: {result.perception_score:.2f}")
print(f"Reasoning: {result.reasoning_score:.2f}")
print(f"Output: {result.output_score:.2f}")

6. Test VLM Benchmark

# Benchmark on a small subset
halo-forge vlm benchmark \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --dataset textvqa \
  --limit 20

# Expected output:
# Benchmarking Qwen/Qwen2-VL-2B-Instruct on textvqa...
# Samples: 20
# Average perception score: X.XX
# Average reasoning score: X.XX
# Average output score: X.XX
# Overall accuracy: XX.X%

Validation Checklist

After running tests, verify:

  • All unit tests pass (pytest tests/test_vlm.py tests/test_vlm_data.py)
  • Dry run mode works without errors
  • All dataset loaders work (halo-forge vlm datasets)
  • PerceptionChecker detects objects correctly
  • ReasoningChecker validates chain-of-thought
  • OutputChecker compares answers correctly
  • VisionVerifier combines all stages
  • VLM benchmark runs on small subset

Troubleshooting

“ultralytics not installed”

This is a warning. Install if you need object detection:

pip install ultralytics

“easyocr not installed”

Required for OCR verification:

pip install easyocr

Note: the first run downloads the EasyOCR language models (~100 MB).

“sentence-transformers not installed”

Required for semantic similarity:

pip install sentence-transformers
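
To see at a glance which of the optional verifier dependencies are present in your environment, a small standard-library check:

import importlib.util

# Optional dependencies used by the VLM verifiers
for name in ("ultralytics", "easyocr", "sentence_transformers"):
    status = "installed" if importlib.util.find_spec(name) else "MISSING"
    print(f"{name}: {status}")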

Dataset Loading Errors

Most VLM datasets require HuggingFace authentication:

huggingface-cli login

Or set the token:

export HF_TOKEN=your_token_here
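
If you prefer to authenticate from Python (for example inside a notebook), huggingface_hub provides a login helper; a minimal sketch:

import os
from huggingface_hub import login

# Uses HF_TOKEN from the environment if set; otherwise prompts for a token
login(token=os.environ.get("HF_TOKEN"))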

Image Loading Errors

Ensure images are valid and accessible:

from PIL import Image

# Quick validity check; note that verify() leaves the image object unusable,
# so reopen the file before doing any further processing
try:
    img = Image.open("test.jpg")
    img.verify()
    print("Image is valid")
except Exception as e:
    print(f"Image error: {e}")

Out of Memory (OOM)

VLM models are memory-intensive:

  • Use 2B models for testing (Qwen2-VL-2B)
  • Reduce batch size with --batch-size 1
  • Use --limit to test on fewer samples
  • Enable gradient checkpointing (via the training config, or directly on the model as sketched below)
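
If you load the model yourself with transformers (rather than through the halo-forge config), gradient checkpointing can be enabled directly on the model. A minimal sketch, assuming a recent transformers release with Qwen2-VL support:

from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype="auto",
)
# Trades compute for memory: activations are recomputed during the backward pass
model.gradient_checkpointing_enable()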

ROCm-specific Issues

If you hit attention-related errors on AMD GPUs, enable the experimental attention path:

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
halo-forge vlm train --experimental-attention ...

Next Steps