# Vision-Language Verifiers
Multi-stage verification for VLM outputs with perception-aware rewards.
## VisionVerifier

The main verifier, which combines all three stages (perception, reasoning, and output) into a single reward:
```python
from halo_forge.vlm.verifiers import VisionVerifier

verifier = VisionVerifier(
    perception_weight=0.3,
    reasoning_weight=0.4,
    output_weight=0.3,
)

result = verifier.verify(
    image=image,
    prompt="What text is on the sign?",
    completion="The sign says 'STOP'.",
    ground_truth="STOP",
)

print(f"Reward: {result.reward}")
print(f"Success: {result.success}")
```
## Perception Checker
Validates visual claims using object detection and OCR.
### Object Detection
Uses YOLOv8 to verify objects claimed in the completion:
```python
from halo_forge.vlm.verifiers import PerceptionChecker

checker = PerceptionChecker(
    detector_model="yolov8n",  # yolov8n, yolov8s, yolov8m
    confidence_threshold=0.25,
    use_ocr=True,
)

result = checker.verify(image, completion)
print(f"Object score: {result.object_score}")
print(f"Text score: {result.text_score}")
print(f"Spatial score: {result.spatial_score}")
```
### Claim Extraction

The checker extracts several kinds of claims from completions (a pattern-based sketch follows the list):
- Object claims: “I see a cat”, “there is a dog”
- Text claims: Quoted text, “says X”, “reads Y”
- Counting claims: “three cats”, “5 people”
- Spatial claims: “the dog is left of the cat”
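
A minimal sketch of how pattern-based extraction like this can work; the regexes and the `extract_object_claims` helper are illustrative assumptions, not PerceptionChecker's actual implementation:

```python
import re

# Illustrative patterns only; PerceptionChecker's real patterns may differ.
OBJECT_PATTERNS = [
    r"\b(?:i (?:can )?see|there (?:is|are))\s+(?:an?|the|\d+)?\s*(\w+)",
]
TEXT_PATTERNS = [
    r"\b(?:says|reads)\s+['\"]([^'\"]+)['\"]",
]

def extract_object_claims(completion: str) -> list[str]:
    """Collect object nouns introduced by perception phrases."""
    text = completion.lower()
    claims: list[str] = []
    for pattern in OBJECT_PATTERNS:
        claims.extend(re.findall(pattern, text))
    return claims

print(extract_object_claims("I see a cat and there is a dog."))  # ['cat', 'dog']
```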
### Verification Process
1. Run object detection on the image.
2. Run OCR on the image.
3. Extract claims from the completion.
4. Match claims against the detections and OCR results.
5. Calculate per-category scores (sketched below).
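
Conceptually, each per-category score can be read as the fraction of claims that the image evidence confirms. A minimal sketch of that idea, assuming a hypothetical `category_score` helper; the checker's exact scoring logic may differ:

```python
def category_score(claims: list[str], detected_labels: set[str]) -> float:
    """Hypothetical scoring: fraction of claims confirmed by detections."""
    if not claims:
        return 1.0  # nothing claimed in this category, nothing to penalize
    confirmed = sum(1 for claim in claims if claim in detected_labels)
    return confirmed / len(claims)

# "cat" is confirmed, "dog" is not detected -> 0.5
print(category_score(["cat", "dog"], {"cat", "person"}))
```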
## Reasoning Checker
Validates chain-of-thought quality.
```python
from halo_forge.vlm.verifiers import ReasoningChecker

checker = ReasoningChecker(
    min_steps=2,
    require_evidence=True,
    require_conclusion=True,
)

result = checker.verify(completion)
print(f"Structure: {result.structure_score}")
print(f"Consistency: {result.consistency_score}")
print(f"Grounding: {result.grounding_score}")
```
### What It Checks
| Aspect | What It Looks For | Weight |
|---|---|---|
| Structure | Numbered steps, logical progression | 0.3 |
| Consistency | No contradictions, logical flow | 0.3 |
| Grounding | References to image evidence | 0.4 |
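
Assuming the weights in the table combine linearly, the overall reasoning score would be 0.3 × structure + 0.3 × consistency + 0.4 × grounding; the exact aggregation lives in `ReasoningChecker` and may differ.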
### Evidence Patterns
The checker looks for phrases that reference visual evidence (a matching sketch follows this list):
- “I can see…”
- “Looking at the image…”
- “The picture shows…”
- “Based on the image…”
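
A minimal sketch of this style of phrase matching, assuming a hypothetical `references_visual_evidence` helper; the pattern list is illustrative and may not match ReasoningChecker's internal one:

```python
import re

# Illustrative evidence phrases; the checker's actual list may differ.
EVIDENCE_PATTERNS = [
    r"\bi can see\b",
    r"\blooking at the image\b",
    r"\bthe picture shows\b",
    r"\bbased on the image\b",
]

def references_visual_evidence(completion: str) -> bool:
    """Return True if the completion cites visual evidence."""
    text = completion.lower()
    return any(re.search(pattern, text) for pattern in EVIDENCE_PATTERNS)

print(references_visual_evidence("Looking at the image, there are two cats."))  # True
```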
## Output Checker
Validates final answer accuracy.
```python
from halo_forge.vlm.verifiers import OutputChecker

checker = OutputChecker(
    fuzzy_threshold=0.8,
    use_semantic=False,
    normalize_answers=True,
)

result = checker.verify(
    completion="The answer is STOP",
    ground_truth="stop",
)

print(f"Exact match: {result.exact_match}")
print(f"Fuzzy score: {result.fuzzy_score}")
```
### Match Types
| Type | Method | Score |
|---|---|---|
| Exact | Normalized string comparison | 1.0 |
| Fuzzy | SequenceMatcher ratio | 0.0-1.0 |
| Semantic | Embedding similarity | 0.0-1.0 |
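
Since the table names SequenceMatcher, fuzzy scoring presumably reduces to `difflib.SequenceMatcher.ratio()` over normalized strings. A sketch under that assumption (the lowercase/strip normalization is a guess):

```python
from difflib import SequenceMatcher

def fuzzy_score(answer: str, ground_truth: str) -> float:
    # Lowercase/strip normalization is assumed, not confirmed.
    a, b = answer.strip().lower(), ground_truth.strip().lower()
    return SequenceMatcher(None, a, b).ratio()

print(fuzzy_score("STOP", "stop"))       # 1.0
print(fuzzy_score("stop sign", "stop"))  # ~0.62, below the default 0.8 threshold
```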
### Answer Formats

Supports common VQA answer formats (a normalization sketch follows the list):
- Yes/No answers
- Numbers
- Multiple choice (A/B/C/D)
- Short answers
- Long answers
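
The `normalize_answers=True` option suggests answers are canonicalized before comparison. Below is a hypothetical normalizer in the style of common VQA evaluation (lowercasing, stripping articles and punctuation); OutputChecker's actual rules may differ:

```python
import re

def normalize_answer(answer: str) -> str:
    """Hypothetical VQA-style answer normalization."""
    text = answer.strip().lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    text = re.sub(r"[^\w\s]", "", text)          # drop punctuation
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

print(normalize_answer("The 'STOP' sign."))  # "stop sign"
```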
## Specialized Verifiers
### VQAVerifier
Optimized for Visual Question Answering:
```python
from halo_forge.vlm.verifiers.base import VQAVerifier

verifier = VQAVerifier()  # Output weight = 0.6
```
### DocVQAVerifier
Optimized for document understanding:
```python
from halo_forge.vlm.verifiers.base import DocVQAVerifier

verifier = DocVQAVerifier()  # Perception weight = 0.4, OCR enabled
```
### ChartQAVerifier
Optimized for chart interpretation:
```python
from halo_forge.vlm.verifiers.base import ChartQAVerifier

verifier = ChartQAVerifier()  # Balanced weights
```
## Reward Calculation
The final reward is a weighted combination:
```python
reward = (perception_weight * perception_score
          + reasoning_weight * reasoning_score
          + output_weight * output_score)
```
If no ground truth is provided, the output term is dropped and the remaining weights are renormalized:
```python
# Without ground truth
reward = (perception_weight * perception_score
          + reasoning_weight * reasoning_score) / (perception_weight + reasoning_weight)
```
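
For example, with the weights from the first snippet (0.3, 0.4, 0.3) and stage scores of 0.8, 0.5, and 1.0, the full reward is 0.3 × 0.8 + 0.4 × 0.5 + 0.3 × 1.0 = 0.74; without ground truth it becomes (0.3 × 0.8 + 0.4 × 0.5) / 0.7 ≈ 0.63.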
## Dependencies
Install perception verification dependencies:
```bash
pip install ultralytics easyocr
```
For semantic matching:
```bash
pip install sentence-transformers
```