Experimental Features

Features under active development: VLM, Audio, Reasoning, Agentic, Inference

This section contains features that are currently in development and testing. These modules extend halo-forge beyond code generation training into new domains.

Status: These features are functional but may change significantly as we iterate.

Modality Training Capability Matrix

| Modality | Runtime Status | Gate | Supported Model Families |
|---|---|---|---|
| VLM | Real training | Optional compatibility flag if temporarily re-gated | qwen2-vl, qwen-vl, llava |
| Audio | Real training | Optional compatibility flag if temporarily re-gated | whisper |
| Reasoning | Real training | Optional compatibility flag if temporarily re-gated | qwen2.5, qwen2, qwen, llama-3, llama3, mistral |
| Agentic | Real training | Optional compatibility flag if temporarily re-gated | qwen2.5, qwen2, qwen, llama-3, llama3, mistral |

All modality train commands also support --resume-from-cycle and persist canonical artifacts: cycle_<n>/model, cycle_<n>/checkpoint_state.json, latest_checkpoint.json, final_model, and training_summary.json. If a modality is temporarily re-gated to prototype, add --allow-prototype-train as a compatibility override.


Vision-Language (VLM)

Train vision-language models using RLVR with perception-aware verification.

| Component | Status | Description |
|---|---|---|
| VLMRAFTTrainer | Beta | RAFT training for VLMs |
| VisionVerifier | Beta | Multi-stage verification |
| PerceptionChecker | Beta | YOLOv8 + EasyOCR |

Quick Start

```bash
# Full pipeline
halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct --output models/vlm_sft
halo-forge vlm train --model models/vlm_sft --dataset textvqa --cycles 6 --output models/vlm_raft --seed 42
halo-forge vlm benchmark --model models/vlm_raft --dataset docvqa --limit 100

# Quick RAFT (skip SFT)
halo-forge vlm train --model Qwen/Qwen2-VL-7B-Instruct --dataset textvqa --cycles 6 --seed 42
```

Supported Models

| Model | Adapter | Notes |
|---|---|---|
| Qwen/Qwen2-VL-2B-Instruct | qwen_vl | Lightweight |
| Qwen/Qwen2-VL-7B-Instruct | qwen_vl | Recommended |
| llava-hf/llava-1.5-7b-hf | llava | Good baseline |

Datasets

| Dataset | Task | Size |
|---|---|---|
| TextVQA | Text reading in images | 45K train |
| DocVQA | Document understanding | 50K train |
| ChartQA | Chart interpretation | 28K train |
| RealWorldQA | Real-world reasoning | 700 test |

Verification Architecture

The VisionVerifier combines three weighted verification stages:

  • Perception (weight 0.3): Object detection, OCR, spatial reasoning
  • Reasoning (weight 0.4): Structure, consistency, grounding
  • Output (weight 0.3): Exact match, fuzzy match, semantic similarity
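The weighted combination above can be sketched as a plain weighted sum. Only the 0.3/0.4/0.3 weights come from this documentation; the function name and the per-stage score interface are illustrative assumptions, not the VisionVerifier API.

```python
# Per-stage weights as documented: perception 0.3, reasoning 0.4, output 0.3.
STAGE_WEIGHTS = {"perception": 0.3, "reasoning": 0.4, "output": 0.3}

def combine_stage_scores(stage_scores: dict[str, float]) -> float:
    """Weighted sum of per-stage scores, each expected in [0, 1]."""
    return sum(STAGE_WEIGHTS[name] * score for name, score in stage_scores.items())
```

With all stages at 1.0 the combined score is 1.0; a response that nails perception but fails output checks lands somewhere in between.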

Audio-Language

Train audio models (ASR, TTS, Classification) using RLVR with task-specific verification.

| Component | Status | Description |
|---|---|---|
| AudioRAFTTrainer | Beta | RAFT for audio models |
| AudioVerifier | Beta | Multi-task verification |
| ASRChecker | Beta | Word Error Rate (WER) |

Quick Start

```bash
# Full pipeline
halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small --output models/audio_sft
halo-forge audio train --model models/audio_sft --dataset librispeech --task asr --cycles 4 --output models/audio_raft --seed 42
halo-forge audio benchmark --model models/audio_raft --dataset librispeech --limit 100

# Quick RAFT
halo-forge audio train --model openai/whisper-small --dataset librispeech --task asr --cycles 4 --seed 42
```

Supported Models

| Model | Task | Notes |
|---|---|---|
| openai/whisper-tiny | ASR | Fast, lightweight |
| openai/whisper-small | ASR | Recommended |
| openai/whisper-medium | ASR | Better accuracy |
| openai/whisper-large-v3 | ASR | Best quality |

Reward Structure (ASR)

| WER | Reward |
|---|---|
| 0% | 1.0 (perfect) |
| 10% | 0.9 |
| 30% | 0.7 |
| 50% | 0.5 |
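The table is consistent with a reward of 1 − WER, clamped to [0, 1]. A minimal sketch, assuming word-level Levenshtein distance for WER; the actual ASRChecker implementation may differ (e.g. in text normalization):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def asr_reward(reference: str, hypothesis: str) -> float:
    """Reward = 1 - WER, clamped to [0, 1], matching the table above."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Note WER can exceed 100% when the hypothesis is much longer than the reference, which is why the clamp matters.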

Reasoning and Math

Train models on mathematical reasoning with SymPy-based verification.

| Component | Status | Description |
|---|---|---|
| ReasoningRAFTTrainer | Beta | RAFT for reasoning |
| MathVerifier | Beta | SymPy symbolic evaluation |

Quick Start

```bash
# Full pipeline
halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct --output models/reasoning_sft
halo-forge reasoning train --model models/reasoning_sft --dataset gsm8k --cycles 4 --output models/reasoning_raft --seed 42
halo-forge reasoning benchmark --model models/reasoning_raft --dataset gsm8k --limit 100

# Quick RAFT
halo-forge reasoning train --model Qwen/Qwen2.5-7B-Instruct --dataset gsm8k --cycles 4 --seed 42
```

Datasets

| Dataset | Task | Description |
|---|---|---|
| GSM8K | Grade school math | Word problems |
| MATH | Competition math | Harder problems |
| MetaMathQA | SFT data | Large-scale training |

Reward Structure

| Outcome | Reward |
|---|---|
| Correct answer | 1.0 |
| Wrong answer + showed work | 0.2 |
| No answer + showed work | 0.2 |
| No answer, no work | 0.1 |

Agentic / Tool Calling

Train models for reliable function/tool calling with schema-aware verification.

| Component | Status | Description |
|---|---|---|
| AgenticRAFTTrainer | Beta | RAFT for tool calling |
| ToolCallingVerifier | Beta | JSON schema validation |
| HermesFormatter | Beta | Hermes chat format |

Quick Start

```bash
# Full pipeline
halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct --output models/agentic_sft
halo-forge agentic train --model models/agentic_sft --dataset xlam --cycles 5 --output models/agentic_raft --seed 42
halo-forge agentic benchmark --model models/agentic_raft --dataset xlam --limit 100

# Quick RAFT
halo-forge agentic train --model Qwen/Qwen2.5-7B-Instruct --dataset xlam --cycles 5 --seed 42
```

Reward Structure

| Outcome | Reward |
|---|---|
| Correct function + args | 1.0 |
| Correct function, wrong args | 0.5 |
| Valid JSON, wrong function | 0.25 |
| No tool call when expected | 0.0 |
| Tool call when none expected | -0.25 |

Data Format

halo-forge uses the Hermes tool-call format, the standard for Qwen2.5 and NousHermes models:

```
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call><|im_end|>
```
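Extracting the JSON payloads from a Hermes-formatted completion can be sketched with a regex. This is an illustrative parser, not halo-forge's actual implementation, and a production parser should handle nesting and malformed payloads more carefully:

```python
import json
import re

# Match the JSON payload between <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return every parseable tool-call object found in the completion."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than failing the parse
    return calls
```

Applied to the example above, this yields one call named get_weather with a Paris location argument.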

Inference Optimization

Optimize trained models for deployment with quantization and export.

| Feature | Description | Command |
|---|---|---|
| Quantization | INT4/INT8 reduction | halo-forge inference optimize |
| GGUF Export | For llama.cpp/Ollama | halo-forge inference export --format gguf |
| ONNX Export | Cross-platform | halo-forge inference export --format onnx |
| Benchmarking | Measure latency | halo-forge inference benchmark |

Quick Start

```bash
# Optimize
halo-forge inference optimize \
  --model models/raft/final \
  --target-precision int4 \
  --output models/optimized

# Export to GGUF
halo-forge inference export \
  --model models/optimized \
  --format gguf \
  --quantization Q4_K_M \
  --output model.gguf

# Benchmark
halo-forge inference benchmark --model models/optimized --num-prompts 20
```

Quantization Options

| Precision | Size Reduction | Quality |
|---|---|---|
| int4 | ~75% | Good |
| int8 | ~50% | Better |
| fp16 | ~50% | Best |
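The size figures follow from bits per weight. A back-of-the-envelope estimate of weight storage only; real quantized checkpoints also carry scales and zero-points, which add a few percent:

```python
BITS_PER_WEIGHT = {"int4": 4, "int8": 8, "fp16": 16, "fp32": 32}

def model_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight-storage size in GB (metadata overhead ignored)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9
```

For a 7B-parameter model this gives roughly 14 GB at fp16 versus 3.5 GB at int4, i.e. the ~75% reduction in the table.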

GGUF Quantization Types

| Type | Description |
|---|---|
| Q4_K_M | Recommended balance |
| Q4_K_S | Smaller, lower quality |
| Q8_0 | Highest quality 8-bit |
| F16 | No quantization |

Stability Levels

| Level | Meaning |
|---|---|
| Alpha | Early development, API may change significantly |
| Beta | Feature complete, API mostly stable |
| Stable | Production ready, in main documentation |

Feedback

If you encounter issues or have suggestions:

  1. Check the Troubleshooting guide
  2. Review the Changelog for recent changes
  3. Open an issue on GitHub with reproduction steps