Experimental Features

Features under active development: VLM, Audio, Reasoning, Agentic, Inference

This section contains features that extend halo-forge beyond code generation training into new domains.

Status: These modalities support real training runs, but readiness is tracked by internal qualification tiers rather than broad “stable” claims.

Modality Training Capability Matrix

| Modality | Runtime Status | Qualification Tier | Gate | Supported Model Families |
|---|---|---|---|---|
| VLM | Real training path | production_ready only after deterministic train+eval qualification passes | Optional compatibility flag if temporarily re-gated | qwen2-vl, qwen-vl, llava |
| Audio | Real training path | production_ready only after deterministic train+eval qualification passes | Optional compatibility flag if temporarily re-gated | whisper |
| Reasoning | Real training path | production_ready only after deterministic train+eval qualification passes | Optional compatibility flag if temporarily re-gated | qwen2.5, qwen2, qwen, llama-3, llama3, mistral |
| Agentic | Real training path | production_ready only after deterministic train+eval qualification passes | Optional compatibility flag if temporarily re-gated | qwen2.5, qwen2, qwen, llama-3, llama3, mistral |

All modality train commands also support --resume-from-cycle and persist canonical artifacts: cycle_<n>/model, cycle_<n>/checkpoint_state.json, latest_checkpoint.json, final_model, and training_summary.json. If a modality is temporarily re-gated to prototype, add --allow-prototype-train as a compatibility override.

Readiness tiers used internally and surfaced in the UI:

  • experimental: launch path exists, but qualification evidence is incomplete
  • qualified: contract and artifact checks pass, but the full train+eval gate is still pending
  • production_ready: deterministic train+eval qualification passed, including resume/relaunch and baseline checks

Vision-Language (VLM)

Train vision-language models using RLVR with perception-aware verification.

| Component | Status | Description |
|---|---|---|
| VLMRAFTTrainer | Beta | RAFT training for VLMs |
| VisionVerifier | Beta | Multi-stage verification |
| PerceptionChecker | Beta | YOLOv8 + EasyOCR |

Quick Start

# Full pipeline
halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct --output models/vlm_sft
halo-forge vlm train --model models/vlm_sft --dataset textvqa --cycles 6 --output models/vlm_raft --seed 42
halo-forge vlm benchmark --model models/vlm_raft --dataset docvqa --limit 100

# Quick RAFT (skip SFT)
halo-forge vlm train --model Qwen/Qwen2-VL-7B-Instruct --dataset textvqa --cycles 6 --seed 42

Supported Models

| Model | Adapter | Notes |
|---|---|---|
| Qwen/Qwen2-VL-2B-Instruct | qwen_vl | Lightweight |
| Qwen/Qwen2-VL-7B-Instruct | qwen_vl | Recommended |
| llava-hf/llava-1.5-7b-hf | llava | Good baseline |

Datasets

| Dataset | Task | Size |
|---|---|---|
| TextVQA | Text reading in images | 45K train |
| DocVQA | Document understanding | 50K train |
| ChartQA | Chart interpretation | 28K train |
| RealWorldQA | Real-world reasoning | 700 test |

Verification Architecture

The VisionVerifier uses multi-stage verification:

  • Perception (0.3): Object detection, OCR, spatial reasoning
  • Reasoning (0.4): Structure, consistency, grounding
  • Output (0.3): Exact match, fuzzy match, semantic similarity
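The stage weights above combine into a single reward as a weighted sum. A sketch under that assumption (the function name and signature are illustrative, not the actual VisionVerifier API):

```python
# Stage weights from the list above; names and signature are illustrative.
STAGE_WEIGHTS = {"perception": 0.3, "reasoning": 0.4, "output": 0.3}

def combine_stage_scores(perception: float, reasoning: float, output: float) -> float:
    """Weighted sum of per-stage scores (each in [0, 1]) -> reward in [0, 1]."""
    return (STAGE_WEIGHTS["perception"] * perception
            + STAGE_WEIGHTS["reasoning"] * reasoning
            + STAGE_WEIGHTS["output"] * output)
```

Because the weights sum to 1.0, the combined reward stays in [0, 1] whenever each stage score does.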

Audio-Language

Train audio models (ASR, TTS, Classification) using RLVR with task-specific verification.

| Component | Status | Description |
|---|---|---|
| AudioRAFTTrainer | Beta | RAFT for audio models |
| AudioVerifier | Beta | Multi-task verification |
| ASRChecker | Beta | Word Error Rate (WER) |

Quick Start

# Full pipeline
halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small --output models/audio_sft
halo-forge audio train --model models/audio_sft --dataset librispeech --task asr --cycles 4 --seed 42
halo-forge audio benchmark --model models/audio_raft --dataset librispeech --limit 100

# Quick RAFT
halo-forge audio train --model openai/whisper-small --dataset librispeech --task asr --cycles 4 --seed 42

Supported Models

| Model | Task | Notes |
|---|---|---|
| openai/whisper-tiny | ASR | Fast, lightweight |
| openai/whisper-small | ASR | Recommended |
| openai/whisper-medium | ASR | Better accuracy |
| openai/whisper-large-v3 | ASR | Best quality |

Reward Structure (ASR)

| WER | Reward |
|---|---|
| 0% | 1.0 (perfect) |
| 10% | 0.9 |
| 30% | 0.7 |
| 50% | 0.5 |
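The table is consistent with a linear reward of 1 − WER. A sketch assuming word-level edit distance for WER and a clamp at zero for WER above 100% (the clamp is an assumption; the real ASRChecker internals are not shown here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1,        # deletion
                                       row[j - 1] + 1,    # insertion
                                       prev + (r != h))   # substitution
    return row[-1] / max(len(ref), 1)

def asr_reward(wer: float) -> float:
    """Linear mapping 1 - WER, matching the table (0% -> 1.0, 50% -> 0.5).
    Clamping at 0 for WER above 100% is an assumption."""
    return max(0.0, 1.0 - wer)
```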

Reasoning and Math

Train models on mathematical reasoning with SymPy-based verification.

| Component | Status | Description |
|---|---|---|
| ReasoningRAFTTrainer | Beta | RAFT for reasoning |
| MathVerifier | Beta | SymPy symbolic evaluation |

Quick Start

# Full pipeline
halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct --output models/reasoning_sft
halo-forge reasoning train --model models/reasoning_sft --dataset gsm8k --cycles 4 --seed 42
halo-forge reasoning benchmark --model models/reasoning_raft --dataset gsm8k --limit 100

# Quick RAFT
halo-forge reasoning train --model Qwen/Qwen2.5-7B-Instruct --dataset gsm8k --cycles 4 --seed 42

Datasets

| Dataset | Task | Description |
|---|---|---|
| GSM8K | Grade school math | Word problems |
| MATH | Competition math | Harder problems |
| MetaMathQA | SFT data | Large scale training |

Reward Structure

| Outcome | Reward |
|---|---|
| Correct answer | 1.0 |
| Wrong answer + showed work | 0.2 |
| No answer + showed work | 0.2 |
| No answer, no work | 0.1 |
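The outcome-to-reward mapping can be sketched as a small function. One case the table does not list (wrong answer with no work shown) is scored 0.1 here by assumption; the function itself is illustrative, not the trainer's actual code:

```python
def reasoning_reward(correct: bool, answered: bool, showed_work: bool) -> float:
    """Outcome -> reward mapping from the table above. Scoring the unlisted
    wrong-answer/no-work case as 0.1 is an assumption."""
    if answered and correct:
        return 1.0          # correct answer
    if showed_work:
        return 0.2          # wrong or missing answer, but visible work
    return 0.1              # no work shown
```

The small nonzero floor keeps the gradient signal from collapsing to zero on fully failed samples.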

Agentic / Tool Calling

Train models for reliable function/tool calling with schema-aware verification.

| Component | Status | Description |
|---|---|---|
| AgenticRAFTTrainer | Beta | RAFT for tool calling |
| ToolCallingVerifier | Beta | JSON schema validation |
| HermesFormatter | Beta | Hermes chat format |

Quick Start

# Full pipeline
halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct --output models/agentic_sft
halo-forge agentic train --model models/agentic_sft --dataset xlam --cycles 5 --seed 42
halo-forge agentic benchmark --model models/agentic_raft --dataset xlam --limit 100

# Quick RAFT
halo-forge agentic train --model Qwen/Qwen2.5-7B-Instruct --dataset xlam --cycles 5 --seed 42

Reward Structure

| Outcome | Reward |
|---|---|
| Correct function + args | 1.0 |
| Correct function, wrong args | 0.5 |
| Valid JSON, wrong function | 0.25 |
| No tool call when expected | 0.0 |
| Called when it shouldn't | -0.25 |

Data Format

halo-forge uses the Hermes tool-calling format, the standard used by Qwen2.5 and Nous Hermes models:

<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call><|im_end|>
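Extracting calls from that format comes down to pulling the JSON out of each `<tool_call>` block. A minimal sketch (function name is illustrative; a production parser would also handle nested or streaming output):

```python
import json
import re

# Non-greedy match over the JSON payload between the Hermes tool-call tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(assistant_text: str) -> list[dict]:
    """Pull JSON tool calls out of Hermes-formatted assistant output,
    skipping any block that fails to parse."""
    calls = []
    for blob in TOOL_CALL_RE.findall(assistant_text):
        try:
            calls.append(json.loads(blob))
        except json.JSONDecodeError:
            continue
    return calls
```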

Inference Optimization

Optimize trained models for deployment with quantization and export.

| Feature | Description | Command |
|---|---|---|
| Quantization | INT4/INT8 reduction | `halo-forge inference optimize` |
| GGUF Export | For llama.cpp/Ollama | `halo-forge inference export --format gguf` |
| ONNX Export | Cross-platform | `halo-forge inference export --format onnx` |
| Benchmarking | Measure latency | `halo-forge inference benchmark` |

Quick Start

# Optimize
halo-forge inference optimize \
  --model models/raft/final \
  --target-precision int4 \
  --output models/optimized

# Export to GGUF
halo-forge inference export \
  --model models/optimized \
  --format gguf \
  --quantization Q4_K_M \
  --output model.gguf

# Benchmark
halo-forge inference benchmark --model models/optimized --num-prompts 20

Quantization Options

| Precision | Size Reduction | Quality |
|---|---|---|
| int4 | ~75% | Good |
| int8 | ~50% | Better |
| fp16 | ~50% | Best |
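These reductions can be sanity-checked with back-of-envelope arithmetic, assuming the int4/int8 figures are measured against fp16 weights and the fp16 figure against fp32 (the table does not state its baselines; the function name below is illustrative):

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Back-of-envelope weight memory: params x bits / 8, in GB.
    Ignores quantization overhead (scales, zero-points), so real
    files run somewhat larger."""
    return num_params * bits_per_weight / 8 / 1e9
```

For a 7B-parameter model this gives about 14 GB at fp16, 7 GB at int8, and 3.5 GB at int4, matching the ~50% and ~75% reductions relative to fp16.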

GGUF Quantization Types

| Type | Description |
|---|---|
| Q4_K_M | Recommended balance |
| Q4_K_S | Smaller, lower quality |
| Q8_0 | Highest quality 8-bit |
| F16 | No quantization |

Readiness Levels

| Level | Meaning |
|---|---|
| experimental | Early or partial qualification evidence; useful for research iteration |
| qualified | Contract and artifact checks pass, but the full train+eval gate is not yet complete |
| production_ready | Deterministic train+eval qualification passed with baseline/tolerance checks |

Feedback

If you encounter issues or have suggestions:

  1. Check the Troubleshooting guide
  2. Review the Changelog for recent changes
  3. Open an issue on GitHub with reproduction steps