# Changelog

All notable changes to halo-forge are documented in this file.
## [1.3.0] - 2026-01-21

### Added

- Web UI Verifier Integration - Verifier test page now calls real backend verifiers instead of returning hardcoded results
- Branding - Halo-forge favicon and sidebar logo integrated into web UI
- Static Asset Serving - UI properly serves static files from `ui/static/`
- SFT `--no-gradient-checkpointing` CLI flag - Control gradient checkpointing from UI and CLI
### Fixed

- SFT Dataset Routing - Local `.jsonl` files now correctly use the `--data` flag; HuggingFace IDs use `--dataset`
- RAFT Verifier Alignment - UI verifier choices now match CLI `--verifier` options exactly
- MBPP Verifier - Natural language prompts no longer cause syntax errors during execution

### Changed

- Removed unused RAFT learning rate UI field (CLI uses lr-decay schedule, not initial LR)
## [1.2.0] - 2026-01-10

### Added

#### Auto-Logging

- Automatic log capture - All training and benchmark commands now automatically log to `logs/` with timestamped filenames
- `--quiet` flag - Suppress terminal output while still writing to log file
- No more need for manual `tee` or `PYTHONUNBUFFERED=1`

#### New RAFT CLI Flags

- `--samples-per-prompt` - Control samples generated per prompt (default: 8)
- `--temperature` - Set generation temperature (default: 0.7)
- `--max-new-tokens` - Limit generation length (default: 1024)
- `--min-samples` - Auto-adjust threshold if too few samples pass filtering

#### Preset Config Files

- `configs/raft_conservative.yaml` - Safe training: 80% keep, slow LR decay, min 200 samples
- `configs/raft_aggressive.yaml` - Strict filtering: 30% keep, 16 samples/prompt, 0.8 temperature
- `configs/vlm_example.yaml` - VLM RAFT with perception/reasoning/output weights
- `configs/audio_example.yaml` - Audio RAFT for ASR/TTS
- `configs/reasoning_example.yaml` - Math/reasoning RAFT
#### Module Consistency

- Added missing flags to all domain modules (VLM, Audio, Reasoning, Agentic)
- Consistent `--samples-per-prompt`, `--temperature`, `--keep-percent`, `--reward-threshold` across all `train` commands

### Changed

- Added `humaneval`, `mbpp`, `python` to verifier choices in CLI
- Improved base model loading for LoRA checkpoints (reads from `adapter_config.json`)
- Fixed code extraction to strip input tokens from generated completions
## [1.1.0] - 2026-01-08

### Added

#### Unified SFT Pipeline

- `halo-forge sft train --dataset` - Load HuggingFace datasets directly
- `halo-forge sft datasets` - List all available SFT datasets
- Domain-specific SFT commands for all modules: `halo-forge vlm sft`, `halo-forge audio sft`, `halo-forge reasoning sft`, `halo-forge agentic sft`

#### SFT Datasets Module

- New `halo_forge/sft/datasets.py` with dataset registry
- Short name support (e.g., `codealpaca`, `metamath`, `llava`)
- Auto-formatting to ChatML for HuggingFace datasets
- `--max-samples` flag to limit dataset size
- `--dry-run` for validation
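The ChatML auto-formatting can be illustrated with a minimal sketch. Note this is an assumption-laden example: the record field names (`instruction`, `output`) and the default system prompt are hypothetical, not halo-forge's actual dataset schema.

```python
# Hypothetical sketch of ChatML auto-formatting. The field names
# ("instruction", "output") and default system prompt are assumptions,
# not halo-forge's actual schema.
def to_chatml(record: dict, system: str = "You are a helpful assistant.") -> str:
    """Render an instruction/response pair as a ChatML training string."""
    parts = [
        f"<|im_start|>system\n{system}<|im_end|>",
        f"<|im_start|>user\n{record['instruction']}<|im_end|>",
        f"<|im_start|>assistant\n{record['output']}<|im_end|>",
    ]
    return "\n".join(parts)
```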
#### Supported SFT Datasets
| Domain | Dataset | HuggingFace ID | Size |
|---|---|---|---|
| Code | codealpaca | sahil2801/CodeAlpaca-20k | 20K |
| Code | code_instructions_122k | TokenBender/code_instructions_122k | 122K |
| Reasoning | metamath | meta-math/MetaMathQA | 395K |
| Reasoning | gsm8k_sft | gsm8k | 8.5K |
| VLM | llava | liuhaotian/LLaVA-Instruct-150K | 150K |
| Audio | librispeech_sft | librispeech_asr | 100h |
| Agentic | xlam_sft | Salesforce/xlam-function-calling-60k | 60K |
| Agentic | glaive_sft | glaiveai/glaive-function-calling-v2 | 113K |
#### Agentic / Tool Calling Training (Phase 6)

- New `halo_forge/agentic/` module for tool calling RLVR training
- `AgenticRAFTTrainer` for RAFT training on function calling
- Hermes format support (Qwen2.5, NousHermes compatible)
- TensorBoard integration via MetricsTracker

#### Tool Calling Verifier

- `ToolCallingVerifier` with graduated reward structure
- JSON validation and schema compliance
- Function name and argument matching
- Irrelevance detection (penalizes false positives)
- Support for parallel and multi-turn tool calls
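A graduated reward for a single tool call might look like the following sketch. The specific reward levels, the Hermes-style `<tool_call>` tag format, and the comparison rules are illustrative assumptions, not the actual `ToolCallingVerifier` implementation.

```python
import json
import re

# Illustrative scorer only: the real ToolCallingVerifier's reward levels,
# tag format, and schema checks are assumptions here.
def score_tool_call(completion: str, expected: dict) -> float:
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", completion, re.S)
    if not m:
        return 0.0                      # no tool call emitted
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return 0.1                      # malformed JSON: tiny credit
    if call.get("name") != expected["name"]:
        return 0.3                      # valid JSON, wrong function
    if call.get("arguments") != expected["arguments"]:
        return 0.6                      # right function, wrong arguments
    return 1.0                          # exact match
```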
#### Tool Calling Dataset Loaders

- `XLAMLoader` - 60k verified samples, 3,673 APIs
- `GlaiveLoader` - 113k samples with irrelevance detection
- `HermesFormatter` for converting to standard format

#### CLI Commands

- `halo-forge agentic train` - Train tool calling with RAFT
- `halo-forge agentic benchmark` - Benchmark on tool calling
- `halo-forge agentic datasets` - List available datasets
### Improved
- Consistent SFT → RAFT → Benchmark pipeline for ALL modules
- Consistent CLI banner and colors across all modules
- MetricsTracker integration for TensorBoard logging
- 32 new unit tests for agentic module
## [1.0.0] - 2026-01-08

### Added

#### Audio Training (Phase 4)

- New `halo_forge/audio/` module for audio-language RLVR training
- `AudioRAFTTrainer` for RAFT training on audio models
- Multi-task verification: ASR, TTS, Audio Classification

#### Audio Verifiers

- `AudioVerifier` base class inheriting from core `Verifier`
- `ASRChecker` for speech-to-text with WER/CER metrics
- `TTSChecker` for text-to-speech quality (UTMOS-based)
- `AudioClassificationChecker` for sound event detection
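The WER metric that an ASR checker reports is conventionally word-level Levenshtein distance divided by reference length. The sketch below is that standard computation, not halo-forge's own code:

```python
# Minimal word error rate (WER) sketch: standard Levenshtein distance
# over words divided by reference length. Generic illustration, not
# halo-forge's ASRChecker implementation.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)
```

CER is the same computation over characters instead of words.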
#### Audio Model Adapters

- `WhisperAdapter` for OpenAI Whisper models
- `Wav2VecAdapter` for wav2vec2 models
- Automatic dtype handling and attention mask generation

#### Audio Dataset Loaders

- `LibriSpeechLoader` - 960h clean audiobook speech
- `CommonVoiceLoader` - Multilingual crowdsourced audio
- `AudioSetLoader` - 5M clips for classification
- `SpeechCommandsLoader` - Keyword spotting dataset

#### Math/Reasoning Training (Phase 5)

- New `halo_forge/reasoning/` module for mathematical reasoning
- `ReasoningRAFTTrainer` for reasoning task training
- SymPy-based answer verification

#### Reasoning Verifiers

- `ReasoningVerifier` base class inheriting from core `Verifier`
- `MathVerifier` with numeric and symbolic comparison
- `AnswerExtractor` for parsing answers from completions
- Support for `\boxed{}`, "The answer is", and numeric formats
- Partial credit for showing reasoning steps
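As an illustration of extraction for the formats listed above, here is a minimal sketch; the real `AnswerExtractor`'s precedence rules and regexes are assumptions made for this example.

```python
import re

# Hedged sketch of answer extraction for \boxed{}, "The answer is", and
# bare numerics. Precedence order and patterns are assumptions, not
# halo-forge's actual AnswerExtractor.
def extract_answer(completion):
    """Return the model's final answer as a string, or None."""
    m = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if m:
        return m.group(1).strip()
    m = re.search(r"[Tt]he answer is[:\s]*\$?(-?[\d.,/]+)", completion)
    if m:
        return m.group(1).strip().rstrip(".,")
    # Fall back to the last number mentioned anywhere in the completion.
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else None
```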
#### Reasoning Dataset Loaders

- `GSM8KLoader` - 8.5K grade school math problems
- `MATHLoader` - 12.5K competition math problems
- Support for difficulty levels and subject filtering

#### CLI Commands

- `halo-forge audio train` - Train audio models with RAFT
- `halo-forge audio benchmark` - Benchmark on audio datasets
- `halo-forge audio datasets` - List audio datasets
- `halo-forge reasoning train` - Train on math datasets
- `halo-forge reasoning benchmark` - Math benchmarking
- `halo-forge reasoning datasets` - List reasoning datasets

#### Architecture Improvements

- All verifiers now inherit from base `Verifier` class
- Consistent `verify() -> VerifyResult` interface across domains
- Unified `VerifyResult` dataclass with `success`, `reward`, `error`
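A minimal sketch of this unified interface, assuming plausible defaults; the actual field types, default values, and base-class signature in halo-forge may differ:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the unified interface described above. Field defaults, the
# verify() signature, and the toy subclass are assumptions for illustration.
@dataclass
class VerifyResult:
    success: bool
    reward: float
    error: Optional[str] = None

class Verifier:
    def verify(self, prompt: str, completion: str) -> VerifyResult:
        raise NotImplementedError

class LengthVerifier(Verifier):
    """Toy verifier: rewards non-empty completions, scaled by length."""
    def verify(self, prompt: str, completion: str) -> VerifyResult:
        if not completion.strip():
            return VerifyResult(False, 0.0, error="empty completion")
        return VerifyResult(True, min(1.0, len(completion) / 100))
```

Domain verifiers (audio, reasoning, VLM, agentic) would then only override `verify()`.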
### Changed

- Updated all containers to v1.0.0
- Removed `torchcodec` dependency (using torchaudio/librosa directly)
- Improved audio loading with graceful fallback to librosa
- Consistent CLI banner and colors across all commands

### Fixed

- CLI subcommand dispatch issue causing empty output
- Build script argument parsing for `--tag` option
- Whisper dtype mismatch causing float/half errors
- VLM preprocessor returning 4D tensors instead of 3D
## [0.5.0] - 2026-01-07

### Added

#### Vision-Language Model Training (Phase 3)

- New `halo_forge/vlm/` module for VLM RLVR training
- `VLMRAFTTrainer` for RAFT training on VLMs
- Multi-stage verification pipeline for VLM outputs

#### VLM Verifiers

- `VisionVerifier` combining perception, reasoning, and output verification
- `PerceptionChecker` with YOLOv8 object detection and EasyOCR
- `ReasoningChecker` for chain-of-thought validation
- `OutputChecker` for answer matching (exact, fuzzy, semantic)
- Specialized verifiers: `VQAVerifier`, `DocVQAVerifier`, `ChartQAVerifier`

#### VLM Model Adapters

- `QwenVLAdapter` for Qwen-VL and Qwen2-VL models
- `LLaVAAdapter` for LLaVA model family
- `GenericVLMAdapter` for other HuggingFace VLMs
- Auto-detection of appropriate adapter from model name

#### VLM Dataset Loaders

- `TextVQALoader` - Text reading in natural images
- `DocVQALoader` - Document understanding
- `ChartQALoader` - Chart interpretation
- `RealWorldQALoader` - Real-world reasoning
- `MathVistaLoader` - Mathematical reasoning with visuals
- Export to RLVR and SFT formats

#### Image Processing

- `VLMPreprocessor` for generic image preprocessing
- `QwenVLProcessor` for Qwen-VL models
- `LLaVAProcessor` for LLaVA models

#### CLI Commands

- `halo-forge vlm train` - Train VLM with RAFT
- `halo-forge vlm benchmark` - Benchmark VLM on datasets
- `halo-forge vlm datasets` - List available VLM datasets
### Changed
- Updated changelog with Phase 3 features
- Added VLM documentation pages to website
## [0.4.0] - 2026-01-06

### Added

#### Inference Optimization Mode

- New `halo_forge/inference/` module for model optimization
- `InferenceOptimizationVerifier` for quality verification
- `InferenceOptimizer` for end-to-end optimization pipeline
- `QATTrainer` for quantization-aware training

#### Model Export

- `GGUFExporter` for llama.cpp/Ollama deployment
- `ONNXExporter` for cross-platform inference
- Support for Q4_K_M, Q8_0, F16 quantization types

#### CLI Commands

- `halo-forge inference optimize` - Optimize for deployment
- `halo-forge inference export` - Export to GGUF/ONNX
- `halo-forge inference benchmark` - Measure latency

#### Calibration

- `CalibrationDataset` for calibration data handling
- Support for synthetic calibration data generation
### Changed
- Updated CLI reference with inference commands
- Added inference section to website documentation
## [0.3.0] - 2026-01-06

### Added

#### Learning Rate Decay

- `--lr-decay` flag for exponential LR decay across RAFT cycles (default: 0.85)
- `--min-lr` flag to set learning rate floor (default: 1e-6)
- Prevents training degradation at cycles 7-8
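Exponential decay with a floor reduces to one line; this sketch mirrors the defaults above (`--lr-decay 0.85`, `--min-lr 1e-6`) but is not halo-forge's actual implementation:

```python
# Sketch of exponential LR decay with a floor, matching the documented
# defaults (--lr-decay 0.85, --min-lr 1e-6). Illustration only.
def cycle_lr(initial_lr: float, cycle: int, decay: float = 0.85,
             min_lr: float = 1e-6) -> float:
    """Learning rate for RAFT cycle `cycle` (0-indexed)."""
    return max(initial_lr * decay ** cycle, min_lr)
```

By cycle 7 the rate has dropped to roughly a third of its initial value, which is what counteracts the late-cycle degradation noted above.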
#### Execution Verifier

- New `ExecutionVerifier` for test case-based verification
- Supports multiple test cases with input/output pairs
- Graduated rewards: 0.5 + 0.5 × pass_rate
- Match modes: exact, contains, regex, numeric
- Pre-configured variants: `GCCExecutionVerifier`, `ClangExecutionVerifier`, `MinGWExecutionVerifier`
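The reward formula and match modes above can be sketched as follows. The formula is taken from this changelog; everything else (the zero-reward case for no results, the numeric tolerance) is an assumption for illustration:

```python
import re

# Illustrative sketch of the documented 0.5 + 0.5 * pass_rate reward and
# the four match modes. Tolerances and edge-case handling are assumptions.
def match(actual: str, expected: str, mode: str = "exact") -> bool:
    if mode == "exact":
        return actual.strip() == expected.strip()
    if mode == "contains":
        return expected in actual
    if mode == "regex":
        return re.search(expected, actual) is not None
    if mode == "numeric":
        try:
            return abs(float(actual.strip()) - float(expected)) < 1e-6
        except ValueError:
            return False
    raise ValueError(f"unknown match mode: {mode}")

def graduated_reward(results: list) -> float:
    """0.0 with no test results; otherwise 0.5 base credit for a program
    that compiled and ran, plus 0.5 scaled by the test pass rate."""
    if not results:
        return 0.0
    return 0.5 + 0.5 * sum(results) / len(results)
```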
#### Multi-Language Support

- New `MultiLanguageVerifier` with auto-detection
- Detects: C++, C, Python, Rust, Go, C#, PowerShell
- Use `--verifier auto` for automatic language detection
- `AutoVerifier` alias for CLI convenience
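Auto-detection of this kind is typically a keyword heuristic. The sketch below is a toy version covering the languages listed above; `MultiLanguageVerifier`'s real detection rules are not documented here, so every marker string is an assumption:

```python
# Toy source-language detector in the spirit of MultiLanguageVerifier.
# All marker strings and the check order are assumptions for illustration;
# more specific languages are checked before generic ones.
def detect_language(code: str) -> str:
    markers = [
        ("cpp",        ["#include <iostream>", "std::"]),
        ("c",          ["#include <stdio.h>", "printf("]),
        ("rust",       ["fn main()", "println!"]),
        ("go",         ["package main", "func main()"]),
        ("csharp",     ["using System;", "static void Main"]),
        ("powershell", ["Write-Host", "param("]),
        ("python",     ["def ", "import "]),
    ]
    for lang, needles in markers:
        if any(n in code for n in needles):
            return lang
    return "unknown"
```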
#### New Verifiers

- `RustVerifier` with Windows cross-compilation support
- `GoVerifier` with Windows cross-compilation support
- `DotNetVerifier` for C# compilation to Windows PE
- `PowerShellVerifier` for script syntax validation

#### Metrics Tracking

- `MetricsTracker` with TensorBoard integration
- JSON logging for all cycle metrics
- `TrainingMonitor` for early stopping detection
- Automatic `metrics.jsonl` generation

#### Dataset Loaders

- `HumanEvalPlusLoader` - 80x more test cases per problem
- `LiveCodeBenchLoader` - Contamination-free benchmark

#### CLI Enhancements

- `halo-forge config validate` command
- `--system-prompt` flag for custom prompts
- MSVC verifier validation with helpful error messages
### Changed
- Default system prompt updated to “You are an expert Windows systems programmer”
- Improved PEFT adapter handling to prevent stacking
- Category tracking now supports root-level fields in datasets
### Fixed

- PEFT adapter stacking bug in `_reload_model()`
- "Unknown" category issue in benchmark results
- MSVC verifier parameter validation
## [0.2.0] - 2025-01-01

### Added

- `halo-forge test` command for pipeline validation
  - `--level smoke`: Quick imports/compiler check (no GPU)
  - `--level standard`: Model loading, generation, verification
  - `--level full`: Complete mini-RAFT cycle with training
- `halo-forge benchmark full` command for comprehensive benchmarks
- Graduated rewards (`RewardLevel`) for partial credit
- Runtime verification (`run_after_compile`) for compile verifiers
- Comprehensive verifier unit tests
- Chunked verification in RAFT trainer to prevent OOM
### Changed

- Optimized for BF16 (4-bit quantization removed from defaults)
- Updated all docs to reflect 128GB unified memory
- Improved error messages in verifiers
- SFT trainer now uses `device_map="auto"`
### Fixed
- Memory leak during RAFT verification
- Gradient checkpointing warning during benchmark training
## [0.1.0] - 2024-12-28

### Added
- Initial release
- Custom toolbox with ROCm 7 nightly for gfx1151
- Data generation module (public datasets + LLM generation)
- SFT training with LoRA/BF16 support
- RAFT training with pluggable verifiers
- Benchmarking with pass@k metrics
- Built-in verifiers: GCC, Clang, MinGW, MSVC, pytest, unittest
- CLI with subcommands
- Documentation
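For reference, pass@k benchmarks are conventionally computed with the unbiased estimator pass@k = 1 − C(n−c, k)/C(n, k), where n samples are drawn per problem and c of them pass. Whether halo-forge uses exactly this form is an assumption; the sketch below shows the standard computation:

```python
from math import comb

# Standard unbiased pass@k estimator (as popularized by the HumanEval
# benchmark). Whether halo-forge computes pass@k exactly this way is an
# assumption; this is the conventional formula.
def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given that
    c of the n generated samples passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```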