Command Index
Complete reference for all halo-forge commands, subcommands, and flags.
Command Hierarchy
halo-forge
├── config
│ └── validate
├── data
│ ├── prepare
│ ├── generate
│ └── validate
├── sft
│ ├── train
│ └── datasets
├── raft
│ └── train
├── benchmark
│ ├── run
│ ├── full
│ └── eval
├── inference [EXPERIMENTAL]
│ ├── optimize
│ ├── export
│ └── benchmark
├── vlm [EXPERIMENTAL]
│ ├── sft
│ ├── train
│ ├── benchmark
│ └── datasets
├── audio [EXPERIMENTAL]
│ ├── sft
│ ├── train
│ ├── benchmark
│ └── datasets
├── reasoning [EXPERIMENTAL]
│ ├── sft
│ ├── train
│ ├── benchmark
│ └── datasets
├── agentic [EXPERIMENTAL]
│ ├── sft
│ ├── train
│ ├── benchmark
│ └── datasets
├── ui
├── info
└── test
Global Flags
These flags work with all commands:
| Flag | Short | Type | Description |
|---|---|---|---|
--quiet | -q | flag | Suppress terminal output (logs still written to file) |
Auto-Logging
All training and benchmark commands automatically log output to logs/ with timestamped filenames:
logs/
├── raft_train_20260110_143052.log
├── sft_train_20260110_121500.log
└── benchmark_run_20260110_160000.log
No need for manual tee or PYTHONUNBUFFERED. Use --quiet to suppress terminal output while still capturing logs.
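Because the filenames embed a sortable `YYYYMMDD_HHMMSS` timestamp, the newest log for a command can be located without parsing dates. A minimal sketch (plain Python, not a halo-forge API):

```python
# Sketch: find the most recent auto-generated log for a command prefix.
# Assumes only the logs/ naming convention shown above.
from pathlib import Path
from typing import Optional

def latest_log(log_dir: str = "logs", prefix: str = "raft_train") -> Optional[Path]:
    """Return the most recent log file for a command prefix, or None."""
    candidates = sorted(Path(log_dir).glob(f"{prefix}_*.log"))
    # Timestamped names sort lexicographically in chronological order.
    return candidates[-1] if candidates else None
```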
UI Relaunch Context
The web UI now writes durable launch metadata for each training and benchmark run:
- File: <output_dir>/launch_context.json
- Purpose: enables Monitor/Results Rerun, Clone to Form, and training Resume Latest
- Resume scope: raft, vlm, audio, reasoning, agentic (checkpoint-based); benchmark supports rerun/clone only
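Tooling outside the UI can read the same metadata file. A sketch of loading it (the filename comes from this page; the payload schema is not documented here, so treat any field access as an assumption):

```python
# Sketch: load launch metadata for a run directory.
# Only the filename is documented above; the JSON schema is NOT,
# so callers should treat the returned dict's keys as unknown.
import json
from pathlib import Path

def load_launch_context(output_dir: str) -> dict:
    """Parse <output_dir>/launch_context.json, returning {} if absent."""
    path = Path(output_dir) / "launch_context.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text())
```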
Core Commands (Production Ready)
halo-forge config validate
Validate a configuration file.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
config | - | path | Yes | - | Path to config file |
--type | -t | string | No | auto | Config type: raft, sft, auto |
--verbose | -v | flag | No | false | Show config contents |
halo-forge config validate configs/raft_windows.yaml
halo-forge config validate configs/sft.yaml --type sft --verbose
halo-forge data prepare
Download and prepare public datasets.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--dataset | -d | string | No | - | Dataset name |
--output | -o | path | No | - | Output file path |
--template | - | string | No | qwen | Chat template |
--system-prompt | - | string | No | - | Override system prompt |
--list | - | flag | No | false | List available datasets |
halo-forge data prepare --list
halo-forge data prepare --dataset humaneval --output data/humaneval.jsonl
halo-forge data prepare --dataset mbpp --template qwen --system-prompt "You are a Python expert."
halo-forge data generate
Generate training data using an LLM.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--topic | -t | string | No | - | Topic name |
--backend | -b | string | No | deepseek | LLM backend |
--model | - | string | No | - | Model name for backend |
--output | -o | path | No | - | Output file path |
--template | - | string | No | qwen | Chat template |
--list | - | flag | No | false | List available topics |
halo-forge data generate --list
halo-forge data generate --topic windows_api --backend deepseek --output data/windows.jsonl
halo-forge data validate
Validate dataset format.
This command is dependency-light by design: it validates local JSONL structure without requiring public dataset download dependencies.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
file | - | path | Yes | - | Path to JSONL file |
--preview | -p | flag | No | false | Show preview of examples |
halo-forge data validate data/training.jsonl
halo-forge data validate data/training.jsonl --preview
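The kind of dependency-light structural check described above can be sketched as follows; the exact rules `halo-forge data validate` enforces are an assumption here, this only illustrates per-line JSONL validation:

```python
# Sketch of a dependency-light JSONL structure check. The actual rules
# applied by `halo-forge data validate` are assumed, not documented here.
import json

def check_jsonl_lines(lines):
    """Return (ok_count, errors) for an iterable of JSONL lines."""
    ok, errors = 0, []
    for i, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            errors.append(f"line {i}: expected a JSON object")
            continue
        ok += 1
    return ok, errors
```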
halo-forge sft train
Run supervised fine-tuning.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--config | -c | path | No | - | Config file path |
--model | -m | string | No | Qwen/Qwen2.5-Coder-7B | Base model |
--dataset | -d | string | No | - | HuggingFace dataset ID or short name |
--data | - | path | No | - | Local training data file (JSONL) |
--max-samples | - | int | No | - | Limit number of training samples |
--output | -o | path | No | models/sft | Output directory |
--epochs | - | int | No | 3 | Number of epochs |
--resume | - | path | No | - | Resume from checkpoint |
--dry-run | - | flag | No | false | Validate config without training |
Dataset short names: codealpaca, metamath, gsm8k_sft, llava, librispeech_sft, xlam_sft, glaive_sft
# Using HuggingFace dataset
halo-forge sft train --dataset codealpaca --model Qwen/Qwen2.5-Coder-3B --output models/sft_3b
# Using local data
halo-forge sft train --data data/sft.jsonl --model Qwen/Qwen2.5-Coder-3B --output models/sft_3b
# With sample limit
halo-forge sft train --dataset metamath --max-samples 50000 --epochs 2
# Resume from checkpoint
halo-forge sft train --config configs/sft.yaml --resume models/sft/checkpoint-500
halo-forge sft datasets
List available SFT datasets.
halo-forge sft datasets
Output shows datasets organized by domain (Code, Reasoning, VLM, Audio, Agentic) with HuggingFace IDs and sizes.
halo-forge raft train
Run RAFT (Reward-Ranked Fine-Tuning).
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--config | -c | path | No | - | Config file path |
--model | -m | string | No | Qwen/Qwen2.5-Coder-3B | Base model |
--checkpoint | - | path | No | - | SFT checkpoint path |
--prompts | -p | path | No | - | Prompts file |
--output | -o | path | No | models/raft | Output directory |
--cycles | - | int | No | 6 | Number of RAFT cycles |
--verifier | - | string | No | gcc | Verifier type (see below) |
--samples-per-prompt | - | int | No | 8 | Samples to generate per prompt |
--temperature | - | float | No | 0.7 | Generation temperature |
--max-new-tokens | - | int | No | 1024 | Max tokens to generate |
--keep-percent | - | float | No | 0.5 | Keep top X% of passing samples |
--reward-threshold | - | float | No | 0.5 | Minimum reward to pass |
--min-samples | - | int | No | - | Auto-adjust threshold if fewer pass |
--curriculum | - | string | No | none | Curriculum strategy |
--curriculum-stats | - | path | No | - | Historical stats file (for historical curriculum) |
--curriculum-start | - | float | No | 0.2 | Start fraction (for progressive curriculum) |
--curriculum-increment | - | float | No | 0.2 | Increment per cycle (for progressive curriculum) |
--reward-shaping | - | string | No | fixed | Reward shaping strategy |
--lr-decay | - | float | No | 0.85 | LR decay per cycle |
--min-lr | - | float | No | 1e-6 | Minimum learning rate |
--experimental-attention | - | flag | No | false | Enable experimental ROCm attention |
--system-prompt | - | string | No | (Windows prompt) | System prompt |
--host | - | string | No | - | MSVC verifier host |
--user | - | string | No | - | MSVC verifier user |
--ssh-key | - | path | No | - | MSVC verifier SSH key |
Verifier choices: gcc, mingw, msvc, rust, go, dotnet, powershell, humaneval, mbpp, python, auto
Curriculum choices: none, complexity, progressive, adaptive, historical
Reward shaping choices: fixed, annealing, adaptive, warmup
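The selection and scheduling behavior the --reward-threshold, --keep-percent, --lr-decay, and --min-lr flags control can be read as follows. This is an illustrative sketch of that reading, not halo-forge's internal implementation:

```python
# Illustrative sketch of RAFT-style sample selection and the per-cycle
# learning-rate schedule implied by the flags above; NOT halo-forge's code.

def select_samples(samples, reward_threshold=0.5, keep_percent=0.5):
    """Keep the top keep_percent of samples whose reward passes the threshold."""
    passing = [s for s in samples if s["reward"] >= reward_threshold]
    passing.sort(key=lambda s: s["reward"], reverse=True)
    n_keep = max(1, int(len(passing) * keep_percent)) if passing else 0
    return passing[:n_keep]

def cycle_lr(base_lr, cycle, lr_decay=0.85, min_lr=1e-6):
    """Learning rate for a given 0-indexed RAFT cycle, clamped at min_lr."""
    return max(base_lr * (lr_decay ** cycle), min_lr)
```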
# Basic RAFT training
halo-forge raft train \
--model Qwen/Qwen2.5-Coder-3B \
--prompts data/prompts.jsonl \
--verifier mingw \
--cycles 6 \
--output models/raft_3b
# With SFT checkpoint and LR decay
halo-forge raft train \
--checkpoint models/sft_3b/final \
--prompts data/prompts.jsonl \
--verifier auto \
--lr-decay 0.85 \
--cycles 6
# With MSVC verifier
halo-forge raft train \
--prompts data/windows.jsonl \
--verifier msvc \
--host 10.0.0.152 \
--user keys \
--ssh-key ~/.ssh/win
halo-forge benchmark run
Run pass@k benchmark.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | Model path |
--prompts | -p | path | Yes | - | Prompts file |
--output | -o | path | No | - | Output file path |
--samples | - | int | No | 10 | Samples per prompt |
--k | - | string | No | 1,5,10 | k values (comma-separated) |
--max-prompts | - | int | No | - | Max prompts to evaluate |
--verifier | - | string | No | gcc | Verifier type (gcc, humaneval, mbpp, etc.) |
--base-model | - | string | No | Qwen/Qwen2.5-Coder-7B | Base model |
--system-prompt | - | string | No | (Windows prompt) | System prompt |
--host | - | string | No | - | MSVC host |
--user | - | string | No | - | MSVC user |
--ssh-key | - | path | No | - | MSVC SSH key |
--cross-compile | - | flag | No | false | Windows cross-compile (rust/go) |
--run-after-compile | - | flag | No | false | Run after compile |
--experimental-attention | - | flag | No | false | Enable experimental ROCm attention |
halo-forge benchmark run \
--model models/raft_3b/cycle_6 \
--prompts data/test.jsonl \
--verifier mingw \
--samples 10 \
--output results/benchmark.json
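pass@k is commonly computed with the unbiased estimator from the HumanEval paper (n samples per prompt, c of them passing); whether halo-forge uses exactly this estimator is an assumption:

```python
# Unbiased pass@k estimator (Chen et al., "Evaluating Large Language
# Models Trained on Code"). Whether halo-forge uses exactly this
# formula is an assumption.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes,
    given c of n generated samples passed verification."""
    if n - c < k:
        return 1.0  # cannot draw k samples without including a passing one
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1.0 - k / i
    return 1.0 - result
```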
halo-forge benchmark full
Run comprehensive RAFT benchmark.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No* | - | Model to benchmark |
--suite | -s | string | No* | - | Predefined suite |
--cycles | -c | int | No | 2 | RAFT cycles |
--output | -o | path | No | results/benchmarks | Output directory |
--quiet | -q | flag | No | false | Minimal output |
*Either --model or --suite is required.
Suite choices: all (0.5B, 1.5B, 3B), small (0.5B), medium (0.5B, 1.5B)
halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2
halo-forge benchmark full --suite all --output results/full_benchmark
halo-forge benchmark eval
Evaluate a model on standard benchmarks (HumanEval, MBPP, LiveCodeBench, etc.).
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | Yes | - | Model name or path |
--benchmark | -b | string | No | humaneval | Benchmark name |
--limit | - | int | No | - | Max samples to evaluate |
--output | -o | path | No | - | Output file path |
--samples-per-prompt | - | int | No | 5 | Samples per prompt for pass@k |
--run-after-compile | - | flag | No | false | Run compiled code |
--language | - | string | No | - | Language (cpp, rust, go) |
--verifier | - | string | No | - | Verifier type |
Benchmark choices: humaneval, mbpp, livecodebench, cpp, rust, go
halo-forge benchmark eval --model models/raft/final --benchmark humaneval --limit 164
halo-forge benchmark eval --model models/raft/final --benchmark cpp --language cpp --run-after-compile
halo-forge info
Show hardware and system information.
halo-forge info
halo-forge ui
Launch the web UI.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--host | - | string | No | 127.0.0.1 | Host to bind |
--port | -p | int | No | 8080 | Port to bind |
--reload | - | flag | No | false | Enable hot reload |
--open-browser | - | flag | No | false | Auto-open browser after startup |
--no-browser | - | flag | No | false | Disable browser auto-open (this is the default behavior) |
halo-forge ui --no-browser
halo-forge ui --open-browser
halo-forge ui --host 0.0.0.0 --port 8080
Common deep-link routes for operator workflows:
http://127.0.0.1:8080/training?mode=audio&ui_mode=quickstart&preset=audio_whisper_tiny
http://127.0.0.1:8080/benchmark?view=non_code&ui_mode=quickstart&preset=non_code_smoke
http://127.0.0.1:8080/inference?mode=optimize&ui_mode=quickstart&preset=optimize_int4_smoke
http://127.0.0.1:8080/ops-console?module=data&execution_mode=live
halo-forge test
Run pipeline validation tests.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--level | -l | string | No | standard | Test level |
--model | -m | string | No | Qwen/Qwen2.5-Coder-0.5B | Model for testing |
--verbose | -v | flag | No | false | Verbose output |
--baseline-file | - | path | No | tests/baselines/modality_runtime_baseline.v1.json | Baseline JSON path (modality / ops-burnin / all-module-qualification) |
--write-baseline | - | flag | No | false | Write/overwrite baseline (modality / ops-burnin / all-module-qualification) |
--compare-baseline | - | flag | No | false | Compare run to baseline and fail on hard drift |
--report-file | - | path | No | results/readiness/ops_e2e_launch_reliability.v1.json | Report output path (ops-e2e / ops-burnin / all-modules / walkthroughs / all-module-qualification / all-module-bootstrap / all-module-live) |
--strict | - | flag | No | false | Fail on status=fail modules (ops-e2e / ops-burnin / all-modules / walkthroughs / all-module-qualification / all-module-bootstrap / all-module-live) |
--seed | - | int | No | 42 | Deterministic seed (ops-e2e / ops-burnin / all-modules / walkthroughs / all-module-qualification / all-module-bootstrap / all-module-live) |
--fixture-pack | - | string | No | - | Fixture pack (v1) or custom path (ops-e2e / all-modules / all-module-qualification level) |
--burnin-profile | - | string | No | tiny-v1 | Dataset-backed burn-in profile (ops-burnin level) |
--profile | - | string | No | bounded-v1 | Readiness profile (all-modules level) or walkthrough profile (contract-v1/live-local) |
--qualification-profile | - | string | No | contract-v1 | Qualification profile (contract-v1 / fixture-v1 / live-local) for all-module-qualification |
--bootstrap-profile | - | string | No | contract-v1 | Bootstrap profile (contract-v1 / live-local) for all-module-bootstrap |
--live-profile | - | string | No | live-smoke-v1 | Live execution profile (live-smoke-v1 / live-local) for all-module-live |
--output-root | - | path | No | results/bootstrap | Evidence output root for all-module-bootstrap or all-module-live |
--module | - | string (repeatable) | No | - | Filter module(s) for all-modules, walkthroughs, all-module-qualification, all-module-bootstrap, or all-module-live |
--execute | - | flag | No | false | Execute bounded probes for walkthroughs when using --profile live-local |
--show-fix-commands | - | flag | No | false | Emit ALL_QUAL_FIX remediation lines for all-module-qualification |
Level choices:
- smoke (no GPU)
- standard (with GPU)
- full (with training)
- modality (deterministic modality fixture + smoke suite)
- ops-e2e (non-code launch lifecycle reliability)
- ops-burnin (bounded dataset-backed non-code burn-in)
- all-modules (coding + non-coding readiness checks)
- walkthroughs (internal/operator E2E walkthrough contract validation)
- all-module-qualification (explicit bounded lifecycle qualification orchestration)
- all-module-bootstrap (bounded evidence generation/remediation for all-module readiness)
- all-module-live (bounded live execution closure probes across all modules)
Baseline drift checks validate runtime contract stability, not model-quality promotion thresholds.
halo-forge test --level smoke
halo-forge test --level standard --verbose
halo-forge test --level full --model Qwen/Qwen2.5-Coder-1.5B
halo-forge test --level modality
halo-forge test --level modality --compare-baseline
halo-forge test --level modality --write-baseline
halo-forge test --level ops-e2e --fixture-pack v1 --report-file results/readiness/ops_e2e_launch_reliability.v1.json
halo-forge test --level ops-e2e --fixture-pack v1 --strict
halo-forge test --level ops-burnin --burnin-profile tiny-v1 --report-file results/readiness/ops_dataset_burnin.v1.json
halo-forge test --level ops-burnin --burnin-profile tiny-v1 --compare-baseline --strict
halo-forge test --level all-modules --fixture-pack v1 --report-file results/readiness/all_modules_readiness.v1.json
halo-forge test --level all-modules --fixture-pack v1 --strict
halo-forge test --level all-modules --module sft --module raft --strict
halo-forge test --level walkthroughs --profile contract-v1 --report-file .internal_docs/research_testing/walkthroughs/reports/all_module_e2e_walkthrough_report.v1.json
halo-forge test --level walkthroughs --module sft --module raft --profile live-local --execute
halo-forge test --level all-module-qualification --qualification-profile contract-v1 --report-file results/readiness/all_module_qualification.v1.json
halo-forge test --level all-module-qualification --qualification-profile fixture-v1 --fixture-pack v1 --compare-baseline --baseline-file tests/baselines/all_module_qualification_baseline.v1.json --strict
halo-forge test --level all-module-qualification --show-fix-commands
halo-forge test --level all-module-bootstrap --bootstrap-profile contract-v1 --report-file results/readiness/all_module_bootstrap.v1.json
halo-forge test --level all-module-bootstrap --bootstrap-profile live-local --module inference --strict
halo-forge test --level all-module-live --live-profile live-smoke-v1 --report-file results/readiness/all_module_live_execution.v1.json
halo-forge test --level all-module-live --live-profile live-local --module inference --strict
Equivalent script entrypoints:
python3 scripts/run_all_module_qualification.py \
--qualification-profile fixture-v1 \
--fixture-pack v1 \
--write-report \
--show-fix-commands \
--report-file results/readiness/all_module_qualification.v1.json
python3 scripts/run_all_module_bootstrap.py \
--bootstrap-profile contract-v1 \
--write-report \
--report-file results/readiness/all_module_bootstrap.v1.json
python3 scripts/run_all_module_live_matrix.py \
--live-profile live-smoke-v1 \
--write-report \
--report-file results/readiness/all_module_live_execution.v1.json
Non-code modality UI readiness reports (contract-only) can be generated with:
python3 scripts/run_non_code_modality_matrix.py \
--validate-training vlm=models/phase7d/vlm_phase7d \
--validate-training audio=models/phase7d/audio_phase7d \
--validate-training reasoning=models/phase7d/reasoning_phase7d \
--validate-training agentic=models/phase7d/agentic_phase7d \
--write-readiness-report \
--readiness-from-validation \
--readiness-report-file results/readiness/non_code_modalities_readiness.v1.json
status=pass|warn|fail reflects contract-based runtime readiness, not model-quality promotion.
Cross-module ops readiness reports (non-coding scope) can be generated with:
python3 scripts/run_ops_module_matrix.py \
--fixture-pack v1 \
--write-report \
--report-file results/readiness/ops_modules_readiness.v1.json
Strict fixture-backed gate (used in nightly CI):
python3 scripts/run_ops_module_matrix.py --fixture-pack v1 --strict
Ops E2E launch lifecycle reliability (non-coding scope):
python3 scripts/run_ops_e2e_reliability.py \
--fixture-pack v1 \
--write-report \
--report-file results/readiness/ops_e2e_launch_reliability.v1.json
Strict nightly E2E gate:
python3 scripts/run_ops_e2e_reliability.py --fixture-pack v1 --strict
Dataset-backed non-code burn-in report:
python3 scripts/run_ops_dataset_burnin.py \
--burnin-profile tiny-v1 \
--write-report \
--report-file results/readiness/ops_dataset_burnin.v1.json
Strict nightly burn-in + baseline drift gate:
python3 scripts/run_ops_dataset_burnin.py \
--burnin-profile tiny-v1 \
--strict \
--compare-baseline \
--baseline-file tests/baselines/ops_dataset_burnin_baseline.v1.json
All-module parity readiness (coding + non-coding):
python3 scripts/run_all_module_matrix.py \
--fixture-pack v1 \
--write-report \
--report-file results/readiness/all_modules_readiness.v1.json
Strict nightly all-module gate:
python3 scripts/run_all_module_matrix.py --fixture-pack v1 --strict
CI policy:
- PR/push CI uses non-strict report generation (informational for readiness status).
- Nightly CI uses strict mode and fails on module status=fail plus hard contract drift (readiness + qualification + bootstrap + live execution + E2E + dataset burn-in reports).
Readiness interpretation:
WARN commonly indicates missing historical evidence (for example a prior training_summary.json or benchmark outputs) and remains non-blocking for UI launches. FAIL indicates a contract/preflight issue; check launch_blocked, issue_code, severity, what_is_missing, and fix_now in readiness/qualification payload entries.
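A triage pass over those payload entries can be sketched like this. The field names (launch_blocked, issue_code, fix_now) come from the interpretation note above; the surrounding payload shape is an assumption:

```python
# Sketch: surface blocking entries from a readiness/qualification payload.
# Field names are from the docs above; the list-of-dicts shape is ASSUMED.

def blocking_issues(entries):
    """Return (issue_code, fix_now) pairs for entries that block launch."""
    return [
        (e.get("issue_code"), e.get("fix_now"))
        for e in entries
        if e.get("status") == "fail" or e.get("launch_blocked")
    ]
```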
Experimental Commands
These commands are in active development. APIs may change.
halo-forge inference optimize
Optimize model for deployment.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | Model path |
--target-precision | - | string | No | int4 | Target precision |
--target-latency | - | float | No | 50.0 | Target latency (ms) |
--calibration-data | - | path | No | - | Calibration data JSONL |
--output | -o | path | No | models/optimized | Output directory |
Precision choices: int4, int8, fp16
halo-forge inference optimize \
--model models/raft_7b/cycle_6 \
--target-precision int4 \
--output models/optimized
halo-forge inference export
Export model to deployment format.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | Model path |
--format | -f | string | Yes | - | Export format |
--quantization | -q | string | No | Q4_K_M | GGUF quantization |
--output | -o | path | Yes | - | Output path |
Format choices: gguf, onnx
GGUF quantization types: Q4_K_M, Q4_K_S, Q8_0, F16
halo-forge inference export \
--model models/trained \
--format gguf \
--quantization Q4_K_M \
--output models/model.gguf
halo-forge inference benchmark
Benchmark inference latency.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | Model path |
--prompts | -p | path | No | - | Test prompts JSONL |
--num-prompts | - | int | No | 10 | Number of prompts |
--max-tokens | - | int | No | 100 | Max tokens to generate |
--warmup | - | int | No | 3 | Warmup iterations |
--measure-memory | - | flag | No | false | Measure memory usage |
halo-forge inference benchmark \
--model models/optimized \
--num-prompts 50 \
--measure-memory
halo-forge vlm sft
SFT training for VLM.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2-VL-2B-Instruct | VLM model |
--dataset | -d | string | No | llava | SFT dataset name |
--max-samples | - | int | No | - | Limit training samples |
--output | -o | path | No | models/vlm_sft | Output directory |
--epochs | - | int | No | 2 | Number of epochs |
--dry-run | - | flag | No | false | Validate config only |
halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct --output models/vlm_sft
halo-forge vlm train
Train VLM with RAFT.
Real-training enabled. Use --allow-prototype-train only if a modality is temporarily re-gated.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2-VL-7B-Instruct | VLM model |
--dataset | -d | string | Yes | - | Dataset name or JSONL |
--output | -o | path | No | models/vlm_raft | Output directory |
--cycles | - | int | No | 6 | RAFT cycles |
--samples-per-prompt | - | int | No | 4 | Samples per prompt |
--perception-weight | - | float | No | 0.3 | Perception weight |
--reasoning-weight | - | float | No | 0.4 | Reasoning weight |
--output-weight | - | float | No | 0.3 | Output weight |
--lr-decay | - | float | No | 0.85 | LR decay per cycle |
--temperature | - | float | No | 0.7 | Generation temperature |
--limit | - | int | No | - | Limit dataset samples |
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating |
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index |
--seed | - | int | No | 42 | Deterministic runtime seed |
Dataset choices: textvqa, docvqa, chartqa, realworldqa, mathvista
Supported model families: qwen2-vl, qwen-vl, llava
halo-forge vlm train \
--model Qwen/Qwen2-VL-7B-Instruct \
--dataset textvqa \
--cycles 6 \
--output models/vlm_textvqa \
--seed 42
halo-forge vlm benchmark
Benchmark VLM on dataset.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | VLM model path |
--dataset | -d | string | No | textvqa | Dataset name |
--split | - | string | No | validation | Dataset split |
--limit | - | int | No | 100 | Limit samples |
--output | -o | path | No | - | Output file |
halo-forge vlm benchmark \
--model models/vlm_raft/cycle_6 \
--dataset docvqa \
--limit 200 \
--output results/vlm_benchmark.json
halo-forge vlm datasets
List available VLM datasets.
halo-forge vlm datasets
halo-forge audio sft
SFT training for audio models.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | openai/whisper-small | Audio model |
--dataset | -d | string | No | librispeech_sft | SFT dataset name |
--max-samples | - | int | No | - | Limit training samples |
--output | -o | path | No | models/audio_sft | Output directory |
--epochs | - | int | No | 3 | Number of epochs |
--dry-run | - | flag | No | false | Validate config only |
halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small --output models/audio_sft
halo-forge audio train
Train audio model with RAFT.
Real-training enabled. Use --allow-prototype-train only if a modality is temporarily re-gated.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | openai/whisper-small | Audio model |
--dataset | -d | string | Yes | - | Dataset name |
--task | - | string | No | asr | Task: asr, tts, classification |
--output | -o | path | No | models/audio_raft | Output directory |
--cycles | - | int | No | 4 | RAFT cycles |
--lr | - | float | No | 1e-5 | Learning rate |
--lr-decay | - | float | No | 0.85 | LR decay per cycle |
--limit | - | int | No | - | Limit dataset samples |
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating |
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index |
--seed | - | int | No | 42 | Deterministic runtime seed |
Dataset choices: librispeech, common_voice, audioset, speech_commands
Supported model families: whisper
halo-forge audio train \
--model openai/whisper-small \
--dataset librispeech \
--task asr \
--cycles 4 \
--output models/audio_asr \
--seed 42
halo-forge audio benchmark
Benchmark audio model on dataset.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | Audio model path |
--dataset | -d | string | No | librispeech | Dataset name |
--task | - | string | No | asr | Task type |
--limit | - | int | No | 100 | Limit samples |
--output | -o | path | No | - | Output file |
halo-forge audio benchmark \
--model openai/whisper-small \
--dataset librispeech \
--limit 50 \
--output results/audio_benchmark.json
halo-forge audio datasets
List available audio datasets.
halo-forge audio datasets
halo-forge reasoning sft
SFT training for reasoning models.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2.5-3B-Instruct | Base model |
--dataset | -d | string | No | metamath | SFT dataset name |
--max-samples | - | int | No | - | Limit training samples |
--output | -o | path | No | models/reasoning_sft | Output directory |
--epochs | - | int | No | 2 | Number of epochs |
--dry-run | - | flag | No | false | Validate config only |
SFT Dataset choices: metamath, gsm8k_sft
halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct --output models/reasoning_sft
halo-forge reasoning train
Train on math/reasoning datasets with RAFT.
Real-training enabled. Use --allow-prototype-train only if a modality is temporarily re-gated.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Base model |
--dataset | -d | string | Yes | - | Dataset name |
--output | -o | path | No | models/reasoning_raft | Output directory |
--cycles | - | int | No | 4 | RAFT cycles |
--lr | - | float | No | 1e-5 | Learning rate |
--lr-decay | - | float | No | 0.85 | LR decay per cycle |
--limit | - | int | No | - | Limit dataset samples |
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating |
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index |
--seed | - | int | No | 42 | Deterministic runtime seed |
Supported model families: qwen2.5, qwen2, qwen, llama-3, llama3, mistral
RAFT Dataset choices: gsm8k, math, aime
halo-forge reasoning train \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset gsm8k \
--cycles 4 \
--output models/reasoning_gsm8k \
--seed 42
halo-forge reasoning benchmark
Benchmark on math/reasoning dataset.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | path | Yes | - | Model path |
--dataset | -d | string | No | gsm8k | Dataset name |
--limit | - | int | No | 100 | Limit samples |
--output | -o | path | No | - | Output file |
halo-forge reasoning benchmark \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset gsm8k \
--limit 100 \
--output results/reasoning_benchmark.json
halo-forge reasoning datasets
List available math/reasoning datasets.
halo-forge reasoning datasets
Agentic Commands (Experimental)
halo-forge agentic sft
SFT training for tool calling models.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Base model |
--dataset | -d | string | No | xlam_sft | SFT dataset name |
--max-samples | - | int | No | - | Limit training samples |
--output | -o | path | No | models/agentic_sft | Output directory |
--epochs | - | int | No | 2 | Number of epochs |
--dry-run | - | flag | No | false | Validate config only |
SFT Dataset choices: xlam_sft, glaive_sft
halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct --output models/agentic_sft
halo-forge agentic train
Train on tool calling datasets with RAFT.
Real-training enabled. Use --allow-prototype-train only if a modality is temporarily re-gated.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Base model |
--dataset | -d | string | No | xlam | RAFT Dataset: xlam, glaive |
--output | -o | path | No | models/agentic_raft | Output directory |
--cycles | - | int | No | 5 | RAFT cycles |
--lr | - | float | No | 5e-5 | Learning rate |
--lr-decay | - | float | No | 0.85 | LR decay per cycle |
--limit | - | int | No | - | Limit dataset samples |
--dry-run | - | flag | No | false | Validate config only |
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating |
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index |
--seed | - | int | No | 42 | Deterministic runtime seed |
Supported model families: qwen2.5, qwen2, qwen, llama-3, llama3, mistral
halo-forge agentic train \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset xlam \
--cycles 5 \
--output models/agentic_raft \
--seed 42
halo-forge agentic benchmark
Benchmark tool calling accuracy.
| Flag | Short | Type | Required | Default | Description |
|---|---|---|---|---|---|
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Model to benchmark |
--dataset | -d | string | No | xlam | Dataset: xlam, glaive |
--limit | - | int | No | 100 | Limit samples |
--output | -o | path | No | - | Output file |
halo-forge agentic benchmark \
--model Qwen/Qwen2.5-7B-Instruct \
--dataset xlam \
--limit 100 \
--output results/agentic_benchmark.json
halo-forge agentic datasets
List available tool calling datasets.
halo-forge agentic datasets
Output:
Available Agentic / Tool Calling Datasets
============================================================
xlam [Tool Calling] - 60k verified, 3,673 APIs
glaive [Tool Calling] - 113k samples, irrelevance
toolbench [Tool Calling] - 188k samples, 16k APIs
hermes [Tool Calling] - Format reference
Exit Codes
| Code | Description |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | Configuration error |
| 4 | GPU not available |
| 5 | Verification failed |
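In automation, the table above can be turned into a lookup so wrapper scripts report failures by name. A minimal sketch using the documented codes:

```python
# Sketch: map halo-forge exit codes (from the table above) to messages
# for use in wrapper scripts.

EXIT_CODES = {
    0: "Success",
    1: "General error",
    2: "Invalid arguments",
    3: "Configuration error",
    4: "GPU not available",
    5: "Verification failed",
}

def describe_exit(returncode: int) -> str:
    """Human-readable description of a halo-forge exit code."""
    return EXIT_CODES.get(returncode, f"Unknown exit code {returncode}")
```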
Environment Variables
| Variable | Description |
|---|---|
PYTORCH_HIP_ALLOC_CONF | ROCm memory configuration |
HF_HOME | HuggingFace cache directory |
CUDA_VISIBLE_DEVICES | GPU selection |
HIP_VISIBLE_DEVICES | AMD GPU selection |
Quick Reference
Most Common Commands
# Test installation
halo-forge test --level smoke
# Train with RAFT
halo-forge raft train --prompts data/prompts.jsonl --verifier mingw --cycles 6
# Benchmark model
halo-forge benchmark run --model models/raft/cycle_6 --prompts data/test.jsonl
# Show info
halo-forge info
Verifier Quick Reference
| Verifier | Language | Cross-compile | Requires |
|---|---|---|---|
gcc | C/C++ | No | gcc installed |
mingw | C/C++ | Yes (Windows PE) | mingw-w64 |
msvc | C/C++ | Yes (Windows) | SSH to Windows |
rust | Rust | Yes (Windows) | rustc, cargo |
go | Go | Yes (Windows) | go installed |
dotnet | C# | Yes (Windows PE) | dotnet-sdk |
powershell | PowerShell | No | pwsh |
auto | Multi-lang | Varies | Depends on detected language |