Command Index

Complete reference for all halo-forge commands, subcommands, and flags.


Command Hierarchy

halo-forge
├── config
│   └── validate
├── data
│   ├── prepare
│   ├── generate
│   └── validate
├── sft
│   ├── train
│   └── datasets
├── raft
│   └── train
├── benchmark
│   ├── run
│   ├── full
│   └── eval
├── inference          [EXPERIMENTAL]
│   ├── optimize
│   ├── export
│   └── benchmark
├── vlm                [EXPERIMENTAL]
│   ├── sft
│   ├── train
│   ├── benchmark
│   └── datasets
├── audio              [EXPERIMENTAL]
│   ├── sft
│   ├── train
│   ├── benchmark
│   └── datasets
├── reasoning          [EXPERIMENTAL]
│   ├── sft
│   ├── train
│   ├── benchmark
│   └── datasets
├── agentic            [EXPERIMENTAL]
│   ├── sft
│   ├── train
│   ├── benchmark
│   └── datasets
├── ui
├── info
└── test

Global Flags

These flags work with all commands:

Flag | Short | Type | Description
--quiet | -q | flag | Suppress terminal output (logs still written to file)

Auto-Logging

All training and benchmark commands automatically log output to logs/ with timestamped filenames:

logs/
├── raft_train_20260110_143052.log
├── sft_train_20260110_121500.log
└── benchmark_run_20260110_160000.log

No need for manual tee or PYTHONUNBUFFERED. Use --quiet to suppress terminal output while still capturing logs.
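
For scripting, the newest log can be located by sorting on the timestamp fields embedded in the filename. A minimal sketch, assuming the logs/ layout shown above (the fixture lines exist only so the snippet runs standalone):

```shell
# Fixture: recreate the documented logs/ layout so this runs standalone
mkdir -p logs
touch logs/sft_train_20260110_121500.log logs/raft_train_20260110_143052.log

# Sort on the date (field 3) and time (field 4) parts of the filename,
# so the newest log sorts last regardless of command-name prefix
latest="$(ls -1 logs/*.log | sort -t_ -k3,3 -k4,4 | tail -n 1)"
echo "$latest"
```

Sorting on filename fields is deliberate here: modification times can tie, but the timestamped names always order consistently.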

UI Relaunch Context

The web UI now writes durable launch metadata for each training and benchmark run:

  • File: <output_dir>/launch_context.json
  • Purpose: enables Monitor/Results Rerun, Clone to Form, and training Resume Latest
  • Resume scope: raft, vlm, audio, reasoning, agentic (checkpoint-based); benchmark supports rerun/clone only
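
To see which runs carry relaunch metadata, it is enough to search output directories for that file. A sketch, where models/raft_3b is a made-up example path and the fixture lines only make the snippet self-contained:

```shell
# Fixture: a fake run directory with launch metadata (illustrative path)
mkdir -p models/raft_3b
printf '{}' > models/raft_3b/launch_context.json

# List every run the UI can rerun/clone (and, for training runs, resume)
find models -maxdepth 2 -name launch_context.json
```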

Core Commands (Production Ready)

halo-forge config validate

Validate a configuration file.

Flag | Short | Type | Required | Default | Description
config | - | path | Yes | - | Path to config file
--type | -t | string | No | auto | Config type: raft, sft, auto
--verbose | -v | flag | No | false | Show config contents

halo-forge config validate configs/raft_windows.yaml
halo-forge config validate configs/sft.yaml --type sft --verbose

halo-forge data prepare

Download and prepare public datasets.

Flag | Short | Type | Required | Default | Description
--dataset | -d | string | No | - | Dataset name
--output | -o | path | No | - | Output file path
--template | - | string | No | qwen | Chat template
--system-prompt | - | string | No | - | Override system prompt
--list | - | flag | No | false | List available datasets

halo-forge data prepare --list
halo-forge data prepare --dataset humaneval --output data/humaneval.jsonl
halo-forge data prepare --dataset mbpp --template qwen --system-prompt "You are a Python expert."

halo-forge data generate

Generate training data using an LLM.

Flag | Short | Type | Required | Default | Description
--topic | -t | string | No | - | Topic name
--backend | -b | string | No | deepseek | LLM backend
--model | - | string | No | - | Model name for backend
--output | -o | path | No | - | Output file path
--template | - | string | No | qwen | Chat template
--list | - | flag | No | false | List available topics

halo-forge data generate --list
halo-forge data generate --topic windows_api --backend deepseek --output data/windows.jsonl

halo-forge data validate

Validate dataset format.

This command is dependency-light by design: it validates local JSONL structure without requiring public dataset download dependencies.

Flag | Short | Type | Required | Default | Description
file | - | path | Yes | - | Path to JSONL file
--preview | -p | flag | No | false | Show preview of examples

halo-forge data validate data/training.jsonl
halo-forge data validate data/training.jsonl --preview

halo-forge sft train

Run supervised fine-tuning.

Flag | Short | Type | Required | Default | Description
--config | -c | path | No | - | Config file path
--model | -m | string | No | Qwen/Qwen2.5-Coder-7B | Base model
--dataset | -d | string | No | - | HuggingFace dataset ID or short name
--data | - | path | No | - | Local training data file (JSONL)
--max-samples | - | int | No | - | Limit number of training samples
--output | -o | path | No | models/sft | Output directory
--epochs | - | int | No | 3 | Number of epochs
--resume | - | path | No | - | Resume from checkpoint
--dry-run | - | flag | No | false | Validate config without training

Dataset short names: codealpaca, metamath, gsm8k_sft, llava, librispeech_sft, xlam_sft, glaive_sft

# Using HuggingFace dataset
halo-forge sft train --dataset codealpaca --model Qwen/Qwen2.5-Coder-3B --output models/sft_3b

# Using local data
halo-forge sft train --data data/sft.jsonl --model Qwen/Qwen2.5-Coder-3B --output models/sft_3b

# With sample limit
halo-forge sft train --dataset metamath --max-samples 50000 --epochs 2

# Resume from checkpoint
halo-forge sft train --config configs/sft.yaml --resume models/sft/checkpoint-500

halo-forge sft datasets

List available SFT datasets.

halo-forge sft datasets

Output shows datasets organized by domain (Code, Reasoning, VLM, Audio, Agentic) with HuggingFace IDs and sizes.


halo-forge raft train

Run RAFT (Reward-Ranked Fine-Tuning).

Flag | Short | Type | Required | Default | Description
--config | -c | path | No | - | Config file path
--model | -m | string | No | Qwen/Qwen2.5-Coder-3B | Base model
--checkpoint | - | path | No | - | SFT checkpoint path
--prompts | -p | path | No | - | Prompts file
--output | -o | path | No | models/raft | Output directory
--cycles | - | int | No | 6 | Number of RAFT cycles
--verifier | - | string | No | gcc | Verifier type (see below)
--samples-per-prompt | - | int | No | 8 | Samples to generate per prompt
--temperature | - | float | No | 0.7 | Generation temperature
--max-new-tokens | - | int | No | 1024 | Max tokens to generate
--keep-percent | - | float | No | 0.5 | Keep top X% of passing samples
--reward-threshold | - | float | No | 0.5 | Minimum reward to pass
--min-samples | - | int | No | - | Auto-adjust threshold if fewer pass
--curriculum | - | string | No | none | Curriculum strategy
--curriculum-stats | - | path | No | - | Historical stats file (for historical curriculum)
--curriculum-start | - | float | No | 0.2 | Start fraction (for progressive curriculum)
--curriculum-increment | - | float | No | 0.2 | Increment per cycle (for progressive curriculum)
--reward-shaping | - | string | No | fixed | Reward shaping strategy
--lr-decay | - | float | No | 0.85 | LR decay per cycle
--min-lr | - | float | No | 1e-6 | Minimum learning rate
--experimental-attention | - | flag | No | false | Enable experimental ROCm attention
--system-prompt | - | string | No | (Windows prompt) | System prompt
--host | - | string | No | - | MSVC verifier host
--user | - | string | No | - | MSVC verifier user
--ssh-key | - | path | No | - | MSVC verifier SSH key

Verifier choices: gcc, mingw, msvc, rust, go, dotnet, powershell, humaneval, mbpp, python, auto

Curriculum choices: none, complexity, progressive, adaptive, historical

Reward shaping choices: fixed, annealing, adaptive, warmup

# Basic RAFT training
halo-forge raft train \
  --model Qwen/Qwen2.5-Coder-3B \
  --prompts data/prompts.jsonl \
  --verifier mingw \
  --cycles 6 \
  --output models/raft_3b

# With SFT checkpoint and LR decay
halo-forge raft train \
  --checkpoint models/sft_3b/final \
  --prompts data/prompts.jsonl \
  --verifier auto \
  --lr-decay 0.85 \
  --cycles 6

# With MSVC verifier
halo-forge raft train \
  --prompts data/windows.jsonl \
  --verifier msvc \
  --host 10.0.0.152 \
  --user keys \
  --ssh-key ~/.ssh/win

halo-forge benchmark run

Run pass@k benchmark.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | Model path
--prompts | -p | path | Yes | - | Prompts file
--output | -o | path | No | - | Output file path
--samples | - | int | No | 10 | Samples per prompt
--k | - | string | No | 1,5,10 | k values (comma-separated)
--max-prompts | - | int | No | - | Max prompts to evaluate
--verifier | - | string | No | gcc | Verifier type (gcc, humaneval, mbpp, etc.)
--base-model | - | string | No | Qwen/Qwen2.5-Coder-7B | Base model
--system-prompt | - | string | No | (Windows prompt) | System prompt
--host | - | string | No | - | MSVC host
--user | - | string | No | - | MSVC user
--ssh-key | - | path | No | - | MSVC SSH key
--cross-compile | - | flag | No | false | Windows cross-compile (rust/go)
--run-after-compile | - | flag | No | false | Run after compile
--experimental-attention | - | flag | No | false | Enable experimental ROCm attention

halo-forge benchmark run \
  --model models/raft_3b/cycle_6 \
  --prompts data/test.jsonl \
  --verifier mingw \
  --samples 10 \
  --output results/benchmark.json
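
For reference, pass@k in this setting is conventionally computed with the unbiased estimator from the HumanEval literature; this is shown as background, since the source does not spell out the tool's exact estimator:

```latex
% n = --samples (generations per prompt), c = generations that pass verification
\text{pass@}k \;=\; \mathbb{E}_{\text{prompts}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]
```

Intuitively, the bracketed term is the probability that at least one of k randomly chosen samples (out of the n generated) passes the verifier.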

halo-forge benchmark full

Run comprehensive RAFT benchmark.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No* | - | Model to benchmark
--suite | -s | string | No* | - | Predefined suite
--cycles | -c | int | No | 2 | RAFT cycles
--output | -o | path | No | results/benchmarks | Output directory
--quiet | -q | flag | No | false | Minimal output

*Either --model or --suite is required.

Suite choices: all (0.5B, 1.5B, 3B), small (0.5B), medium (0.5B, 1.5B)

halo-forge benchmark full --model Qwen/Qwen2.5-Coder-0.5B --cycles 2
halo-forge benchmark full --suite all --output results/full_benchmark

halo-forge benchmark eval

Evaluate a model on standard benchmarks (HumanEval, MBPP, LiveCodeBench, etc.).

Flag | Short | Type | Required | Default | Description
--model | -m | string | Yes | - | Model name or path
--benchmark | -b | string | No | humaneval | Benchmark name
--limit | - | int | No | - | Max samples to evaluate
--output | -o | path | No | - | Output file path
--samples-per-prompt | - | int | No | 5 | Samples per prompt for pass@k
--run-after-compile | - | flag | No | false | Run compiled code
--language | - | string | No | - | Language (cpp, rust, go)
--verifier | - | string | No | - | Verifier type

Benchmark choices: humaneval, mbpp, livecodebench, cpp, rust, go

halo-forge benchmark eval --model models/raft/final --benchmark humaneval --limit 164
halo-forge benchmark eval --model models/raft/final --benchmark cpp --language cpp --run-after-compile

halo-forge info

Show hardware and system information.

halo-forge info

halo-forge ui

Launch the web UI.

Flag | Short | Type | Required | Default | Description
--host | - | string | No | 127.0.0.1 | Host to bind
--port | -p | int | No | 8080 | Port to bind
--reload | - | flag | No | false | Enable hot reload
--open-browser | - | flag | No | false | Auto-open browser after startup
--no-browser | - | flag | No | false | Disable browser auto-open (this is the default behavior)

halo-forge ui --no-browser
halo-forge ui --open-browser
halo-forge ui --host 0.0.0.0 --port 8080

Common deep-link routes for operator workflows:

http://127.0.0.1:8080/training?mode=audio&ui_mode=quickstart&preset=audio_whisper_tiny
http://127.0.0.1:8080/benchmark?view=non_code&ui_mode=quickstart&preset=non_code_smoke
http://127.0.0.1:8080/inference?mode=optimize&ui_mode=quickstart&preset=optimize_int4_smoke
http://127.0.0.1:8080/ops-console?module=data&execution_mode=live

halo-forge test

Run pipeline validation tests.

Flag | Short | Type | Required | Default | Description
--level | -l | string | No | standard | Test level
--model | -m | string | No | Qwen/Qwen2.5-Coder-0.5B | Model for testing
--verbose | -v | flag | No | false | Verbose output
--baseline-file | - | path | No | tests/baselines/modality_runtime_baseline.v1.json | Baseline JSON path (modality / ops-burnin / all-module-qualification)
--write-baseline | - | flag | No | false | Write/overwrite baseline (modality / ops-burnin / all-module-qualification)
--compare-baseline | - | flag | No | false | Compare run to baseline and fail on hard drift
--report-file | - | path | No | results/readiness/ops_e2e_launch_reliability.v1.json | Report output path (ops-e2e / ops-burnin / all-modules / walkthroughs / all-module-qualification / all-module-bootstrap / all-module-live)
--strict | - | flag | No | false | Fail on status=fail modules (ops-e2e / ops-burnin / all-modules / walkthroughs / all-module-qualification / all-module-bootstrap / all-module-live)
--seed | - | int | No | 42 | Deterministic seed (ops-e2e / ops-burnin / all-modules / walkthroughs / all-module-qualification / all-module-bootstrap / all-module-live)
--fixture-pack | - | string | No | - | Fixture pack (v1) or custom path (ops-e2e / all-modules / all-module-qualification level)
--burnin-profile | - | string | No | tiny-v1 | Dataset-backed burn-in profile (ops-burnin level)
--profile | - | string | No | bounded-v1 | Readiness profile (all-modules level) or walkthrough profile (contract-v1/live-local)
--qualification-profile | - | string | No | contract-v1 | Qualification profile (contract-v1 / fixture-v1 / live-local) for all-module-qualification
--bootstrap-profile | - | string | No | contract-v1 | Bootstrap profile (contract-v1 / live-local) for all-module-bootstrap
--live-profile | - | string | No | live-smoke-v1 | Live execution profile (live-smoke-v1 / live-local) for all-module-live
--output-root | - | path | No | results/bootstrap | Evidence output root for all-module-bootstrap or all-module-live
--module | - | string (repeatable) | No | - | Filter module(s) for all-modules, walkthroughs, all-module-qualification, all-module-bootstrap, or all-module-live
--execute | - | flag | No | false | Execute bounded probes for walkthroughs when using --profile live-local
--show-fix-commands | - | flag | No | false | Emit ALL_QUAL_FIX remediation lines for all-module-qualification

Level choices:

  • smoke: no GPU
  • standard: with GPU
  • full: with training
  • modality: deterministic modality fixture + smoke suite
  • ops-e2e: non-code launch lifecycle reliability
  • ops-burnin: bounded dataset-backed non-code burn-in
  • all-modules: coding + non-coding readiness checks
  • walkthroughs: internal/operator E2E walkthrough contract validation
  • all-module-qualification: explicit bounded lifecycle qualification orchestration
  • all-module-bootstrap: bounded evidence generation/remediation for all-module readiness
  • all-module-live: bounded live execution closure probes across all modules

Baseline drift checks validate runtime contract stability, not model-quality promotion thresholds.

halo-forge test --level smoke
halo-forge test --level standard --verbose
halo-forge test --level full --model Qwen/Qwen2.5-Coder-1.5B
halo-forge test --level modality
halo-forge test --level modality --compare-baseline
halo-forge test --level modality --write-baseline
halo-forge test --level ops-e2e --fixture-pack v1 --report-file results/readiness/ops_e2e_launch_reliability.v1.json
halo-forge test --level ops-e2e --fixture-pack v1 --strict
halo-forge test --level ops-burnin --burnin-profile tiny-v1 --report-file results/readiness/ops_dataset_burnin.v1.json
halo-forge test --level ops-burnin --burnin-profile tiny-v1 --compare-baseline --strict
halo-forge test --level all-modules --fixture-pack v1 --report-file results/readiness/all_modules_readiness.v1.json
halo-forge test --level all-modules --fixture-pack v1 --strict
halo-forge test --level all-modules --module sft --module raft --strict
halo-forge test --level walkthroughs --profile contract-v1 --report-file .internal_docs/research_testing/walkthroughs/reports/all_module_e2e_walkthrough_report.v1.json
halo-forge test --level walkthroughs --module sft --module raft --profile live-local --execute
halo-forge test --level all-module-qualification --qualification-profile contract-v1 --report-file results/readiness/all_module_qualification.v1.json
halo-forge test --level all-module-qualification --qualification-profile fixture-v1 --fixture-pack v1 --compare-baseline --baseline-file tests/baselines/all_module_qualification_baseline.v1.json --strict
halo-forge test --level all-module-qualification --show-fix-commands
halo-forge test --level all-module-bootstrap --bootstrap-profile contract-v1 --report-file results/readiness/all_module_bootstrap.v1.json
halo-forge test --level all-module-bootstrap --bootstrap-profile live-local --module inference --strict
halo-forge test --level all-module-live --live-profile live-smoke-v1 --report-file results/readiness/all_module_live_execution.v1.json
halo-forge test --level all-module-live --live-profile live-local --module inference --strict

Equivalent script entrypoint:

python3 scripts/run_all_module_qualification.py \
  --qualification-profile fixture-v1 \
  --fixture-pack v1 \
  --write-report \
  --show-fix-commands \
  --report-file results/readiness/all_module_qualification.v1.json

python3 scripts/run_all_module_bootstrap.py \
  --bootstrap-profile contract-v1 \
  --write-report \
  --report-file results/readiness/all_module_bootstrap.v1.json

python3 scripts/run_all_module_live_matrix.py \
  --live-profile live-smoke-v1 \
  --write-report \
  --report-file results/readiness/all_module_live_execution.v1.json

Non-code modality UI readiness reports (contract-only) can be generated with:

python3 scripts/run_non_code_modality_matrix.py \
  --validate-training vlm=models/phase7d/vlm_phase7d \
  --validate-training audio=models/phase7d/audio_phase7d \
  --validate-training reasoning=models/phase7d/reasoning_phase7d \
  --validate-training agentic=models/phase7d/agentic_phase7d \
  --write-readiness-report \
  --readiness-from-validation \
  --readiness-report-file results/readiness/non_code_modalities_readiness.v1.json

status=pass|warn|fail is contract-based runtime readiness, not model-quality promotion.

Cross-module ops readiness reports (non-coding scope) can be generated with:

python3 scripts/run_ops_module_matrix.py \
  --fixture-pack v1 \
  --write-report \
  --report-file results/readiness/ops_modules_readiness.v1.json

Strict fixture-backed gate (used in nightly CI):

python3 scripts/run_ops_module_matrix.py --fixture-pack v1 --strict

Ops E2E launch lifecycle reliability (non-coding scope):

python3 scripts/run_ops_e2e_reliability.py \
  --fixture-pack v1 \
  --write-report \
  --report-file results/readiness/ops_e2e_launch_reliability.v1.json

Strict nightly E2E gate:

python3 scripts/run_ops_e2e_reliability.py --fixture-pack v1 --strict

Dataset-backed non-code burn-in report:

python3 scripts/run_ops_dataset_burnin.py \
  --burnin-profile tiny-v1 \
  --write-report \
  --report-file results/readiness/ops_dataset_burnin.v1.json

Strict nightly burn-in + baseline drift gate:

python3 scripts/run_ops_dataset_burnin.py \
  --burnin-profile tiny-v1 \
  --strict \
  --compare-baseline \
  --baseline-file tests/baselines/ops_dataset_burnin_baseline.v1.json

All-module parity readiness (coding + non-coding):

python3 scripts/run_all_module_matrix.py \
  --fixture-pack v1 \
  --write-report \
  --report-file results/readiness/all_modules_readiness.v1.json

Strict nightly all-module gate:

python3 scripts/run_all_module_matrix.py --fixture-pack v1 --strict

CI policy:

  • PR/push CI uses non-strict report generation (informational for readiness status).
  • Nightly CI uses strict mode and fails on module status=fail plus hard contract drift (readiness + qualification + bootstrap + live execution + E2E + dataset burn-in reports).

Readiness interpretation:

  • WARN commonly indicates missing historical evidence (for example prior training_summary.json or benchmark outputs) and remains non-blocking for UI launches.
  • FAIL indicates a contract/preflight issue; check launch_blocked, issue_code, severity, what_is_missing, and fix_now in readiness/qualification payload entries.
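
A bounded sketch of triaging a readiness/qualification payload by those fields. The inline report object is a stand-in fixture with an assumed shape (the field names follow the list above; a real report lives under results/readiness/):

```shell
# Surface failing modules with their remediation hints.
# The heredoc payload is a fixture; point this at a real report in practice.
python3 - <<'EOF' > failing_modules.txt
report = {"modules": [
    {"module": "sft", "status": "pass"},
    {"module": "vlm", "status": "fail", "launch_blocked": True,
     "issue_code": "MISSING_EVIDENCE", "severity": "hard",
     "what_is_missing": "training_summary.json",
     "fix_now": "run a bounded training cycle first"},
]}
for entry in report["modules"]:
    if entry["status"] == "fail":
        print(entry["module"], entry["issue_code"], "->", entry["fix_now"])
EOF
cat failing_modules.txt
```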

Experimental Commands

These commands are in active development. APIs may change.

halo-forge inference optimize

Optimize model for deployment.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | Model path
--target-precision | - | string | No | int4 | Target precision
--target-latency | - | float | No | 50.0 | Target latency (ms)
--calibration-data | - | path | No | - | Calibration data JSONL
--output | -o | path | No | models/optimized | Output directory

Precision choices: int4, int8, fp16

halo-forge inference optimize \
  --model models/raft_7b/cycle_6 \
  --target-precision int4 \
  --output models/optimized

halo-forge inference export

Export model to deployment format.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | Model path
--format | -f | string | Yes | - | Export format
--quantization | -q | string | No | Q4_K_M | GGUF quantization
--output | -o | path | Yes | - | Output path

Format choices: gguf, onnx

GGUF quantization types: Q4_K_M, Q4_K_S, Q8_0, F16

halo-forge inference export \
  --model models/trained \
  --format gguf \
  --quantization Q4_K_M \
  --output models/model.gguf

halo-forge inference benchmark

Benchmark inference latency.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | Model path
--prompts | -p | path | No | - | Test prompts JSONL
--num-prompts | - | int | No | 10 | Number of prompts
--max-tokens | - | int | No | 100 | Max tokens to generate
--warmup | - | int | No | 3 | Warmup iterations
--measure-memory | - | flag | No | false | Measure memory usage

halo-forge inference benchmark \
  --model models/optimized \
  --num-prompts 50 \
  --measure-memory

halo-forge vlm sft

SFT training for VLM.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2-VL-2B-Instruct | VLM model
--dataset | -d | string | No | llava | SFT dataset name
--max-samples | - | int | No | - | Limit training samples
--output | -o | path | No | models/vlm_sft | Output directory
--epochs | - | int | No | 2 | Number of epochs
--dry-run | - | flag | No | false | Validate config only

halo-forge vlm sft --dataset llava --model Qwen/Qwen2-VL-2B-Instruct --output models/vlm_sft

halo-forge vlm train

Train a VLM with RAFT. Real training is enabled; use --allow-prototype-train only if a modality is temporarily re-gated.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2-VL-7B-Instruct | VLM model
--dataset | -d | string | Yes | - | Dataset name or JSONL
--output | -o | path | No | models/vlm_raft | Output directory
--cycles | - | int | No | 6 | RAFT cycles
--samples-per-prompt | - | int | No | 4 | Samples per prompt
--perception-weight | - | float | No | 0.3 | Perception weight
--reasoning-weight | - | float | No | 0.4 | Reasoning weight
--output-weight | - | float | No | 0.3 | Output weight
--lr-decay | - | float | No | 0.85 | LR decay per cycle
--temperature | - | float | No | 0.7 | Generation temperature
--limit | - | int | No | - | Limit dataset samples
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index
--seed | - | int | No | 42 | Deterministic runtime seed

Dataset choices: textvqa, docvqa, chartqa, realworldqa, mathvista

Supported model families: qwen2-vl, qwen-vl, llava

halo-forge vlm train \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --dataset textvqa \
  --cycles 6 \
  --output models/vlm_textvqa \
  --seed 42

halo-forge vlm benchmark

Benchmark VLM on dataset.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | VLM model path
--dataset | -d | string | No | textvqa | Dataset name
--split | - | string | No | validation | Dataset split
--limit | - | int | No | 100 | Limit samples
--output | -o | path | No | - | Output file

halo-forge vlm benchmark \
  --model models/vlm_raft/cycle_6 \
  --dataset docvqa \
  --limit 200 \
  --output results/vlm_benchmark.json

halo-forge vlm datasets

List available VLM datasets.

halo-forge vlm datasets

halo-forge audio sft

SFT training for audio models.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | openai/whisper-small | Audio model
--dataset | -d | string | No | librispeech_sft | SFT dataset name
--max-samples | - | int | No | - | Limit training samples
--output | -o | path | No | models/audio_sft | Output directory
--epochs | - | int | No | 3 | Number of epochs
--dry-run | - | flag | No | false | Validate config only

halo-forge audio sft --dataset librispeech_sft --model openai/whisper-small --output models/audio_sft

halo-forge audio train

Train an audio model with RAFT. Real training is enabled; use --allow-prototype-train only if a modality is temporarily re-gated.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | openai/whisper-small | Audio model
--dataset | -d | string | Yes | - | Dataset name
--task | - | string | No | asr | Task: asr, tts, classification
--output | -o | path | No | models/audio_raft | Output directory
--cycles | - | int | No | 4 | RAFT cycles
--lr | - | float | No | 1e-5 | Learning rate
--lr-decay | - | float | No | 0.85 | LR decay per cycle
--limit | - | int | No | - | Limit dataset samples
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index
--seed | - | int | No | 42 | Deterministic runtime seed

Dataset choices: librispeech, common_voice, audioset, speech_commands

Supported model families: whisper

halo-forge audio train \
  --model openai/whisper-small \
  --dataset librispeech \
  --task asr \
  --cycles 4 \
  --output models/audio_asr \
  --seed 42

halo-forge audio benchmark

Benchmark audio model on dataset.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | Audio model path
--dataset | -d | string | No | librispeech | Dataset name
--task | - | string | No | asr | Task type
--limit | - | int | No | 100 | Limit samples
--output | -o | path | No | - | Output file

halo-forge audio benchmark \
  --model openai/whisper-small \
  --dataset librispeech \
  --limit 50 \
  --output results/audio_benchmark.json

halo-forge audio datasets

List available audio datasets.

halo-forge audio datasets

halo-forge reasoning sft

SFT training for reasoning models.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2.5-3B-Instruct | Base model
--dataset | -d | string | No | metamath | SFT dataset name
--max-samples | - | int | No | - | Limit training samples
--output | -o | path | No | models/reasoning_sft | Output directory
--epochs | - | int | No | 2 | Number of epochs
--dry-run | - | flag | No | false | Validate config only

SFT Dataset choices: metamath, gsm8k_sft

halo-forge reasoning sft --dataset metamath --model Qwen/Qwen2.5-3B-Instruct --output models/reasoning_sft

halo-forge reasoning train

Train on math/reasoning datasets with RAFT. Real training is enabled; use --allow-prototype-train only if a modality is temporarily re-gated.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Base model
--dataset | -d | string | Yes | - | Dataset name
--output | -o | path | No | models/reasoning_raft | Output directory
--cycles | - | int | No | 4 | RAFT cycles
--lr | - | float | No | 1e-5 | Learning rate
--lr-decay | - | float | No | 0.85 | LR decay per cycle
--limit | - | int | No | - | Limit dataset samples
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index
--seed | - | int | No | 42 | Deterministic runtime seed

Supported model families: qwen2.5, qwen2, qwen, llama-3, llama3, mistral

RAFT Dataset choices: gsm8k, math, aime

halo-forge reasoning train \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset gsm8k \
  --cycles 4 \
  --output models/reasoning_gsm8k \
  --seed 42

halo-forge reasoning benchmark

Benchmark on math/reasoning dataset.

Flag | Short | Type | Required | Default | Description
--model | -m | path | Yes | - | Model path
--dataset | -d | string | No | gsm8k | Dataset name
--limit | - | int | No | 100 | Limit samples
--output | -o | path | No | - | Output file

halo-forge reasoning benchmark \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset gsm8k \
  --limit 100 \
  --output results/reasoning_benchmark.json

halo-forge reasoning datasets

List available math/reasoning datasets.

halo-forge reasoning datasets

Agentic Commands (Experimental)

halo-forge agentic sft

SFT training for tool calling models.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Base model
--dataset | -d | string | No | xlam_sft | SFT dataset name
--max-samples | - | int | No | - | Limit training samples
--output | -o | path | No | models/agentic_sft | Output directory
--epochs | - | int | No | 2 | Number of epochs
--dry-run | - | flag | No | false | Validate config only

SFT Dataset choices: xlam_sft, glaive_sft

halo-forge agentic sft --dataset xlam_sft --model Qwen/Qwen2.5-7B-Instruct --output models/agentic_sft

halo-forge agentic train

Train on tool calling datasets with RAFT. Real training is enabled; use --allow-prototype-train only if a modality is temporarily re-gated.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Base model
--dataset | -d | string | No | xlam | RAFT dataset: xlam, glaive
--output | -o | path | No | models/agentic_raft | Output directory
--cycles | - | int | No | 5 | RAFT cycles
--lr | - | float | No | 5e-5 | Learning rate
--lr-decay | - | float | No | 0.85 | LR decay per cycle
--limit | - | int | No | - | Limit dataset samples
--dry-run | - | flag | No | false | Validate config only
--allow-prototype-train | - | flag | No | false | Compatibility override for temporary prototype gating
--resume-from-cycle | - | int | No | 0 | Resume from a previously saved cycle index
--seed | - | int | No | 42 | Deterministic runtime seed

Supported model families: qwen2.5, qwen2, qwen, llama-3, llama3, mistral

halo-forge agentic train \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset xlam \
  --cycles 5 \
  --output models/agentic_raft \
  --seed 42

halo-forge agentic benchmark

Benchmark tool calling accuracy.

Flag | Short | Type | Required | Default | Description
--model | -m | string | No | Qwen/Qwen2.5-7B-Instruct | Model to benchmark
--dataset | -d | string | No | xlam | Dataset: xlam, glaive
--limit | - | int | No | 100 | Limit samples
--output | -o | path | No | - | Output file

halo-forge agentic benchmark \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset xlam \
  --limit 100 \
  --output results/agentic_benchmark.json

halo-forge agentic datasets

List available tool calling datasets.

halo-forge agentic datasets

Output:

Available Agentic / Tool Calling Datasets
============================================================
  xlam         [Tool Calling] - 60k verified, 3,673 APIs
  glaive       [Tool Calling] - 113k samples, irrelevance
  toolbench    [Tool Calling] - 188k samples, 16k APIs
  hermes       [Tool Calling] - Format reference

Exit Codes

Code | Description
0 | Success
1 | General error
2 | Invalid arguments
3 | Configuration error
4 | GPU not available
5 | Verification failed
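
These codes make the CLI easy to gate in automation. A sketch, where `sh -c 'exit 4'` stands in for a real halo-forge invocation so the snippet runs without the CLI installed:

```shell
# Branch on the documented exit codes; in a real pipeline, replace the
# stand-in command with e.g. `halo-forge test --level smoke`.
if sh -c 'exit 4'; then status=0; else status=$?; fi
case "$status" in
  0) msg="success" ;;
  2) msg="invalid arguments" ;;
  3) msg="configuration error" ;;
  4) msg="GPU not available" ;;
  5) msg="verification failed" ;;
  *) msg="general error ($status)" ;;
esac
echo "$msg"
```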

Environment Variables

Variable | Description
PYTORCH_HIP_ALLOC_CONF | ROCm memory configuration
HF_HOME | HuggingFace cache directory
CUDA_VISIBLE_DEVICES | GPU selection
HIP_VISIBLE_DEVICES | AMD GPU selection
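
A typical pre-run environment setup might look like the following; the specific values are illustrative assumptions, not defaults documented by this project:

```shell
# Illustrative values; adjust per machine
export HF_HOME="$HOME/.cache/huggingface"   # HuggingFace cache directory
export HIP_VISIBLE_DEVICES=0                # select the first AMD GPU
export CUDA_VISIBLE_DEVICES=0               # GPU selection on CUDA systems
export PYTORCH_HIP_ALLOC_CONF="expandable_segments:True"  # ROCm allocator tuning (assumed value)
echo "caching models under $HF_HOME"
```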

Quick Reference

Most Common Commands

# Test installation
halo-forge test --level smoke

# Train with RAFT
halo-forge raft train --prompts data/prompts.jsonl --verifier mingw --cycles 6

# Benchmark model
halo-forge benchmark run --model models/raft/cycle_6 --prompts data/test.jsonl

# Show info
halo-forge info

Verifier Quick Reference

Verifier | Language | Cross-compile | Requires
gcc | C/C++ | No | gcc installed
mingw | C/C++ | Yes (Windows PE) | mingw-w64
msvc | C/C++ | Yes (Windows) | SSH to Windows
rust | Rust | Yes (Windows) | rustc, cargo
go | Go | Yes (Windows) | go installed
dotnet | C# | Yes (Windows PE) | dotnet-sdk
powershell | PowerShell | No | pwsh
auto | Multi-lang | Varies | Depends on detected language