Inference Optimization

⚠️ Experimental Feature: Inference optimization is under active development. APIs may change.

Optimize trained models for efficient deployment with halo-forge’s inference tools.

Overview

After training with SFT and RAFT, models can be optimized for production:

Training Complete → Quantization → Quality Verification → Export

Key Features

| Feature | Description | CLI Command |
|---------|-------------|-------------|
| Quantization | Reduce model size with INT4/INT8 | halo-forge inference optimize |
| Quality Verification | Ensure quality meets thresholds | Automatic during optimize |
| GGUF Export | Export for llama.cpp/Ollama | halo-forge inference export --format gguf |
| ONNX Export | Export for cross-platform inference | halo-forge inference export --format onnx |
| Latency Benchmarking | Measure inference speed | halo-forge inference benchmark |

Quick Start

# Optimize a trained model
halo-forge inference optimize \
  --model models/windows_raft_7b/cycle_6_final \
  --target-precision int4 \
  --output models/optimized

# Export to GGUF for Ollama
halo-forge inference export \
  --model models/optimized \
  --format gguf \
  --quantization Q4_K_M \
  --output models/windows-7b.gguf

# Benchmark latency
halo-forge inference benchmark \
  --model models/optimized \
  --num-prompts 20
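
The benchmark command reports latency statistics out of the box. For a rough independent check, here is a minimal sketch that times the GGUF exported above with llama-cpp-python; the prompt, token budget, and use of llama-cpp-python are assumptions for illustration, not halo-forge internals:

import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Load the GGUF produced by `halo-forge inference export` above.
llm = Llama(model_path="models/windows-7b.gguf", n_ctx=2048, verbose=False)

prompt = "Write a PowerShell one-liner that lists stopped services."
latencies_ms = []
for _ in range(20):  # mirrors --num-prompts 20
    start = time.perf_counter()
    llm(prompt, max_tokens=64)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean latency: {sum(latencies_ms) / len(latencies_ms):.1f} ms")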

Quantization Options

| Precision | Size Reduction | Quality | Use Case |
|-----------|----------------|---------|----------|
| int4 | ~75% | Good | Edge deployment |
| int8 | ~50% | Better | Server deployment |
| fp16 | ~50% | Best | High-quality inference |
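
For intuition about what int4 loading looks like, here is a minimal sketch using Hugging Face transformers with bitsandbytes. This illustrates the general technique, not halo-forge's internal quantizer; the model path is reused from Quick Start:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization: weights stored in 4 bits, compute in bf16.
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "models/windows_raft_7b/cycle_6_final",
    quantization_config=config,
    device_map="auto",
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # ~25% of the fp16 footprint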

Export Formats

GGUF (llama.cpp)

Best for:

  • Local inference with Ollama
  • CPU-only systems
  • Memory-constrained environments

Quantization types:

  • Q4_K_M - Recommended balance (quality/size)
  • Q4_K_S - Smaller, slightly lower quality
  • Q8_0 - Highest quality 8-bit
  • F16 - No quantization (largest)
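
Once the GGUF is registered with Ollama, you can call it from Python through the official ollama client. A sketch under stated assumptions: the model name windows-coder, the Modelfile registration steps, and the prompt are placeholders, not halo-forge output:

# One-time registration (shell):
#   echo "FROM ./models/windows-7b.gguf" > Modelfile
#   ollama create windows-coder -f Modelfile
import ollama  # pip install ollama

resp = ollama.chat(
    model="windows-coder",  # placeholder name from the registration step above
    messages=[{"role": "user", "content": "Show CPU usage per process in PowerShell."}],
)
print(resp["message"]["content"])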

ONNX

Best for:

  • Cross-platform deployment
  • Integration with ONNX Runtime
  • TensorRT/OpenVINO optimization
  • Web deployment
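
A quick way to sanity-check an exported model is to open it with ONNX Runtime and inspect its inputs. The file name model.onnx inside the export directory is an assumption:

import onnxruntime as ort  # pip install onnxruntime

session = ort.InferenceSession(
    "models/optimized/model.onnx",
    providers=["CPUExecutionProvider"],  # swap in CUDAExecutionProvider on GPU
)
for tensor in session.get_inputs():
    print(tensor.name, tensor.shape, tensor.type)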

Quality Verification

During optimization, halo-forge automatically verifies:

  1. Latency - Meets target (default: 50ms)
  2. Quality - Output similarity to original model (default: 95%)

# Custom thresholds (latency target in milliseconds)
halo-forge inference optimize \
  --model models/trained \
  --target-latency 100 \
  --target-precision int4
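
The exact similarity metric halo-forge uses is not documented here. As a rough illustration of the idea, the sketch below scores agreement between the original and quantized models' outputs on the same prompt; difflib's ratio is a stand-in, not the actual metric, and the example strings are hypothetical:

import difflib

def output_similarity(baseline: str, optimized: str) -> float:
    # Character-level similarity ratio in [0, 1]; 1.0 = identical outputs.
    return difflib.SequenceMatcher(None, baseline, optimized).ratio()

# Hypothetical outputs from the original and int4 models on one prompt.
baseline = "Get-Service | Where-Object {$_.Status -eq 'Stopped'}"
optimized = "Get-Service | Where-Object { $_.Status -eq 'Stopped' }"
score = output_similarity(baseline, optimized)
print(f"similarity: {score:.1%}")  # compare against the 95% default threshold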

Standalone GGUF Export Script

For a simpler workflow, use the standalone export script:

# Install llama-cpp-python (one-time)
pip install llama-cpp-python

# Convert directly from trained model to GGUF
python scripts/export_gguf.py \
  --model models/windows_raft_1.5b/final_model \
  --output windows_coder.Q4_K_M.gguf \
  --quantization Q4_K_M

See the GGUF Export Guide for full documentation.

Next Steps