GGUF Export

Export fine-tuned models to GGUF format for llama.cpp, Ollama, and local inference.

Why GGUF?

  • Fast inference - Optimized for CPU and GPU
  • Small files - Quantization reduces size 2-4x
  • No Python needed - Deploy without a Python runtime
  • Ollama compatible - Easy model serving

Quick Start

# Install dependency
pip install llama-cpp-python

# Convert your model
python scripts/export_gguf.py \
  --model models/my_finetuned \
  --output my_model.gguf
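
A quick way to smoke-test the result is to load it with llama-cpp-python and generate a short completion. A minimal sketch; the path and prompt are placeholders:

# Load the exported GGUF and run one short completion as a sanity check.
from llama_cpp import Llama

llm = Llama(model_path="my_model.gguf", n_ctx=2048)
out = llm("Write a C function that reverses a string.", max_tokens=64)
print(out["choices"][0]["text"])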

Installation

CPU-only (simplest)

pip install llama-cpp-python

AMD GPU (ROCm)

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall

Clone llama.cpp (most control)

git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && make

Quantization Options

Type     Size (7B model)   Quality    Description
Q4_K_M   ~4 GB             Good       Recommended
Q8_0     ~7 GB             Best       Higher quality
F16      ~14 GB            Original   No quantization
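
These sizes follow from bits per weight: F16 stores 2 bytes per parameter, while Q8_0 and Q4_K_M average roughly 8.5 and 4.85 bits. A quick back-of-the-envelope check in Python (illustrative figures only; real files also carry metadata and a few higher-precision tensors):

# Rough GGUF size estimate from parameter count and bits per weight.
# The bits-per-weight values are approximate averages for each format.
params = 7e9
bits_per_weight = {"F16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85}
for name, bits in bits_per_weight.items():
    print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")
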
# List all options
python scripts/export_gguf.py --list-quantizations
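
To confirm which quantization actually landed in the exported file, you can read its metadata. A minimal sketch, assuming the gguf helper package from the llama.cpp project (pip install gguf); the path is a placeholder:

# Inspect GGUF metadata; the general.file_type key records the quantization.
from gguf import GGUFReader

reader = GGUFReader("my_model.gguf")
for name in reader.fields:
    if name.startswith("general."):
        print(name)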

Using with Ollama

  1. Create a Modelfile:
FROM ./my_model.gguf
SYSTEM "You are an expert Windows systems programmer."
PARAMETER temperature 0.7
  2. Create and run:
ollama create mymodel -f Modelfile
ollama run mymodel
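
Once the model is created, Ollama also serves it over a local HTTP API (port 11434 by default), so you can call it programmatically. A minimal sketch using the requests package; the model name and prompt are placeholders:

# Query the locally served model via Ollama's /api/generate endpoint.
# Assumes the Ollama daemon is running on the default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mymodel",
        "prompt": "Write a minimal Win32 window procedure in C.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])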

Complete Workflow

# Train
halo-forge raft train \
  --prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
  --model Qwen/Qwen2.5-Coder-1.5B \
  --verifier mingw \
  --cycles 6 \
  --output models/windows_raft

# Export
python scripts/export_gguf.py \
  --model models/windows_raft/final_model \
  --output windows_coder.Q4_K_M.gguf

# Deploy
ollama create windows-coder -f Modelfile
ollama run windows-coder
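
The deploy step expects a Modelfile next to the exported file; mirroring the example above, it might look like:

FROM ./windows_coder.Q4_K_M.gguf
SYSTEM "You are an expert Windows systems programmer."
PARAMETER temperature 0.7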

See Also