# GGUF Export
Export fine-tuned models to GGUF format for llama.cpp, Ollama, and local inference.
## Why GGUF?
- Fast inference - Optimized for CPU and GPU
- Small files - Quantization reduces size 2-4x
- No Python needed - Deploy without Python runtime
- Ollama compatible - Easy model serving
## Quick Start
```bash
# Install dependency
pip install llama-cpp-python

# Convert your model
python scripts/export_gguf.py \
    --model models/my_finetuned \
    --output my_model.gguf
```
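Once the export finishes, you can sanity-check the GGUF directly from Python using the llama-cpp-python package installed above. A minimal sketch (the model path and prompt are placeholders; adjust to your file):

```python
from llama_cpp import Llama

# Load the exported GGUF file.
llm = Llama(model_path="my_model.gguf", n_ctx=2048)

# Run a single completion as a smoke test of the export.
result = llm("Write a C function that reverses a string in place.", max_tokens=128)
print(result["choices"][0]["text"])
```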
## Installation
### CPU-only (simplest)

```bash
pip install llama-cpp-python
```
### AMD GPU (ROCm)

```bash
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
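To verify that the rebuilt wheel can actually use the GPU, a quick check (a sketch; `llama_supports_gpu_offload` is exposed by recent llama-cpp-python releases, and `n_gpu_layers=-1` requests offloading every layer):

```python
import llama_cpp
from llama_cpp import Llama

# Prints True if the wheel was compiled with a GPU backend (hipBLAS here).
print(llama_cpp.llama_supports_gpu_offload())

# Offload all layers to the GPU when loading a model.
llm = Llama(model_path="my_model.gguf", n_ctx=2048, n_gpu_layers=-1)
```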
### Clone llama.cpp (most control)

```bash
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && make
```
## Quantization Options
| Type | Size (7B model) | Quality | Description |
|---|---|---|---|
| Q4_K_M | ~4 GB | Good | Recommended default |
| Q8_0 | ~7 GB | Best | Higher quality, larger file |
| F16 | ~14 GB | Original | No quantization |
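The sizes above are rough estimates: file size is approximately parameter count times bits per weight. A sketch of the arithmetic (the bits-per-weight values are approximate averages for these quant types):

```python
# Approximate average bits per weight for common llama.cpp quant types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Rough GGUF size in GB, ignoring metadata and unquantized tensors."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"7B model at {quant}: ~{estimate_gguf_gb(7e9, quant):.1f} GB")
```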
```bash
# List all options
python scripts/export_gguf.py --list-quantizations
```
## Using with Ollama
- Create a `Modelfile`:

  ```
  FROM ./my_model.gguf
  SYSTEM "You are an expert Windows systems programmer."
  PARAMETER temperature 0.7
  ```

- Create and run:

  ```bash
  ollama create mymodel -f Modelfile
  ollama run mymodel
  ```
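Besides the interactive `ollama run` session, the served model can be queried over Ollama's local HTTP API (default port 11434). A minimal sketch using only the standard library; the prompt is just an example:

```python
import json
import urllib.request

payload = {
    "model": "mymodel",
    "prompt": "Explain what CreateFileW does in one paragraph.",
    "stream": False,  # return one JSON object instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```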
## Complete Workflow
```bash
# Train
halo-forge raft train \
    --prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
    --model Qwen/Qwen2.5-Coder-1.5B \
    --verifier mingw \
    --cycles 6 \
    --output models/windows_raft

# Export
python scripts/export_gguf.py \
    --model models/windows_raft/final_model \
    --output windows_coder.Q4_K_M.gguf

# Deploy (the Modelfile's FROM line should point at windows_coder.Q4_K_M.gguf)
ollama create windows-coder -f Modelfile
ollama run windows-coder
```