# GGUF Export
Export fine-tuned models to GGUF format for llama.cpp, Ollama, and local inference.
## Why GGUF?
- Fast inference - Optimized for CPU and GPU
- Small files - Quantization reduces size 2-4x
- No Python needed - Deploy without Python runtime
- Ollama compatible - Easy model serving
## Quick Start
```bash
# Install dependency
pip install llama-cpp-python

# Convert your model
python scripts/export_gguf.py \
    --model models/my_finetuned \
    --output my_model.gguf
```
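Once the export finishes, you can sanity-check the GGUF directly from Python using the llama-cpp-python package installed above. A minimal sketch (the model path and prompt are placeholders; adjust to your file):

```python
from llama_cpp import Llama

# Load the exported GGUF file.
llm = Llama(model_path="my_model.gguf", n_ctx=2048)

# Run a single completion as a smoke test of the export.
result = llm("Write a C function that reverses a string in place.", max_tokens=128)
print(result["choices"][0]["text"])
```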
## Installation
### CPU-only (simplest)

```bash
pip install llama-cpp-python
```
### AMD GPU (ROCm)

```bash
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
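To verify that the rebuilt wheel can actually use the GPU, a quick check (a sketch; `llama_supports_gpu_offload` is exposed by recent llama-cpp-python releases, and `n_gpu_layers=-1` requests offloading every layer):

```python
import llama_cpp
from llama_cpp import Llama

# Prints True if the wheel was compiled with a GPU backend (hipBLAS here).
print(llama_cpp.llama_supports_gpu_offload())

# Offload all layers to the GPU when loading a model.
llm = Llama(model_path="my_model.gguf", n_ctx=2048, n_gpu_layers=-1)
```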
### Clone llama.cpp (most control)

```bash
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && make
```
## Quantization Options
| Type | Size (7B model) | Quality | Description |
|---|---|---|---|
| Q4_K_M | ~4 GB | Good | Recommended default |
| Q8_0 | ~7 GB | Best | Higher quality, larger file |
| F16 | ~14 GB | Original | No quantization |
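The sizes above are rough estimates: file size is approximately parameter count times bits per weight. A sketch of the arithmetic (the bits-per-weight values are approximate averages for these quant types):

```python
# Approximate average bits per weight for common llama.cpp quant types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def estimate_gguf_gb(n_params: float, quant: str) -> float:
    """Rough GGUF size in GB, ignoring metadata and unquantized tensors."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"7B model at {quant}: ~{estimate_gguf_gb(7e9, quant):.1f} GB")
```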
```bash
# List all options
python scripts/export_gguf.py --list-quantizations
```
## Using with Ollama
- Create a `Modelfile`:

  ```
  FROM ./my_model.gguf
  SYSTEM "You are an expert Windows systems programmer."
  PARAMETER temperature 0.7
  ```

- Create and run:

  ```bash
  ollama create mymodel -f Modelfile
  ollama run mymodel
  ```
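Besides the interactive `ollama run` session, the served model can be queried over Ollama's local HTTP API (default port 11434). A minimal sketch using only the standard library; the prompt is just an example:

```python
import json
import urllib.request

payload = {
    "model": "mymodel",
    "prompt": "Explain what CreateFileW does in one paragraph.",
    "stream": False,  # return one JSON object instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```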
## Complete Workflow
```bash
# Train
halo-forge raft train \
    --prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
    --model Qwen/Qwen2.5-Coder-1.5B \
    --verifier mingw \
    --cycles 6 \
    --output models/windows_raft

# Export
python scripts/export_gguf.py \
    --model models/windows_raft/final_model \
    --output windows_coder.Q4_K_M.gguf

# Deploy (the Modelfile's FROM line should point at windows_coder.Q4_K_M.gguf)
ollama create windows-coder -f Modelfile
ollama run windows-coder
```