GGUF Export Testing Guide

Step-by-step guide to testing GGUF model export for llama.cpp and Ollama deployment.

Prerequisites

Option 1: llama-cpp-python

# CPU-only (simplest)
pip install llama-cpp-python

# AMD GPU (ROCm)
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall

# NVIDIA GPU (CUDA)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall
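
To confirm the Python bindings installed correctly, a quick import check like the sketch below can help. The GPU-offload probe is guarded with getattr because the low-level symbol may not exist in older releases of llama-cpp-python.

import llama_cpp

# Bindings import cleanly and report a version
print("llama-cpp-python version:", llama_cpp.__version__)

# GPU offload support (only present in recent versions of the low-level API)
supports_gpu = getattr(llama_cpp, "llama_supports_gpu_offload", None)
if supports_gpu is not None:
    print("GPU offload compiled in:", bool(supports_gpu()))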

Option 2: llama.cpp Clone

git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp

# Build (CPU)
make

# Build (ROCm)
make GGML_HIPBLAS=1

# Build (CUDA)
make GGML_CUDA=1

Optional: Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Quick Test

Test the export script with a small model:

# List available quantization types
python scripts/export_gguf.py --list-quantizations

# Expected output:
# Available quantization types:
#   Q4_K_M - 4-bit quantization, medium (recommended)
#   Q4_K_S - 4-bit quantization, small
#   Q8_0   - 8-bit quantization
#   F16    - 16-bit float (no quantization)
#   F32    - 32-bit float (full precision)

Step-by-Step Testing

1. Export a Small Model

# Export Qwen 0.5B with Q4_K_M quantization
python scripts/export_gguf.py \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --output qwen_0.5b.Q4_K_M.gguf \
  --quantization Q4_K_M

# Expected output:
# Loading model from Qwen/Qwen2.5-Coder-0.5B...
# Converting to GGUF format...
# Applying Q4_K_M quantization...
# Saved to qwen_0.5b.Q4_K_M.gguf
# File size: ~XXX MB
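
If you want to see (or reproduce) what a GGUF export typically does under the hood, llama.cpp's standard flow is convert-then-quantize. The sketch below is a general illustration of that flow, not the internals of scripts/export_gguf.py; it assumes a llama.cpp checkout at ~/llama.cpp with convert_hf_to_gguf.py present, the llama-quantize binary built, and the Hugging Face checkpoint already downloaded to a local directory.

import subprocess
from pathlib import Path

LLAMA_CPP = Path.home() / "llama.cpp"          # assumed checkout location
MODEL_DIR = "models/Qwen2.5-Coder-0.5B"        # local dir with config.json + weights
F16_GGUF = "qwen_0.5b.f16-unquantized.gguf"
Q4_GGUF = "qwen_0.5b.Q4_K_M.gguf"

# Step 1: convert the Hugging Face checkpoint to an unquantized GGUF file
subprocess.run(
    ["python", str(LLAMA_CPP / "convert_hf_to_gguf.py"), MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the F16 GGUF down to Q4_K_M
subprocess.run(
    [str(LLAMA_CPP / "llama-quantize"), F16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)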

2. Verify Export

# Check file was created
ls -lh qwen_0.5b.Q4_K_M.gguf

# Expected: ~300-400 MB for 0.5B model with Q4_K_M

# Verify file format
file qwen_0.5b.Q4_K_M.gguf
# Expected: data (GGUF uses custom binary format)
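
Beyond file size, you can inspect the GGUF header directly: the first four bytes are the ASCII magic "GGUF", followed by a little-endian uint32 format version. A minimal check:

import struct

def check_gguf_header(path: str) -> None:
    with open(path, "rb") as f:
        magic = f.read(4)
        version = struct.unpack("<I", f.read(4))[0]
    if magic != b"GGUF":
        raise ValueError(f"{path}: not a GGUF file (magic={magic!r})")
    print(f"{path}: valid GGUF header, format version {version}")

check_gguf_header("qwen_0.5b.Q4_K_M.gguf")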

3. Test with llama.cpp

# If using llama.cpp clone
~/llama.cpp/llama-cli \
  -m qwen_0.5b.Q4_K_M.gguf \
  -p "Write a Python function to sort a list:" \
  -n 100

# Expected: Model loads and generates code
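
If you installed the Python bindings instead of (or alongside) the llama.cpp clone, an equivalent smoke test can be run from Python. Parameters such as n_ctx are illustrative defaults, not values required by the export script.

from llama_cpp import Llama

# Load the exported model (add n_gpu_layers=-1 to offload all layers on GPU builds)
llm = Llama(model_path="qwen_0.5b.Q4_K_M.gguf", n_ctx=2048, verbose=False)

out = llm("Write a Python function to sort a list:", max_tokens=100)
print(out["choices"][0]["text"])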

4. Test with Ollama

Create a Modelfile:

cat > Modelfile << 'EOF'
FROM ./qwen_0.5b.Q4_K_M.gguf

SYSTEM "You are an expert programmer."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF

Import and run:

# Create model in Ollama
ollama create qwen-test -f Modelfile

# Run interactive chat
ollama run qwen-test

# Test with a prompt
ollama run qwen-test "Write hello world in Python"

# Cleanup when done
ollama rm qwen-test

Quantization Comparison Test

Compare different quantization levels:

# Export with different quantizations
python scripts/export_gguf.py \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --output qwen_0.5b.Q8_0.gguf \
  --quantization Q8_0

python scripts/export_gguf.py \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --output qwen_0.5b.F16.gguf \
  --quantization F16

# Compare file sizes
ls -lh qwen_0.5b.*.gguf

# Expected sizes (approximate for 0.5B model):
# Q4_K_M: ~300 MB
# Q8_0:   ~500 MB
# F16:    ~1 GB
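
To tabulate the sizes programmatically (for example in a CI check), a short sketch:

import glob
import os

for path in sorted(glob.glob("qwen_0.5b.*.gguf")):
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:,.0f} MB")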

End-to-End Workflow Test

Test the complete training-to-deployment pipeline:

Step 1: Train a Model

# Quick RAFT training (minimal cycles)
halo-forge raft train \
  --prompts datasets/windows_curriculum/windows_systems_rlvr.jsonl \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --verifier mingw \
  --cycles 2 \
  --samples-per-prompt 2 \
  --output models/test_raft

Step 2: Export to GGUF

python scripts/export_gguf.py \
  --model models/test_raft/cycle_2_final \
  --output test_raft.Q4_K_M.gguf \
  --quantization Q4_K_M

Step 3: Validate with llama.cpp

# Test inference
~/llama.cpp/llama-cli \
  -m test_raft.Q4_K_M.gguf \
  -p "Write a Windows API call to create a process:" \
  -n 200

Step 4: Deploy with Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./test_raft.Q4_K_M.gguf
SYSTEM "You are an expert Windows systems programmer specializing in low-level C code."
PARAMETER temperature 0.3
EOF

# Deploy
ollama create windows-coder -f Modelfile
ollama run windows-coder "Create a DLL with DllMain"

Validation Checklist

After testing, verify:

  • --list-quantizations shows all options
  • Export completes without errors
  • Output file has expected size
  • llama.cpp can load and run the model
  • Ollama can create and run the model
  • Generated output is coherent
  • Fine-tuned knowledge is preserved
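
Several of these checks can be scripted. The sketch below automates the mechanical ones; the file name and size bounds are illustrative assumptions for the 0.5B Q4_K_M export, and coherence still needs a human eye.

import os

from llama_cpp import Llama

GGUF_PATH = "qwen_0.5b.Q4_K_M.gguf"    # adjust to your export

# Output file exists and has a plausible size for a 0.5B Q4_K_M export
size_mb = os.path.getsize(GGUF_PATH) / (1024 * 1024)
assert 200 < size_mb < 600, f"unexpected file size: {size_mb:.0f} MB"

# Header carries the GGUF magic bytes
with open(GGUF_PATH, "rb") as f:
    assert f.read(4) == b"GGUF", "missing GGUF magic"

# Model loads and produces non-empty output
llm = Llama(model_path=GGUF_PATH, n_ctx=1024, verbose=False)
text = llm("Write hello world in Python", max_tokens=64)["choices"][0]["text"]
assert text.strip(), "model produced empty output"
print("Automated checks passed; review output coherence manually.")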

Troubleshooting

“llama-cpp-python not installed”

Install with:

pip install llama-cpp-python

For GPU support, add the appropriate CMAKE_ARGS.

“conversion failed” or “unsupported architecture”

Some model architectures require the llama.cpp clone:

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && make

# Use --llama-cpp-path flag
python scripts/export_gguf.py \
  --model your_model \
  --output model.gguf \
  --llama-cpp-path ~/llama.cpp

Ollama “unsupported model format”

Ensure the GGUF file was created correctly:

# Check file header
xxd qwen_0.5b.Q4_K_M.gguf | head -1
# Should start with: 47475546 (GGUF magic bytes)

If invalid, re-export with a different method or quantization.

Model produces garbage output

  • Try a higher quality quantization (Q8_0 instead of Q4_K_M)
  • Verify the source model works before export (see the sanity-check sketch below)
  • Check if the model architecture is fully supported
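
A quick way to rule out a bad source checkpoint (as opposed to a quantization problem) is to run the same prompt through the original Hugging Face model before export. A sketch using transformers; swap in your fine-tuned checkpoint directory as needed:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-0.5B"   # or your fine-tuned checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a Python function to sort a list:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# If this output is coherent but the GGUF output is not, suspect the
# quantization level or an unsupported architecture detail.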

Export takes too long

  • Use a smaller model for testing (0.5B or 1.5B)
  • Ensure you have enough disk space (2x model size)
  • Close other memory-intensive applications

Size and Quality Reference

| Model Size | Q4_K_M  | Q8_0    | F16    |
|------------|---------|---------|--------|
| 0.5B       | ~300 MB | ~500 MB | ~1 GB  |
| 1.5B       | ~1 GB   | ~1.5 GB | ~3 GB  |
| 3B         | ~2 GB   | ~3 GB   | ~6 GB  |
| 7B         | ~4 GB   | ~7 GB   | ~14 GB |

Typical quality retention:

  • F16: 100% (baseline)
  • Q8_0: ~99%
  • Q4_K_M: ~95-97%
  • Q4_K_S: ~93-95%
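
The sizes in the table follow roughly from parameter count times bits per weight. The bits-per-weight figures below are rough approximations (Q4_K_M lands near 4.8 bpw, Q8_0 near 8.5 bpw, F16 is exactly 16), so treat the result as a ballpark rather than an exact prediction.

# Rough GGUF size estimate: parameters x bits-per-weight / 8, ignoring metadata
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}  # approximate values

def estimate_gguf_mb(params_billions: float, quant: str) -> float:
    bytes_total = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / (1024 * 1024)

for quant in BITS_PER_WEIGHT:
    print(f"0.5B at {quant}: ~{estimate_gguf_mb(0.5, quant):,.0f} MB")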

Next Steps