Inference Module Testing Guide

Step-by-step guide to testing the halo-forge inference optimization module.

Prerequisites

Required Dependencies

pip install torch transformers accelerate

Optional Dependencies

Dependency          Required For             Install
bitsandbytes        INT4/INT8 quantization   pip install bitsandbytes
llama-cpp-python    GGUF export              pip install llama-cpp-python
optimum             ONNX export              pip install optimum
onnxruntime         ONNX inference           pip install onnxruntime
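
To see which optional dependencies are present before running the tests, a quick check with importlib works. This is a convenience sketch (the script name is made up); note that llama-cpp-python is imported as llama_cpp.

# check_optional_deps.py -- report which optional dependencies are installed
import importlib.util

OPTIONAL = {
    "bitsandbytes": "INT4/INT8 quantization",
    "llama_cpp": "GGUF export (pip install llama-cpp-python)",
    "optimum": "ONNX export",
    "onnxruntime": "ONNX inference",
}

for module, purpose in OPTIONAL.items():
    status = "available" if importlib.util.find_spec(module) else "missing"
    print(f"{module:15s} {status:10s} {purpose}")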

Hardware Requirements

  • Minimum: 16 GB RAM, 8 GB VRAM
  • Recommended: 32 GB RAM, 24 GB VRAM (for 7B models)
  • ROCm Support: AMD Radeon RX 7900 series or higher

Unit Tests

Run the full inference test suite:

# From project root
cd /path/to/halo-forge

# Run all inference tests
pytest tests/test_inference.py -v

# Run specific test class
pytest tests/test_inference.py::TestOptimizationConfig -v
pytest tests/test_inference.py::TestInferenceOptimizer -v
pytest tests/test_inference.py::TestGGUFExporter -v

Expected Test Coverage

Test Class               Tests   Description
TestOptimizationConfig   3       Config defaults and validation
TestInferenceOptimizer   4       Optimizer initialization and checks
TestVerifier             3       InferenceOptimizationVerifier
TestQATConfig            2       Quantization-aware training config
TestQATTrainer           3       QAT training workflow
TestCalibration          3       Calibration dataset creation
TestGGUFExporter         3       GGUF export workflow
TestONNXExporter         3       ONNX export workflow
TestIntegration          2       End-to-end workflows
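
If you add tests of your own, follow the same class-per-component structure. A minimal sketch is shown below; the import path and config fields are illustrative assumptions, so check the halo-forge source for the real API before copying.

# tests/test_inference_extra.py -- example of extending the suite
# NOTE: the import path and field names below are illustrative, not the module's actual API.
import pytest

from halo_forge.inference import OptimizationConfig


class TestOptimizationConfigExtra:
    def test_default_precision(self):
        config = OptimizationConfig()
        assert config.target_precision in ("int4", "int8", "fp16")

    def test_invalid_precision_rejected(self):
        with pytest.raises(ValueError):
            OptimizationConfig(target_precision="int2")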

Dry Run Testing

Test commands without running actual optimization:

# Test CLI parsing and config validation
halo-forge inference optimize \
  --model Qwen/Qwen2.5-Coder-1.5B \
  --target-precision int4 \
  --output /tmp/test-output \
  --dry-run

# Expected output:
# Dry run mode enabled. Configuration validated successfully.
# Model: Qwen/Qwen2.5-Coder-1.5B
# Target precision: int4
# Output: /tmp/test-output

Manual Testing Steps

1. Test Quantization

# INT4 quantization (requires bitsandbytes)
halo-forge inference optimize \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --target-precision int4 \
  --output models/quantized_int4

# Verify output
ls -la models/quantized_int4/
# Should contain: config.json, model.safetensors, tokenizer.json
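
Beyond checking the files, confirm the quantized checkpoint actually loads and generates. A minimal smoke test, assuming the output directory is a standard Hugging Face checkpoint (it contains config.json, model.safetensors, and tokenizer.json, as listed above):

# smoke_test_quantized.py -- load the quantized model and generate a few tokens
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "models/quantized_int4"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))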

2. Test GGUF Export

Using the CLI:

halo-forge inference export \
  --model models/quantized_int4 \
  --format gguf \
  --quantization Q4_K_M \
  --output models/test.gguf

Using the standalone script:

# List available quantization types
python scripts/export_gguf.py --list-quantizations

# Export with Q4_K_M (recommended)
python scripts/export_gguf.py \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --output test_model.Q4_K_M.gguf \
  --quantization Q4_K_M

# Verify the output
file test_model.Q4_K_M.gguf
# Expected: test_model.Q4_K_M.gguf: data
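
For a stronger check than file, load the GGUF with llama-cpp-python and run one short completion. This assumes the export produced a self-contained GGUF in the current directory:

# smoke_test_gguf.py -- load the exported GGUF and run one completion
from llama_cpp import Llama

llm = Llama(model_path="test_model.Q4_K_M.gguf", n_ctx=512, verbose=False)
result = llm("def fibonacci(n):", max_tokens=32)
print(result["choices"][0]["text"])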

3. Test ONNX Export

halo-forge inference export \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --format onnx \
  --output models/test_onnx/

# Verify output
ls models/test_onnx/
# Should contain: model.onnx, config.json
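
To confirm the exported graph is loadable, open it with onnxruntime. This only checks that the graph parses and lists its inputs/outputs; running the causal LM end to end is easier through optimum's ORTModelForCausalLM.

# smoke_test_onnx.py -- verify the exported ONNX graph loads
import onnxruntime as ort

session = ort.InferenceSession("models/test_onnx/model.onnx")
print("Inputs: ", [i.name for i in session.get_inputs()])
print("Outputs:", [o.name for o in session.get_outputs()])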

4. Test Latency Benchmarking

halo-forge inference benchmark \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --num-prompts 10
  
# Expected output:
# Benchmarking Qwen/Qwen2.5-Coder-0.5B...
# Average latency: XX.X ms
# P95 latency: XX.X ms
# Throughput: XX.X tokens/sec
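
If you want a second opinion on the numbers, a rough manual benchmark with transformers looks like the sketch below. It measures whole-generation wall time per prompt, which may not match the CLI's latency definition exactly.

# manual_latency_check.py -- rough latency/throughput measurement
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

max_new_tokens = 64
prompts = ["def quicksort(arr):"] * 10
latencies = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    latencies.append(time.perf_counter() - start)

latencies.sort()
p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]  # nearest-rank P95
print(f"Average latency: {1000 * sum(latencies) / len(latencies):.1f} ms")
print(f"P95 latency:     {1000 * p95:.1f} ms")
print(f"Throughput:      {max_new_tokens * len(prompts) / sum(latencies):.1f} tokens/sec")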

Validation Checklist

After running tests, verify:

  • All unit tests pass (pytest tests/test_inference.py)
  • Dry run mode works without errors
  • Quantized models load and generate text
  • GGUF export produces valid files
  • ONNX export produces valid files
  • Benchmark completes without errors

Troubleshooting

“bitsandbytes not installed”

This is a warning, not an error. Install if you need quantization:

pip install bitsandbytes

“llama-cpp-python not installed”

Required only for GGUF export. Install with:

# CPU-only
pip install llama-cpp-python

# AMD GPU (ROCm)
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall

GGUF export fails

If llama-cpp-python doesn’t work, clone llama.cpp directly:

git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && make

# Then export
python scripts/export_gguf.py \
  --model your_model \
  --output model.gguf \
  --llama-cpp-path ~/llama.cpp

Out of Memory (OOM)

  • Use smaller models (0.5B or 1.5B) for testing
  • Set --target-precision int4 to reduce memory
  • Add --device cpu if GPU memory is insufficient

ROCm-specific Issues

Set environment variables if you encounter HIP errors:

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export HSA_OVERRIDE_GFX_VERSION=11.5.1  # For Strix Halo

Or use the experimental attention flag:

halo-forge inference optimize \
  --model your_model \
  --experimental-attention \
  ...
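
To confirm PyTorch actually sees the ROCm device before digging further (ROCm builds expose the GPU through the CUDA API), a quick check:

# check_rocm.py -- confirm the ROCm build of PyTorch can see the GPU
import torch

print("HIP/ROCm version:", torch.version.hip)          # None on CUDA/CPU-only builds
print("GPU visible:     ", torch.cuda.is_available())  # ROCm devices appear via the CUDA API
if torch.cuda.is_available():
    print("Device name:     ", torch.cuda.get_device_name(0))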

Next Steps