Testing Guide

Reasoning Module Testing

Validate your reasoning training setup.

Prerequisites

Ensure you’re in the halo-forge toolbox:

toolbox enter halo-forge

Quick Validation

1. Check Dependencies

python -c "import sympy; print(f'SymPy: {sympy.__version__}')"

Expected output: SymPy: 1.12 or higher

2. Test Imports

from halo_forge.reasoning import MathVerifier, ReasoningRAFTTrainer
from halo_forge.reasoning.data import GSM8KLoader, MATHLoader
from halo_forge.reasoning.verifiers import AnswerExtractor

print("All imports successful!")

3. Test MathVerifier

from halo_forge.reasoning import MathVerifier

verifier = MathVerifier()

# Test numeric verification
result = verifier.verify(
    prompt="What is 2 + 2?",
    completion="Let me calculate: 2 + 2 = 4. \\boxed{4}",
    expected_answer="4"
)

print(f"Success: {result.success}")
print(f"Reward: {result.reward}")
# Expected: Success: True, Reward: 1.0
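
It is also worth exercising the failure path. A quick negative check, reusing the verifier above (the exact reward for a wrong answer depends on the partial-credit rules described later, so only the success flag is checked here):

result = verifier.verify(
    prompt="What is 2 + 2?",
    completion="\\boxed{5}",
    expected_answer="4"
)

print(f"Success: {result.success}")  # Expected: False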

4. Test AnswerExtractor

from halo_forge.reasoning.verifiers import AnswerExtractor

extractor = AnswerExtractor()

# Test boxed format
answer = extractor.extract("After calculation, \\boxed{42}")
print(f"Boxed: {answer}")  # Expected: 42

# Test "answer is" format
answer = extractor.extract("The answer is 3.14159")
print(f"Text: {answer}")  # Expected: 3.14159

Run Unit Tests

pytest tests/test_reasoning.py -v

Expected: All tests passing
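
To iterate on one area, pytest's -k flag filters tests by keyword (the keyword below is illustrative; match it to the actual test names in tests/test_reasoning.py):

pytest tests/test_reasoning.py -k "verifier" -v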

CLI Testing

List Datasets

halo-forge reasoning datasets

Expected output:

Available Math/Reasoning Datasets
============================================================
  gsm8k        [Grade School] - 8.5K problems, 2-8 step solutions
  math         [Competition ] - 12.5K problems, 7 subjects, 5 levels
  aime         [Competition ] - AIME problems (hard)

Benchmark (Dry Run)

Test without downloading the full dataset:

halo-forge reasoning benchmark \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --dataset gsm8k \
  --limit 5 \
  --dry-run

Full Benchmark

Run an actual benchmark (downloads model and dataset):

halo-forge reasoning benchmark \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --dataset gsm8k \
  --limit 20 \
  --output results/reasoning_test.json
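
To inspect the saved results afterward, a minimal sketch (the JSON schema is not documented here, so list the top-level structure before relying on specific fields):

import json

with open("results/reasoning_test.json") as f:
    results = json.load(f)

# Schema is undocumented here; check the top level first
print(list(results) if isinstance(results, dict) else f"{len(results)} records")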

Verification Strategies

The MathVerifier checks a model's answer against the expected answer using multiple strategies:

Numeric Match

# These all match "4"
"4"
"4.0"
"4.00"

Symbolic Match (via SymPy)

# These are symbolically equivalent
"x^2 + 2x + 1"
"(x + 1)^2"

Partial Credit

If the answer is wrong but reasoning steps are shown, partial credit (0.2) is awarded.
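
A quick check of this behavior, reusing the verifier from step 3 (whether a given completion counts as showing reasoning is up to the verifier's heuristics, so treat the 0.2 as the expected outcome rather than a guarantee):

result = verifier.verify(
    prompt="What is 2 + 2?",
    completion="First, 2 + 2 = 5, so the answer is 5. \\boxed{5}",
    expected_answer="4"
)

print(f"Reward: {result.reward}")  # Expected: 0.2 (wrong answer, reasoning shown)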

Common Issues

SymPy Not Installed

ImportError: No module named 'sympy'

Fix: pip install "sympy>=1.12" (quote the requirement so the shell does not treat >= as a redirect)

Answer Not Extracted

If verification fails with “Could not extract answer”:

  • Ensure the model outputs answers in \boxed{answer} format
  • Or that it uses the “The answer is X” pattern (see the prompt sketch below)
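
One way to encourage an extractable format is an explicit instruction in the prompt. The wording below is illustrative, not a template shipped with halo-forge:

SYSTEM_PROMPT = (
    "Solve the problem step by step. "
    "End your response with the final answer as \\boxed{answer}."
)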

Low Accuracy on MATH

The MATH dataset is very challenging. Try:

  1. Start with GSM8K
  2. Use a stronger base model (7B+)
  3. Increase training cycles