# Testing the Agentic Module

Step-by-step guide to validating the tool calling training pipeline.
## Prerequisites

```bash
# Enter the toolbox
toolbox enter halo-forge

# Verify the module is available
python -c "from halo_forge.agentic import ToolCallingVerifier; print('OK')"
```
## Level 1: Unit Tests

Run the agentic unit tests:

```bash
python -m pytest tests/test_agentic.py -v
```

Expected: 32 tests passing.
### Key Tests

| Test | Description |
|---|---|
| `test_verify_correct_call` | Correct function + args → reward 0.75 |
| `test_verify_wrong_function` | Valid JSON, wrong function → reward 0.25 |
| `test_verify_false_positive` | Called when it shouldn't → reward -0.25 |
| `test_extract_tool_calls` | Parse `<tool_call>` blocks from output |
| `test_filter_completions` | Top-K% filtering logic |
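
`test_filter_completions` exercises the top-K% filtering used by the RAFT-style training loop. A minimal sketch of the idea (hypothetical helper; the module's actual implementation may differ):

```python
# Hypothetical sketch of the top-K% idea behind test_filter_completions:
# rank completions by reward and keep the best slice.
def filter_top_k_percent(completions, rewards, keep_percent=25.0):
    """Keep the top keep_percent% of completions by reward."""
    ranked = sorted(zip(completions, rewards), key=lambda pair: pair[1], reverse=True)
    keep_n = max(1, int(len(ranked) * keep_percent / 100))
    return [completion for completion, _ in ranked[:keep_n]]

completions = ["call A", "call B", "call C", "call D"]
rewards = [0.75, -0.25, 0.25, 0.75]
print(filter_top_k_percent(completions, rewards, keep_percent=50.0))
# ['call A', 'call D']
```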
## Level 2: CLI Validation

### List Datasets

```bash
halo-forge agentic datasets
```

Expected output:

```text
Available Agentic / Tool Calling Datasets
============================================================
xlam       [Tool Calling] - 60k verified, 3,673 APIs
glaive     [Tool Calling] - 113k samples, irrelevance
toolbench  [Tool Calling] - 188k samples, 16k APIs
```
### Dry Run Training

```bash
halo-forge agentic train --dry-run
```

Expected:

```text
Agentic / Tool Calling RAFT Training
============================================================
Model: Qwen/Qwen2.5-7B-Instruct
Dataset: xlam
...
Configuration valid!
```
## Level 3: Verifier Testing

Test the `ToolCallingVerifier` directly:

```python
from halo_forge.agentic import ToolCallingVerifier

verifier = ToolCallingVerifier()

# Test correct call
output = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
expected = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
result = verifier.verify(output, expected_calls=expected)
print(f"Success: {result.success}")  # True
print(f"Reward: {result.reward}")    # 0.75

# Test false positive
result = verifier.verify(output, expected_calls=[], is_irrelevant=True)
print(f"Reward: {result.reward}")  # -0.25
```
## Level 4: Data Loading

Test dataset loading (requires network):

```python
from halo_forge.agentic.data import XLAMLoader

loader = XLAMLoader()
samples = loader.load(limit=10)
print(f"Loaded: {len(samples)} samples")

for s in samples[:2]:
    print(f"  Query: {s.messages[0]['content'][:50]}...")
    print(f"  Tools: {len(s.tools)}")
```
## Level 5: Benchmark Run

Run a small benchmark (requires GPU):

```bash
halo-forge agentic benchmark \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset xlam \
    --limit 20 \
    --output results/agentic_benchmark.json
```

Expected metrics:

- `accuracy`: overall correct rate
- `json_valid_rate`: valid JSON output rate
- `function_accuracy`: correct function selection rate
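
A quick way to read the results programmatically (a sketch that assumes the metrics above appear as top-level keys in the output JSON; adjust to the file's actual layout):

```python
import json

with open("results/agentic_benchmark.json") as f:
    results = json.load(f)

# Assumed top-level keys; check the actual file layout.
for key in ("accuracy", "json_valid_rate", "function_accuracy"):
    print(f"{key}: {results.get(key)}")
```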
## Level 6: Training Run

Run a short training (requires GPU, ~2-4 hours):

```bash
halo-forge agentic train \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset xlam \
    --limit 500 \
    --cycles 2 \
    --output models/agentic_test
```

### Monitor with TensorBoard

```bash
tensorboard --logdir models/agentic_test/tensorboard
```

Metrics to watch:

- `success_rate` - should increase over cycles
- `avg_reward` - should trend upward
- `kept_samples` - high-quality samples kept per cycle
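
To pull these scalars without the UI, TensorBoard's `EventAccumulator` can read the event files directly. The tag names below are assumed to match the metric names above; list `acc.Tags()` to confirm:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("models/agentic_test/tensorboard")
acc.Reload()  # parse the event files

# Tag names are assumed to match the metrics above.
for tag in ("success_rate", "avg_reward", "kept_samples"):
    if tag in acc.Tags().get("scalars", []):
        last = acc.Scalars(tag)[-1]
        print(f"{tag}: {last.value:.3f} at step {last.step}")
```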
## Troubleshooting

### "No tool call found in output"

The model isn't generating `<tool_call>` tags. Check:

- Is the prompt in the correct Hermes format? (see the sketch below)
- Is the model instruction-tuned?
- Try a different model (Qwen2.5 recommended)
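
For reference, a sketch of a Hermes-style tool-calling prompt; the exact system wording varies by model, so check the model card:

```python
import json

# Hypothetical tool schema for illustration.
tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# Hermes-style layout: schemas inside <tools> tags in the system prompt,
# with the model expected to reply inside <tool_call> tags.
system = (
    "You may call functions to assist with the user query. "
    f"Available function signatures: <tools>{json.dumps(tools)}</tools> "
    "For each call, return a JSON object with the function name and arguments "
    "inside <tool_call></tool_call> tags."
)
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "What's the weather in Paris?"},
]
```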
### Low accuracy on irrelevance detection

The model calls tools when it shouldn't. Solutions:

- Include more irrelevance samples in the training data (15%+)
- Use the Glaive dataset (7,500 irrelevance examples)
### JSON syntax errors

Small models make JSON mistakes. Solutions:

- Use constrained decoding with a GBNF grammar
- Use temperature 0.0 in production
- Consider a larger model (7B+)
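
A small standard-library helper can surface these errors before outputs reach the verifier (a hypothetical sketch, not part of the module):

```python
import json
import re

def check_tool_call_json(output):
    """Return a JSON syntax error message per <tool_call> block, if any."""
    errors = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", output, re.DOTALL):
        try:
            json.loads(block)
        except json.JSONDecodeError as e:
            errors.append(f"{e.msg} at char {e.pos}: {block[:40]!r}")
    return errors

bad = '<tool_call>{"name": "get_weather", "arguments": {"city": Paris}}</tool_call>'
print(check_tool_call_json(bad))  # one message per malformed block
```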
## Validation Checklist

| Test | Command | Expected |
|---|---|---|
| Unit tests | `pytest tests/test_agentic.py` | 32 passed |
| CLI datasets | `halo-forge agentic datasets` | Lists available datasets |
| Dry run | `halo-forge agentic train --dry-run` | Config valid |
| Verifier | Python test above | Rewards correct |
| Data load | Python test above | Samples load |
| Benchmark | `halo-forge agentic benchmark` | JSON output |