# Test Verifiers

Test verifiers check code by running tests against it, using either pytest or Python's built-in unittest.
## PytestVerifier

For Python code with pytest:

```python
from halo_forge.rlvr.verifiers import PytestVerifier

verifier = PytestVerifier()
result = verifier.verify(python_code_with_tests)
```
### Options

| Option | Default | Description |
|---|---|---|
| `test_file` | `None` | External test file to run |
| `extra_args` | `['-v', '--tb=short']` | Extra arguments passed to pytest |
| `timeout` | `60` | Test timeout in seconds |
| `max_workers` | `4` | Number of parallel test runs |
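These options are set at construction time. A sketch of a tuned configuration, assuming the options in the table above are accepted as keyword arguments (as `test_file` is in the External Tests example below):

```python
from halo_forge.rlvr.verifiers import PytestVerifier

# Assumed keyword arguments, mirroring the options table above.
verifier = PytestVerifier(
    extra_args=["-q", "--tb=line"],  # quieter output than the default
    timeout=120,                     # allow slower tests two minutes
    max_workers=2,                   # fewer parallel runs on small machines
)
```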
### Inline Tests

If the code contains its own tests:

```python
code = '''
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

def test_factorial():
    assert factorial(5) == 120
    assert factorial(0) == 1
'''

result = verifier.verify(code)
```
### External Tests

Run an external test file against the generated code:

```python
verifier = PytestVerifier(test_file="tests/test_solution.py")
result = verifier.verify(solution_code)
```
The external test file can import from the generated code.
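For instance, assuming the generated code is exposed to the tests as an importable module named `solution` (the module name here is an illustrative assumption, not confirmed by this page), the external test file might look like:

```python
# tests/test_solution.py
# The import target `solution` is an assumption about how the verifier
# exposes the generated code; adjust it to match your setup.
from solution import factorial

def test_factorial_basic():
    assert factorial(5) == 120

def test_factorial_edge_cases():
    assert factorial(0) == 1
    assert factorial(1) == 1
```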
## UnittestVerifier

For Python's built-in unittest:

```python
from halo_forge.rlvr.verifiers import UnittestVerifier

verifier = UnittestVerifier()
result = verifier.verify(code_with_unittest)
```
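A minimal example of such a string, bundling the implementation with its test case (whether the `unittest.main()` guard is required depends on how the verifier invokes the tests, so it is included here defensively):

```python
code_with_unittest = '''
import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_positive(self):
        self.assertEqual(add(2, 3), 5)

    def test_negative(self):
        self.assertEqual(add(-1, -1), -2)

if __name__ == "__main__":
    unittest.main()
'''

result = verifier.verify(code_with_unittest)
```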
### Options

| Option | Default | Description |
|---|---|---|
| `timeout` | `60` | Test timeout in seconds |
| `max_workers` | `4` | Number of parallel test runs |
## Rewards

Test verifiers use graduated rewards based on the fraction of tests that pass:

| Outcome | Reward |
|---|---|
| All tests pass | 1.0 |
| 75%+ pass | 0.75 |
| 50%+ pass | 0.5 |
| Some pass (below 50%) | 0.25 |
| None pass | 0.0 |
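A minimal sketch of the tiering implied by the table; the real logic lives inside the verifiers, and this standalone function is only illustrative:

```python
def graduated_reward(passed: int, total: int) -> float:
    """Map a test pass rate onto the reward tiers above (illustrative only)."""
    if total == 0 or passed == 0:
        return 0.0
    rate = passed / total
    if rate == 1.0:
        return 1.0
    if rate >= 0.75:
        return 0.75
    if rate >= 0.5:
        return 0.5
    return 0.25  # at least one test passes, but fewer than half
```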
## HumanEval/MBPP Verifiers

For standard benchmarks:

```python
from halo_forge.rlvr.verifiers import HumanEvalVerifier, MBPPVerifier

# HumanEval
verifier = HumanEvalVerifier("data/rlvr/humaneval_full.jsonl")
result = verifier.verify(code, task_id="HumanEval/0")

# MBPP
verifier = MBPPVerifier("data/rlvr/mbpp_train_full.jsonl")
result = verifier.verify(code, task_id="mbpp/1")
```
These verifiers:

- Load the test cases for the given `task_id` from the dataset
- Combine the generated code with those tests
- Run pytest and return the results (sketched below)
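Conceptually, those three steps amount to something like the sketch below. This is not the library's actual implementation (which also handles sandboxing, rewards, and parallelism); it only illustrates the combine-and-run idea:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_combined(code: str, tests: str, timeout: int = 60) -> bool:
    """Illustrative only: write code + dataset tests to one file, run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "test_combined.py"
        test_file.write_text(code + "\n\n" + tests)
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", str(test_file), "-q"],
            capture_output=True,
            timeout=timeout,
        )
    return proc.returncode == 0
```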
## Batch Verification

Verify many generated samples at once:

```python
verifier = PytestVerifier(max_workers=8)

codes = [code1, code2, code3, ...]
prompts = [prompt1, prompt2, prompt3, ...]

# Prompts help look up the correct test cases for each sample
results = verifier.verify_batch(codes, prompts=prompts)
```
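Downstream code can then aggregate the batch. The field access below is hypothetical; this page does not document the result object's attributes, so `reward` is an assumption:

```python
# Hypothetical aggregation: `r.reward` is an assumed attribute name.
rewards = [r.reward for r in results]
print(f"Mean reward over {len(results)} samples: {sum(rewards) / len(rewards):.3f}")
```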
## Best Practices

- **Set reasonable timeouts**: tests should not run forever
- **Use isolated environments**: prevent side effects between test runs
- **Include edge cases**: tests should cover corner cases
- **Clean up resources**: tests should not leave temp files behind (see the `tmp_path` example below)
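For the cleanup point in particular, pytest's built-in `tmp_path` fixture gives each test an isolated scratch directory that pytest removes for you, so tests never leave stray files:

```python
def test_writes_report(tmp_path):
    # tmp_path is a per-test pathlib.Path provided by pytest; the directory
    # is cleaned up automatically, so no temp files escape the test.
    report = tmp_path / "report.txt"
    report.write_text("ok")
    assert report.read_text() == "ok"
```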
## CLI Usage

```bash
# With the pytest verifier
halo-forge raft train \
    --checkpoint models/sft/final_model \
    --prompts data/prompts.jsonl \
    --verifier pytest \
    --cycles 5

# With the HumanEval verifier
halo-forge raft train \
    --checkpoint models/sft/final_model \
    --prompts data/humaneval_prompts.jsonl \
    --verifier humaneval \
    --cycles 5
```