Benchmarking

Evaluate model performance with pass@k metrics

Quick Start

halo-forge benchmark run \
  --model models/raft/cycle_5_final \
  --prompts data/test.jsonl \
  --verifier gcc \
  --samples 20

Metrics

Compile Rate

Percentage of samples that compile:

Compile Rate = Samples that compile / Total samples
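As a minimal sketch of the ratio above (the helper name is illustrative, not part of the halo-forge API):

```python
def compile_rate(compiled: int, total: int) -> float:
    """Fraction of generated samples that compile successfully."""
    return compiled / total if total else 0.0

print(compile_rate(11, 20))  # 0.55
```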

pass@k

Probability that at least one of k samples is correct:

Metric     Meaning
pass@1     Success with a single attempt
pass@5     Success within 5 attempts
pass@10    Success within 10 attempts

pass@k is computed using an unbiased estimator when samples > k.
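The unbiased estimator referred to above is the standard combinatorial one: for n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch (function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 20 samples, 5 correct: a single draw succeeds 25% of the time
print(pass_at_k(20, 5, 1))  # 0.25
```

Averaging this estimate over all prompts gives the reported pass@k, which is why sampling more than k completions per prompt (e.g. `--samples 20` with `--k 1,5,10`) reduces variance.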

Running Benchmarks

Single Model

halo-forge benchmark run \
  --model models/raft/cycle_3_final \
  --prompts data/test.jsonl \
  --verifier gcc \
  --samples 20 \
  --k 1,5,10 \
  --output results/benchmark.json

Compare Models

# Benchmark each stage
for stage in sft raft/cycle_1 raft/cycle_3 raft/cycle_5; do
  halo-forge benchmark run \
    --model models/$stage/final_model \
    --prompts data/test.jsonl \
    --output results/${stage}.json
done

# Compare results
halo-forge benchmark compare results/*.json

Demo Benchmark

Validate the pipeline with built-in prompts:

halo-forge benchmark full \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --cycles 2

Output Format

{
  "total": 100,
  "passed": 55,
  "pass_rate": 0.55,
  "pass_at_k": {
    "1": 0.553,
    "5": 0.784,
    "10": 0.891
  },
  "by_category": {
    "algorithms": {"total": 30, "passed": 18},
    "data_structures": {"total": 25, "passed": 15}
  },
  "timing": {
    "generation_time": 1234.5,
    "verification_time": 45.2
  }
}
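Since the output is plain JSON, results are easy to post-process. A sketch that prints the headline numbers, assuming only the field names shown in the example above:

```python
import json

def summarize(results: dict) -> str:
    """Render the headline numbers from a benchmark results dict."""
    lines = [f"pass rate: {results['pass_rate']:.1%}"]
    for k in sorted(results["pass_at_k"], key=int):
        lines.append(f"pass@{k}: {results['pass_at_k'][k]:.1%}")
    for name, cat in results["by_category"].items():
        lines.append(f"{name}: {cat['passed']}/{cat['total']}")
    return "\n".join(lines)

# e.g.:
# with open("results/benchmark.json") as f:
#     print(summarize(json.load(f)))
```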

Demo Results

Quick validation benchmarks (16 prompts, 2 cycles):

Model   Baseline   After 2 Cycles   Time
0.5B    32.0%      32.0%            41 min
1.5B    67.2%      67.2%            52 min
3B      97.7%      99.2%            79 min

Note: Demo benchmarks validate the pipeline works. Results from extended training runs will vary based on model, dataset, and configuration.

HumanEval Validation

3B model RAFT training on HumanEval subset (20 prompts, 3 cycles):

Cycle   Pass Rate   Kept   Loss    Grad Norm
1       21.2%       17     0.563   995.3
2       17.5%       14     0.570   194.0
3       22.5%       18     0.526   0.47

Key observations:

  • Gradient norm stabilized: 995.3 → 0.47 across cycles (good convergence)
  • Loss decreased overall: 0.563 → 0.526 (the model is learning)
  • Pass-rate variance across cycles is expected with a small dataset (20 prompts)

Total training time: ~54 minutes (18 min/cycle)

Hardware Metrics

With --hardware-metrics:

{
  "hardware": {
    "gpu_peak_memory_gb": 11.9,
    "gpu_power_avg_w": 70.8,
    "gpu_energy_wh": 11.42
  }
}