Graduated Rewards

Why partial credit matters for RLVR training

The Problem with Binary Rewards

Binary rewards (0 or 1) create sparse gradients:

Code with syntax error    → 0.0
Code that almost compiles → 0.0  # Same as complete failure!
Code that compiles        → 1.0

The model can’t distinguish between “terrible” and “almost there”.
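As a toy illustration (hypothetical code, not the halo-forge API), a binary reward collapses every failure mode to the same value:

def binary_reward(outcome: str) -> float:
    # Toy binary reward: anything short of a clean compile scores 0.0,
    # so a syntax error and a near-compile are indistinguishable.
    return 1.0 if outcome == "compiles" else 0.0

assert binary_reward("syntax error") == binary_reward("almost compiles")  # both 0.0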

Graduated Reward Scale

halo-forge uses a graduated scale:

Outcome                  Reward  Signal
Syntax error             0.0     Complete failure
Missing includes         0.1     Structural issues
Type errors              0.2     Logic issues
Compiles with warnings   0.3     Almost there
Compiles clean           0.5     Compilation success
Runs, crashes            0.6     Runtime issues
Runs, wrong output       0.7     Logic correct-ish
Runs, partial output     0.8     Nearly correct
Correct output           1.0     Full success

How It Works

from halo_forge.rlvr.verifiers import RewardLevel

# From compilation result
reward = RewardLevel.from_compile_result(
    success=True,
    has_warnings=True
)
# Returns 0.3 (compiles with warnings)

# From execution result
reward = RewardLevel.from_execution_result(
    compiles=True,
    runs=True,
    correct=False
)
# Returns 0.7 (runs, wrong output)
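Internally this amounts to a lookup from outcome to reward level. A minimal sketch of such a mapping (illustrative names only, not the actual halo_forge implementation):

from enum import Enum

class Level(float, Enum):
    # Mirrors the graduated scale above; the names are assumptions.
    SYNTAX_ERROR = 0.0
    MISSING_INCLUDES = 0.1
    TYPE_ERRORS = 0.2
    COMPILES_WITH_WARNINGS = 0.3
    COMPILES_CLEAN = 0.5
    RUNS_CRASHES = 0.6
    RUNS_WRONG_OUTPUT = 0.7
    RUNS_PARTIAL_OUTPUT = 0.8
    CORRECT_OUTPUT = 1.0

def from_execution_result(compiles: bool, runs: bool, correct: bool) -> float:
    # Walk the outcome ladder from worst to best.
    if not compiles:
        return Level.SYNTAX_ERROR        # compile-stage failures span 0.0-0.3
    if not runs:
        return Level.RUNS_CRASHES        # compiled, but crashed at runtime
    if not correct:
        return Level.RUNS_WRONG_OUTPUT   # ran to completion, wrong answer
    return Level.CORRECT_OUTPUT

assert from_execution_result(compiles=True, runs=True, correct=False) == 0.7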

Impact on Training

Binary Rewards

Cycle 1: 15% success → train on 15%
Cycle 2: 18% success → train on 18%
Cycle 3: 20% success → train on 20%

Limited learning signal.

Graduated Rewards

Cycle 1: 15% full success, but 40% above threshold
Cycle 2: 22% full success, 55% above threshold
Cycle 3: 30% full success, 65% above threshold

More samples contribute to learning.
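To make that concrete, here is a toy batch of scored samples (invented numbers, not from a real run) filtered both ways:

rewards = [0.0, 0.2, 0.3, 0.5, 0.5, 0.7, 1.0, 1.0]   # toy scores for 8 samples

binary_kept    = [r for r in rewards if r == 1.0]    # binary: full success only
graduated_kept = [r for r in rewards if r >= 0.5]    # graduated: threshold 0.5

print(len(binary_kept), len(graduated_kept))         # 2 vs. 5 samples to train on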

Threshold Selection

The reward_threshold parameter controls which samples are kept for training:

Threshold  Effect
0.3        Keep code with warnings
0.5        Keep clean compiles (recommended)
0.7        Keep code that runs
1.0        Only correct output

Start with 0.5 (clean compile) and raise the threshold as the model improves.
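One way to automate "raise as the model improves" (a hypothetical heuristic, not a halo-forge feature):

def next_threshold(current: float, full_success_rate: float) -> float:
    # Hypothetical schedule: once a quarter of samples reach full success,
    # tighten the bar one step on the graduated scale.
    steps = [0.3, 0.5, 0.7, 1.0]
    i = steps.index(current)
    if full_success_rate >= 0.25 and i < len(steps) - 1:
        return steps[i + 1]
    return current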

Curriculum Learning

Use graduated thresholds across cycles:

cycles:
  - reward_threshold: 0.3  # Early: accept warnings
  - reward_threshold: 0.5  # Mid: require clean compile
  - reward_threshold: 0.7  # Late: require execution

This naturally progresses from “make it compile” to “make it work”.
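The same schedule can be driven from code; in this sketch, run_cycle is a hypothetical stand-in for whatever entry point launches one training cycle:

for threshold in (0.3, 0.5, 0.7):
    # Each cycle tightens the bar: warnings -> clean compile -> execution.
    run_cycle(reward_threshold=threshold)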

Research Background

The graduated reward approach is inspired by:

  • Reward shaping in RL literature
  • Curriculum learning (Bengio et al., 2009)
  • Sparse reward problem in robotics RL

Key insight: intermediate rewards accelerate learning when the final goal is hard to reach directly.