# Graduated Rewards
Why partial credit matters for RLVR training
## The Problem with Binary Rewards
Binary rewards (0 or 1) create sparse gradients:
```
Code with syntax error    → 0.0
Code that almost compiles → 0.0   # Same as complete failure!
Code that compiles        → 1.0
```
The model can’t distinguish between “terrible” and “almost there”.
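A minimal sketch of the problem (the `binary_reward` function here is illustrative, not part of halo-forge): every failure collapses to the same score, so the update carries no information about how close the attempt was.

```python
def binary_reward(compiles: bool) -> float:
    """Illustrative binary reward: every failure looks identical."""
    return 1.0 if compiles else 0.0

# A syntax error and a near-miss with one missing include both score 0.0,
# so the policy update treats them as equally bad.
print(binary_reward(False))  # syntax error        -> 0.0
print(binary_reward(False))  # one missing include -> 0.0
print(binary_reward(True))   # compiles            -> 1.0
```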
## Graduated Reward Scale
halo-forge uses a graduated scale:
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Complete failure |
| Missing includes | 0.1 | Structural issues |
| Type errors | 0.2 | Semantic issues |
| Compiles with warnings | 0.3 | Almost there |
| Compiles clean | 0.5 | Compilation success |
| Runs, crashes | 0.6 | Runtime issues |
| Runs, wrong output | 0.7 | Executes, but logic is wrong |
| Runs, partial output | 0.8 | Nearly correct |
| Correct output | 1.0 | Full success |
## How It Works
```python
from halo_forge.rlvr.verifiers import RewardLevel

# From compilation result
reward = RewardLevel.from_compile_result(
    success=True,
    has_warnings=True,
)
# Returns 0.3

# From execution result
reward = RewardLevel.from_execution_result(
    compiles=True,
    runs=True,
    correct=False,
)
# Returns 0.7
```
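In a verification loop, the two helpers cover different stages of the same scale. The sketch below is one way to combine them; the `result` attributes (`executed`, `compiled`, `warnings`, `ran`, `correct`) are placeholder names for illustration, not halo-forge's API.

```python
from halo_forge.rlvr.verifiers import RewardLevel

def score_candidate(result) -> float:
    """Score one candidate on the graduated scale.

    `result` is assumed to carry flags from the compile/run verifier;
    the attribute names below are placeholders, not halo-forge's API.
    """
    if not result.executed:
        # Compile-only verification: syntax / include / type / warning outcomes.
        return RewardLevel.from_compile_result(
            success=result.compiled,
            has_warnings=result.warnings,
        )
    # Execution verification: crash / wrong output / correct outcomes.
    return RewardLevel.from_execution_result(
        compiles=result.compiled,
        runs=result.ran,
        correct=result.correct,
    )
```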
## Impact on Training
### Binary Rewards
```
Cycle 1: 15% success → train on 15%
Cycle 2: 18% success → train on 18%
Cycle 3: 20% success → train on 20%
```
Limited learning signal.
### Graduated Rewards
```
Cycle 1: 15% full success, but 40% above threshold
Cycle 2: 22% full success, 55% above threshold
Cycle 3: 30% full success, 65% above threshold
```
More samples contribute to learning.
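The effect is easy to see on a single batch. The scores below are purely illustrative; the point is that a graduated scale lets more samples clear the keep-threshold than a binary one, even before full success rates rise.

```python
# Purely illustrative scores for the same 10 candidates under both schemes.
binary    = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
graduated = [0.1, 0.2, 0.3, 0.3, 0.5, 0.5, 0.6, 0.7, 1.0, 1.0]

threshold = 0.5
print(sum(r >= threshold for r in binary))     # 2 samples kept
print(sum(r >= threshold for r in graduated))  # 6 samples kept
```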
## Threshold Selection
The `reward_threshold` parameter controls which samples are kept for training:
| Threshold | Effect |
|---|---|
| 0.3 | Keep anything that at least compiles with warnings |
| 0.5 | Keep clean compiles and better (recommended) |
| 0.7 | Keep only code that runs |
| 1.0 | Keep only fully correct output |
Start at 0.5 (clean compilation) and raise the threshold as the model improves.
## Curriculum Learning
Use graduated thresholds across cycles:
```yaml
cycles:
  - reward_threshold: 0.3   # Early: accept warnings
  - reward_threshold: 0.5   # Mid: require clean compile
  - reward_threshold: 0.7   # Late: require execution
```
This naturally progresses from “make it compile” to “make it work”.
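The same mechanism can be seen by filtering one scored batch at each cycle's threshold; the rewards here are illustrative, not measured.

```python
# Illustrative rewards for one batch of candidates, filtered at each
# cycle's threshold as the curriculum tightens.
rewards = [0.1, 0.3, 0.3, 0.5, 0.5, 0.6, 0.7, 0.8, 1.0]

for cycle, threshold in enumerate((0.3, 0.5, 0.7), start=1):
    kept = [r for r in rewards if r >= threshold]
    print(f"cycle {cycle}: threshold {threshold} keeps {len(kept)}/{len(rewards)}")
```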
## Research Background
The graduated reward approach is inspired by:
- Reward shaping in the RL literature
- Curriculum learning (Bengio et al., 2009)
- The sparse-reward problem in robotics RL
Key insight: intermediate rewards accelerate learning when the final goal is hard to reach directly.