Learning Rate Strategies
Experimental learning rate recommendations for RAFT training
Experimental: These recommendations are based on observations, not exhaustive validation.
Why This Matters
During production training, we observed degradation after cycle 6 using constant LR:
| Cycle | Compile Rate | LR Used | Result |
|---|---|---|---|
| 1-6 | Improving → 46.7% | 5e-5 | Peak performance |
| 7 | 29.3% | 5e-5 | Degradation begins |
| 8 | 23.0% | 5e-5 | Further degradation |
Hypothesis: Decreasing LR across cycles might prevent late-cycle degradation.
Quick Reference
| Phase | Learning Rate | Notes |
|---|---|---|
| SFT (LoRA) | 2e-4 | Well-established |
| RAFT Cycle 1 | 5e-5 | ~1/4 of SFT LR |
| RAFT Cycle 5 | 2e-5 to 3e-5 | Theoretical decay |
Rule of thumb: RAFT LR ≈ SFT LR / 4
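As a minimal sketch of that rule (the helper name is an assumption, not an existing utility):

```python
def raft_lr_from_sft(sft_lr: float, divisor: float = 4.0) -> float:
    """Rule of thumb: RAFT base LR is roughly the SFT LR divided by 4."""
    return sft_lr / divisor

print(raft_lr_from_sft(2e-4))  # 5e-05, the RAFT Cycle 1 value in the table above
```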
Why Different Learning Rates?
SFT vs RAFT
| Aspect | SFT | RAFT |
|---|---|---|
| Data source | Fixed, curated | Model-generated (filtered) |
| Goal | Learn new capabilities | Refine behavior |
| Risk | Overfit to training data | Mode collapse |
| Recommended LR | 2e-4 | 5e-5 |
The Distribution Shift Problem
Each RAFT cycle trains on data from the previous model:
Cycle 1: Train on outputs from base model
Cycle 2: Train on outputs from Cycle 1 model
...
This creates compounding effects:
- Good: Model improves at generating verified code
- Bad: Distribution narrows, diversity decreases
- Risk: Collapse to single “safe” pattern
A lower LR in later cycles keeps updates small as the distribution contracts.
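To make the loop concrete, here is a structural sketch only (not the actual training code): `generate`, `verify`, and `train` stand in for model sampling, the verification step, and one epoch of fine-tuning, and the per-cycle decay anticipates Strategy B below.

```python
from typing import Callable, List

def run_raft(
    generate: Callable[[], List[str]],          # sample outputs from the current model
    verify: Callable[[str], bool],              # e.g. "does the code compile?"
    train: Callable[[List[str], float], None],  # one epoch on the kept samples at the given LR
    num_cycles: int = 5,
    base_lr: float = 5e-5,
    decay: float = 0.85,
) -> None:
    """Illustrative RAFT outer loop: each cycle trains on the previous
    model's verified outputs, with a smaller LR as the distribution narrows."""
    for cycle in range(1, num_cycles + 1):
        kept = [s for s in generate() if verify(s)]  # keep only verified samples
        lr = base_lr * decay ** (cycle - 1)          # smaller updates in later cycles
        train(kept, lr)
```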
Decay Strategies
Strategy A: Constant (Baseline)
training:
  learning_rate: 5e-5  # same LR for all cycles
Simple, but may cause degradation after 5-6 cycles.
Strategy B: Exponential Decay (Recommended)
def get_cycle_lr(base_lr: float, cycle: int, decay: float = 0.85) -> float:
    """Exponentially decay the per-cycle base LR: base_lr * decay^(cycle - 1)."""
    return base_lr * (decay ** (cycle - 1))
| Cycle | LR (0.85 decay) |
|---|---|
| 1 | 5.0e-5 |
| 2 | 4.25e-5 |
| 3 | 3.61e-5 |
| 4 | 3.07e-5 |
| 5 | 2.61e-5 |
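A quick sanity check of the schedule, using `get_cycle_lr` exactly as defined above:

```python
for cycle in range(1, 6):
    print(f"Cycle {cycle}: {get_cycle_lr(5e-5, cycle):.2e}")
# Cycle 1: 5.00e-05
# Cycle 2: 4.25e-05
# Cycle 3: 3.61e-05
# Cycle 4: 3.07e-05
# Cycle 5: 2.61e-05
```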
Strategy C: Manual Schedule
training:
  learning_rate_schedule:
    cycle_1: 5.0e-5
    cycle_2: 4.0e-5
    cycle_3: 3.0e-5
    cycle_4: 2.5e-5
    cycle_5: 2.0e-5
Full control, but requires tuning.
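If the schedule is loaded from a config like the one above, a small lookup helper might look like this sketch (the dict layout mirrors the YAML; falling back to the last defined cycle is an assumption, not part of any existing loader):

```python
def lr_for_cycle(schedule: dict, cycle: int) -> float:
    """Return the LR for a cycle, reusing the last defined entry beyond the schedule."""
    key = f"cycle_{cycle}"
    if key in schedule:
        return schedule[key]
    last_key = max(schedule, key=lambda k: int(k.split("_")[1]))
    return schedule[last_key]

schedule = {
    "cycle_1": 5.0e-5, "cycle_2": 4.0e-5, "cycle_3": 3.0e-5,
    "cycle_4": 2.5e-5, "cycle_5": 2.0e-5,
}
print(lr_for_cycle(schedule, 3))  # 3e-05
print(lr_for_cycle(schedule, 7))  # 2e-05 (past the schedule, keep the last value)
```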
Decay Factor Comparison
Base LR: 5e-5 across 5 cycles
| Factor | Cycle 1 | Cycle 3 | Cycle 5 | Notes |
|---|---|---|---|---|
| Constant | 5.0e-5 | 5.0e-5 | 5.0e-5 | Baseline |
| 0.95 (gentle) | 5.0e-5 | 4.5e-5 | 4.1e-5 | Minimal decay |
| 0.85 (standard) | 5.0e-5 | 3.6e-5 | 2.6e-5 | Recommended |
| 0.70 (aggressive) | 5.0e-5 | 2.5e-5 | 1.2e-5 | If standard decay fails |
Diagnostic Signals
LR Too High
- Training loss oscillates or spikes
- Cycle N+1 worse than Cycle N
- Gradient norm frequently hits max
- Outputs become repetitive
LR Too Low
- Training loss barely moves
- Verification rate stays flat across multiple cycles
- Very small gradient norms (< 0.05)
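A rough post-cycle check based on these signals; the thresholds follow the bullets above, but the function itself is only illustrative and the clipping max of 1.0 is an assumption:

```python
def diagnose_lr(grad_norms: list, max_grad_norm: float = 1.0) -> str:
    """Heuristic LR check from per-step gradient norms (healthy range: roughly 0.1-0.2)."""
    clipped = sum(g >= max_grad_norm for g in grad_norms) / len(grad_norms)
    tiny = sum(g < 0.05 for g in grad_norms) / len(grad_norms)
    if clipped > 0.5:
        return "LR likely too high: gradient norm frequently hits the clipping max"
    if tiny > 0.5:
        return "LR likely too low: very small gradient norms (< 0.05)"
    return "No obvious LR problem from gradient norms alone"

print(diagnose_lr([0.12, 0.15, 0.18, 0.11]))
```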
Interaction Effects
| Change | LR Adjustment |
|---|---|
| LoRA rank ↑ | Slightly lower |
| Batch size ↑ | Slightly higher |
| Temperature ↑ | Can go higher |
| Smaller dataset | Lower |
Within-Cycle Settings
For short training (1 epoch on filtered samples):
training:
  warmup_steps: 0  # few total steps, skip warmup
  lr_scheduler_type: linear
With gradient accumulation:
500 samples / (batch 2 × accumulation 16) ≈ 16 optimizer steps
With so few steps, within-cycle scheduling has minimal effect. Focus on getting base LR right.
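The arithmetic above as a tiny helper (names are illustrative):

```python
import math

def optimizer_steps(num_samples: int, batch_size: int, grad_accum: int, epochs: int = 1) -> int:
    """Optimizer updates per cycle: samples / (batch size * gradient accumulation), per epoch."""
    return math.ceil(num_samples / (batch_size * grad_accum)) * epochs

print(optimizer_steps(500, 2, 16))  # 16: too few steps for within-cycle scheduling to matter
```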
Recommendations
- Start with 5e-5 constant — Establish baseline
- Monitor gradient norms — Should be 0.1-0.2, not clipping
- Watch for degradation — If cycle N+1 drops, reduce LR
- Try 0.85 decay — If constant causes late degradation
- Log everything — Data drives decisions
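One lightweight way to act on "Log everything" is to append a per-cycle record after each cycle; the file name and field set here are just an example, not an existing convention:

```python
import json
import time

def log_cycle(path: str, cycle: int, lr: float, compile_rate: float, mean_grad_norm: float) -> None:
    """Append one JSON line per RAFT cycle so LR decisions can be made from data."""
    record = {
        "timestamp": time.time(),
        "cycle": cycle,
        "learning_rate": lr,
        "compile_rate": compile_rate,
        "mean_grad_norm": mean_grad_norm,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: cycle 1 at the base LR, with compile rate and gradient norm from monitoring
log_cycle("raft_cycles.jsonl", cycle=1, lr=5e-5, compile_rate=0.467, mean_grad_norm=0.15)
```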