Troubleshooting

Common issues and solutions

GPU Issues

GPU Not Detected

# Check device access
ls -l /dev/dri /dev/kfd

# Check ROCm
rocm-smi

# Check PyTorch
python3 -c "import torch; print(torch.cuda.is_available())"

Fix: Add udev rules:

sudo tee /etc/udev/rules.d/99-amd-kfd.rules >/dev/null <<'EOF'
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger

Low GPU Utilization

Symptom: GPU utilization stays at 20-30% during training

Cause: Usually a data-loading bottleneck

Fix: Use these dataloader settings:

dataloader_num_workers: 0      # Must be 0 for Strix Halo
dataloader_pin_memory: false   # Must be false
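
If the training script builds its arguments in Python, the same two settings map onto Hugging Face TrainingArguments (a minimal sketch; the output path is illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/sft",        # illustrative path
    per_device_train_batch_size=2,
    dataloader_num_workers=0,       # must be 0 on Strix Halo
    dataloader_pin_memory=False,    # must be False
)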

Wrong Memory Reported

Symptom: PyTorch shows ~25GB instead of 128GB

Explanation: This is normal. PyTorch reports VRAM allocation, not total unified memory.

Check actual usage:

radeontop  # Shows GTT usage
cat /sys/class/drm/card0/device/mem_info_vram_used
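
To compare what PyTorch reports with what is actually allocated, a quick check (illustrative):

import torch

props = torch.cuda.get_device_properties(0)
print(f"Reported device memory: {props.total_memory / 1e9:.1f} GB")  # ~25 GB carve-out
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")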

Memory Issues

Out of Memory During Training

Fix 1: Reduce batch size:

per_device_train_batch_size: 1  # Reduce from 2
gradient_accumulation_steps: 32  # Increase to compensate

Fix 2: Enable gradient checkpointing:

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

Fix 3: Use chunked verification (already default in RAFT):

chunk_size = 200  # Verify in chunks
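
The effect of chunking is that only one chunk's worth of verification results and intermediates is held at a time. A rough sketch, with verify_batch standing in for the actual verifier call:

import gc
import torch

def verify_chunked(samples, verify_batch, chunk_size=200):
    # verify_batch: hypothetical callable, list[str] -> list[bool]
    results = []
    for start in range(0, len(samples), chunk_size):
        results.extend(verify_batch(samples[start:start + chunk_size]))
        gc.collect()               # drop chunk intermediates
        torch.cuda.empty_cache()   # release cached GPU blocks
    return results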

Memory Creep During RAFT

Symptom: Memory usage grows each cycle

Fix: Ensure cleanup between cycles:

gc.collect()
torch.cuda.empty_cache()

Check that verifier cleanup is called:

verifier.cleanup()
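
Putting both together, cleanup typically runs at the end of each cycle, before the next generation pass (a sketch; the helper names are assumptions, not the actual API):

import gc
import torch

for cycle in range(num_cycles):
    samples = generate_samples(model, prompts)   # hypothetical helper
    kept = verifier.verify(samples)              # keep only verified samples
    train_on_samples(model, kept)                # hypothetical helper

    # End-of-cycle cleanup to prevent memory creep
    verifier.cleanup()
    del samples, kept
    gc.collect()
    torch.cuda.empty_cache()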

Training Issues

Loss Not Decreasing

Causes:

  • Learning rate too high or too low
  • Data quality issues
  • Model capacity too small

Fixes:

  • Try learning rates 1e-5, 2e-5, or 5e-5 (see the sweep sketch below)
  • Check the data format
  • Use a larger model
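
A small sweep over the suggested learning rates avoids hand-editing the config each run (sketch; train_model stands in for the actual training entry point):

# Hypothetical sweep: train_model returns a metrics dict
for lr in (1e-5, 2e-5, 5e-5):
    metrics = train_model(learning_rate=lr, output_dir=f"models/sft-lr{lr}")
    print(f"lr={lr}: eval_loss={metrics['eval_loss']:.4f}")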

Training Hangs

Cause: Usually a data-loading or CUDA synchronization issue

Fix:

dataloader_num_workers: 0
dataloader_pin_memory: false

NaN Loss

Cause: Gradient explosion or bad data

Fixes:

  • Reduce the learning rate
  • Enable gradient clipping: max_grad_norm: 0.3
  • Check for corrupt data samples (see the screening sketch below)
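
Corrupt samples can be screened before training; a sketch that assumes each example is a dict with a "text" field:

import math

def find_bad_samples(dataset):
    # Flag empty text and non-finite numeric fields (hypothetical schema)
    bad = []
    for i, example in enumerate(dataset):
        text = example.get("text", "")
        if not isinstance(text, str) or not text.strip():
            bad.append(i)
        elif any(isinstance(v, float) and not math.isfinite(v)
                 for v in example.values()):
            bad.append(i)
    return bad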

Verification Issues

Compiler Not Found

Error: Compiler 'g++' not found in PATH

Fix: Install the compiler inside the toolbox:

# In toolbox
sudo dnf install gcc-c++

# For MinGW
sudo dnf install mingw64-gcc-c++

Verification Timeout

Symptom: Many samples time out during verification

Fixes:

  • Increase the timeout: timeout: 60
  • Check for infinite loops in the generated code
  • Add a memory limit: memory_limit_mb: 256 (see the subprocess sketch below)
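
Both limits can be enforced where the generated binary or test harness runs as a subprocess; a standard-library sketch (the resource limit is Linux-only, and the parameter names mirror the config keys above):

import resource
import subprocess

def run_limited(cmd, timeout=60, memory_limit_mb=256):
    def set_limits():
        # Cap the child's address space (Linux only)
        limit_bytes = memory_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(
        cmd,
        timeout=timeout,        # raises subprocess.TimeoutExpired on hang
        preexec_fn=set_limits,
        capture_output=True,
        text=True,
    )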

Low Compile Rate

Symptom: Fewer than 20% of samples compile

Causes:

  • Weak SFT model
  • Verifier does not match the target language
  • Incorrect prompt format

Fixes:

  • Train SFT for longer
  • Check that the verifier matches the target language
  • Verify the prompt format in the data
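
Before changing anything, confirm the number on a recent batch (sketch; assumes the verifier returns one pass/fail result per sample):

# Hypothetical: verifier.verify returns a list of booleans
results = verifier.verify(samples)
rate = sum(results) / max(len(results), 1)
print(f"Compile rate: {rate:.1%}")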

RAFT Issues

Cycle Degradation

Symptom: Performance drops after cycle 5 or 6

Cause: Overfitting to the verification signal

Fixes:

  • Stop at the peak cycle
  • Add prompt diversity
  • Reduce the learning rate each cycle (see the schedule sketch below)
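
Stopping at the peak and decaying the learning rate per cycle can be combined in the outer loop (sketch; run_raft_cycle and the decay factor are illustrative):

base_lr, decay = 2e-5, 0.8
best_score, best_cycle = float("-inf"), 0

for cycle in range(10):
    lr = base_lr * decay ** cycle                 # reduce LR each cycle
    score = run_raft_cycle(learning_rate=lr)      # hypothetical: returns eval score
    if score > best_score:
        best_score, best_cycle = score, cycle
    elif cycle - best_cycle >= 2:                 # two cycles past the peak: stop
        break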

Samples Not Cached

Symptom: Samples are re-generated on resume

Fix: Check that the cache directory exists:

ls models/raft/cache/

Also ensure there is sufficient disk space.
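
Both can also be checked from Python (the path matches the listing above):

import shutil
from pathlib import Path

cache = Path("models/raft/cache")
cache.mkdir(parents=True, exist_ok=True)            # create it if missing
free_gb = shutil.disk_usage(cache).free / 1e9
print(f"Cache free space: {free_gb:.1f} GB")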

Verifier Resource Leak

Symptom: Temporary files accumulate in /tmp

Fix: Ensure cleanup is always called, even if training fails:

try:
    trainer.run(prompts, num_cycles=5)
finally:
    trainer.verifier.cleanup()

Getting Help

  1. Check the logs in the output directory
  2. Run with verbose logging
  3. Create a minimal reproduction
  4. Check GitHub Issues for similar reports