Code Datasets

This guide covers obtaining, formatting, and validating datasets for code generation training.

Available Datasets

Built-in Datasets

List available datasets with the CLI:

halo-forge data prepare --list

| Dataset | Source | Language | Size | Use Case |
| --- | --- | --- | --- | --- |
| codeforces_cpp | open-r1/codeforces-cots | C++ | ~4000 | Competitive programming |
| codeforces_python | open-r1/codeforces-cots | Python | ~1000 | Competitive programming |
| codeforces_rust | open-r1/codeforces-cots | Rust | ~500 | Systems programming |
| mbpp | google-research-datasets/mbpp | Python | ~1000 | Basic Python |
| humaneval | openai/openai_humaneval | Python | 164 | Function synthesis |
| humaneval_plus | evalplus/humanevalplus | Python | 164 | Extended HumanEval |
| livecodebench | livecodebench/code_generation_lite | Multiple | Variable | Real-world coding |

Downloading Datasets

Single Dataset

# Download and format CodeForces C++
halo-forge data prepare \
  --dataset codeforces_cpp \
  --output data/codeforces_cpp.jsonl

# Download MBPP
halo-forge data prepare \
  --dataset mbpp \
  --output data/mbpp.jsonl

Multiple Datasets

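# Download several datasets
for ds in codeforces_cpp mbpp humaneval_plus; do
  halo-forge data prepare --dataset "$ds" --output "data/${ds}.jsonl"
done

# Combine for training. List the inputs explicitly: the output file also
# lives in data/, so a data/*.jsonl glob would match it on a re-run.
cat data/codeforces_cpp.jsonl data/mbpp.jsonl data/humaneval_plus.jsonl \
  > data/combined_train.jsonl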
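
The combine step can also be done in Python when the set of input files changes often. The following is a minimal sketch, assuming the inputs are plain JSON Lines files; `merge_jsonl` is a hypothetical helper written for this guide, not part of halo-forge. It skips the output file if it appears among the inputs and drops blank or malformed lines rather than copying them through.

```python
import json
from pathlib import Path


def merge_jsonl(sources, destination):
    """Concatenate JSONL files into one file.

    Blank lines and lines that are not valid JSON are skipped.
    Returns the number of records written.
    """
    destination = Path(destination)
    written = 0
    with destination.open("w", encoding="utf-8") as out:
        for source in map(Path, sources):
            # Never read the file we are currently writing to.
            if source.resolve() == destination.resolve():
                continue
            with source.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        continue  # drop malformed lines
                    out.write(json.dumps(record) + "\n")
                    written += 1
    return written


# Example call (paths follow the shell example above):
# merge_jsonl(sorted(Path("data").glob("*.jsonl")), "data/combined_train.jsonl")
```

Re-serializing each record with `json.dumps` also normalizes every line to a single-line JSON object, which keeps the combined file valid even if an input file lacks a trailing newline.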

Sample Data Locations

The repository includes sample data for quick testing:

| Path | Description |
| --- | --- |
| data/rlvr/humaneval_prompts.jsonl | HumanEval prompts for RLVR |
| data/rlvr/mbpp_train_prompts.jsonl | MBPP prompts for RLVR |
| data/samples/codeforces_cpp_500.jsonl | 500 CodeForces C++ samples |
| datasets/windows_curriculum/ | Windows systems programming |

Using Sample Data

# Quick RAFT training with HumanEval
halo-forge raft train \
  --prompts data/rlvr/humaneval_prompts.jsonl \
  --model Qwen/Qwen2.5-Coder-0.5B \
  --verifier python \
  --cycles 3

# SFT with CodeForces samples
halo-forge sft train \
  --data data/samples/codeforces_cpp_500_sft.jsonl \
  --model Qwen/Qwen2.5-Coder-0.5B

Windows Curriculum

A specialized dataset for Windows systems programming:

# Location
ls datasets/windows_curriculum/

# Files:
# - windows_systems_full_rlvr.jsonl (361 problems, RLVR format)
# - windows_systems_full_sft.jsonl (SFT format)
# - curriculum_order_full.json (tier metadata)

Tier Structure

| Tier | Level | Problems | Topics |
| --- | --- | --- | --- |
| 1 | Foundations | 84 | Process info, file I/O, registry |
| 2 | Core APIs | 128 | Process enum, memory mapping, pipes |
| 3 | Intermediate | 72 | PE parsing, tokens, services |
| 4 | Advanced | 77 | ETW, native API, syscalls |

Training with Windows Curriculum

# RAFT training (requires a Windows build server)
halo-forge raft train \
  --prompts datasets/windows_curriculum/windows_systems_full_rlvr.jsonl \
  --model Qwen/Qwen2.5-Coder-1.5B \
  --verifier msvc \
  --cycles 6 \
  --output models/windows_coder

Data Formats

RLVR Format (for RAFT training)

{
  "prompt": "Write a function to sort a list...",
  "test_cases": [
    {"input": "[3,1,2]", "expected": "[1,2,3]"}
  ]
}

SFT Format (for supervised training)

{
  "messages": [
    {"role": "system", "content": "You are an expert programmer."},
    {"role": "user", "content": "Write a function to sort a list..."},
    {"role": "assistant", "content": "```python\ndef sort_list(lst):\n    return sorted(lst)\n```"}
  ]
}
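
To turn a prompt and a known-good solution into a record of this shape, something like the sketch below may help. It assumes the three-message layout and the fenced-code answer style shown in the example above; `to_sft_record` and its parameters are hypothetical names chosen for illustration, not a halo-forge API.

```python
import json

# System prompt copied from the SFT example above.
DEFAULT_SYSTEM_PROMPT = "You are an expert programmer."


def to_sft_record(prompt, solution, language="python",
                  system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Build one SFT record: system, user, and assistant messages.

    The solution is wrapped in a Markdown code fence tagged with `language`,
    mirroring the assistant message in the example above.
    """
    fence = "`" * 3  # built programmatically to avoid a literal fence here
    answer = f"{fence}{language}\n{solution.rstrip()}\n{fence}"
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }


record = to_sft_record(
    "Write a function to sort a list...",
    "def sort_list(lst):\n    return sorted(lst)\n",
)
# One JSON object per line is what a .jsonl file expects.
print(json.dumps(record))
```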

Creating Custom Datasets

From HuggingFace

from datasets import load_dataset
import json

# Load any HuggingFace dataset
ds = load_dataset("your_org/your_dataset", split="train")

# Convert to RLVR format
with open("custom_rlvr.jsonl", "w") as f:
    for item in ds:
        record = {
            "prompt": item["problem"],
            "test_cases": item.get("test_cases", [])
        }
        f.write(json.dumps(record) + "\n")

Using LLM Generation

# Generate data with DeepSeek
halo-forge data generate \
  --topic rust_async \
  --backend deepseek \
  --num-examples 100 \
  --output data/rust_async.jsonl

HuggingFace Sources

| Dataset | HuggingFace Path | Description |
| --- | --- | --- |
| CodeForces | open-r1/codeforces-cots | Competitive programming with CoT |
| MBPP | google-research-datasets/mbpp | Basic Python problems |
| HumanEval | openai/openai_humaneval | Classic benchmark |
| HumanEval+ | evalplus/humanevalplus | Extended tests |
| LiveCodeBench | livecodebench/code_generation_lite | Modern benchmarks |
| APPS | codeparrot/apps | Large coding dataset |
| CodeContests | deepmind/code_contests | Competition problems |

Downloading Directly

from datasets import load_dataset

# Load CodeForces
cf = load_dataset("open-r1/codeforces-cots", split="train")

# Filter by language
cpp_problems = cf.filter(lambda x: x["language"] == "cpp")

# Save
cpp_problems.to_json("codeforces_cpp.jsonl")

Dataset Validation

Validate dataset format before training:

# Validate format
halo-forge data validate --file data/custom.jsonl

# Expected output:
# Validated 1000 samples
# Format: RLVR
# Fields: prompt, test_cases
# Errors: 0
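
For a quick local pre-check before running the CLI, the two formats described under Data Formats can be told apart by their top-level keys. This is a rough sketch under that assumption only; `detect_format` and `check_jsonl` are hypothetical helpers and do not reproduce whatever checks `halo-forge data validate` actually performs.

```python
import json


def detect_format(record):
    """Return "RLVR", "SFT", or None, based on the record's top-level keys."""
    if not isinstance(record, dict):
        return None
    # RLVR: a string prompt plus a list of test cases (assumed from the example).
    if isinstance(record.get("prompt"), str) and isinstance(
        record.get("test_cases"), list
    ):
        return "RLVR"
    # SFT: a non-empty list of messages, each with a role and content.
    messages = record.get("messages")
    if isinstance(messages, list) and messages and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    ):
        return "SFT"
    return None


def check_jsonl(path):
    """Count records per format and collect (line_number, message) errors."""
    counts = {"RLVR": 0, "SFT": 0}
    errors = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((number, f"invalid JSON: {exc.msg}"))
                continue
            fmt = detect_format(record)
            if fmt is None:
                errors.append((number, "unrecognized record shape"))
            else:
                counts[fmt] += 1
    return counts, errors
```

A file that reports nonzero counts for both formats mixes RLVR and SFT records, which is worth catching before training even if every line parses. The CLI validator remains the authoritative check.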

Next Steps