Math & Reasoning Datasets

Available datasets for mathematical reasoning training.

Supported Datasets

GSM8K (Grade School Math)

Size: ~8,500 problems (7,473 train, 1,319 test)
Difficulty: Elementary to middle school
Source: HuggingFace

Problems require 2-8 reasoning steps. Ideal for teaching basic mathematical reasoning.

# Benchmark
halo-forge reasoning benchmark --dataset gsm8k --limit 100

# Training
halo-forge reasoning train --dataset gsm8k --cycles 4

Example problem:

Janet's ducks lay 16 eggs per day. She eats three for breakfast 
every morning and bakes muffins for her friends every day with 
four. She sells the remainder at the farmers' market daily for 
$2 per fresh duck egg. How much in dollars does she make every 
day at the farmers' market?
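
The problem above reduces to two arithmetic steps: subtract the eggs Janet uses, then multiply by the price. A quick sketch of the expected reasoning, with variable names chosen here purely for illustration:

```python
# Worked solution to the example problem above.
eggs_laid = 16       # eggs per day
eggs_eaten = 3       # eaten at breakfast
eggs_baked = 4       # used for muffins
price_per_egg = 2    # dollars per egg

eggs_sold = eggs_laid - eggs_eaten - eggs_baked   # 16 - 3 - 4 = 9
daily_income = eggs_sold * price_per_egg          # 9 * 2 = 18

print(daily_income)  # 18
```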

MATH (Competition Mathematics)

Size: 12,500 problems
Difficulty: High school to competition level
Subjects: Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, Precalculus
Source: HuggingFace

Competition-level problems across 5 difficulty levels.

# Benchmark
halo-forge reasoning benchmark --dataset math --limit 50

# Training (requires strong base model)
halo-forge reasoning train --dataset math --cycles 4

AIME (American Invitational Mathematics Examination)

Size: ~300 problems
Difficulty: Advanced competition
Source: Historical AIME exams

Very challenging problems. Recommended only for models that already perform well on GSM8K and MATH.

Data Format

All datasets are loaded via HuggingFace and converted to a standard format:

from dataclasses import dataclass

@dataclass
class ReasoningSample:
    question: str      # The math problem
    answer: str        # Expected final answer
    solution: str      # Step-by-step solution (if available)
    difficulty: str    # Difficulty level (if available)
    subject: str       # Math subject (if available)
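
As a rough illustration of how a record in this format might be built and inspected, the sketch below defines a local stand-in with the same fields. The empty-string defaults for the "if available" fields are an assumption made here, not something the library is known to do, and the sample question is made up.

```python
from dataclasses import dataclass, asdict

# Local stand-in mirroring the fields above; the real class lives in halo_forge.
# Empty-string defaults for the optional fields are an assumption.
@dataclass
class ReasoningSample:
    question: str
    answer: str
    solution: str = ""
    difficulty: str = ""
    subject: str = ""

# A made-up sample with only the required fields set.
sample = ReasoningSample(question="What is 12 * 7?", answer="84")

# Optional fields may be empty, so check before relying on them.
has_solution = bool(sample.solution)

# asdict() gives a plain dict, convenient for JSON export or logging.
record = asdict(sample)
print(has_solution)
print(record["answer"])
```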

Loading Datasets in Python

from halo_forge.reasoning.data import load_math_dataset

# Load GSM8K
gsm8k = load_math_dataset("gsm8k", split="train", limit=1000)

# Load MATH
math_data = load_math_dataset("math", split="train", limit=500)

for sample in gsm8k:
    print(f"Q: {sample.question[:50]}...")
    print(f"A: {sample.answer}")
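
Loaded samples can be inspected like ordinary Python objects. The sketch below tallies samples per subject with `collections.Counter`. It uses a few hand-made stand-in records rather than real dataset rows, and it assumes `subject` is an empty string when no subject is available.

```python
from collections import Counter
from types import SimpleNamespace

# Hand-made stand-ins for loaded samples (not real dataset rows).
samples = [
    SimpleNamespace(question="q1", answer="1", subject="Algebra"),
    SimpleNamespace(question="q2", answer="2", subject="Geometry"),
    SimpleNamespace(question="q3", answer="3", subject="Algebra"),
    SimpleNamespace(question="q4", answer="4", subject=""),
]

# Group samples with a missing subject under a single label.
subject_counts = Counter(s.subject or "unknown" for s in samples)

# Print subjects from most to least frequent.
for subject, count in subject_counts.most_common():
    print(f"{subject}: {count}")
```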

Dataset Statistics

Dataset  Train  Test   Avg Steps  Difficulty
GSM8K    7,473  1,319  4.2        Easy-Medium
MATH     7,500  5,000  6.8        Medium-Hard
AIME     ~300   -      8+         Very Hard
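
The split sizes in the table can be turned into totals and train/test ratios with a few lines of Python. This is only a sanity-check sketch using the numbers listed above; AIME is left out because its size is approximate and no test split is listed.

```python
# (train, test) counts copied from the table above.
splits = {"GSM8K": (7473, 1319), "MATH": (7500, 5000)}

# Total problems per dataset.
totals = {name: train + test for name, (train, test) in splits.items()}

# Fraction of each dataset that falls in the training split.
train_share = {name: train / totals[name] for name, (train, _test) in splits.items()}

for name in splits:
    print(f"{name}: {totals[name]} problems, {train_share[name]:.0%} train")
```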

Curriculum Learning

For best results, train in order of difficulty:

1. GSM8K (4 cycles): build basic reasoning
2. MATH, Levels 1-3 (4 cycles): intermediate skills
3. MATH, Levels 4-5 (2 cycles): advanced techniques

# Stage 1: GSM8K
halo-forge reasoning train --dataset gsm8k --cycles 4 --output models/stage1

# Stage 2: MATH
halo-forge reasoning train --dataset math --cycles 4 --output models/stage2
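
The commands above train on each dataset as a whole. To separate MATH into the Level 1-3 and Level 4-5 stages of the curriculum, one option is to filter samples in Python first. The sketch below assumes the `difficulty` field holds strings such as "Level 3", which may not match how halo-forge stores it. `parse_level` and `split_by_level` are hypothetical helpers, not part of halo-forge, and the samples are hand-made stand-ins.

```python
import re
from types import SimpleNamespace

def parse_level(difficulty):
    """Return the integer level from a string like 'Level 3', or None."""
    match = re.search(r"(\d+)", difficulty or "")
    return int(match.group(1)) if match else None

def split_by_level(samples, cutoff=3):
    """Split samples into (levels <= cutoff, levels > cutoff)."""
    easier, harder = [], []
    for sample in samples:
        level = parse_level(sample.difficulty)
        if level is None:
            continue  # skip samples without a usable level
        (easier if level <= cutoff else harder).append(sample)
    return easier, harder

# Hand-made stand-ins for MATH samples (not real dataset rows).
samples = [SimpleNamespace(difficulty=f"Level {n}") for n in (1, 3, 4, 5)]
samples.append(SimpleNamespace(difficulty=""))  # no level available

stage2, stage3 = split_by_level(samples)
print(len(stage2), len(stage3))
```

Samples whose level cannot be parsed are dropped rather than guessed, so the two stages stay cleanly separated.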