Data Generation
Preparing training data for SFT and RAFT
Data Format
halo-forge expects JSONL files with this structure:
{"prompt": "Write a function to calculate factorial", "completion": "int factorial(int n) {...}"}
{"prompt": "Implement binary search", "completion": "int binary_search(int arr[], int n, int target) {...}"}
For RAFT, you only need prompts:
{"prompt": "Write a function to calculate factorial"}
{"prompt": "Implement binary search"}
Public Datasets
CodeForces
halo-forge data prepare \
--dataset codeforces_cpp \
--output data/codeforces.jsonl \
--limit 1000
Best for: C++ code generation, algorithmic problems
MBPP
halo-forge data prepare \
--dataset mbpp \
--output data/mbpp.jsonl
Best for: Python functions, simpler problems
HumanEval
halo-forge data prepare \
--dataset humaneval \
--output data/humaneval.jsonl
Best for: Evaluation benchmark, Python
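If you prefer to build the JSONL yourself, or want to inspect what `data prepare` produces, the same layout can be assembled with the Hugging Face `datasets` library. A sketch for MBPP (the `text`/`code` column names come from the public MBPP release; the output path is a placeholder, and `halo-forge data prepare` remains the supported route):

```python
import json
from datasets import load_dataset  # pip install datasets

# MBPP's "text" column holds the problem statement, "code" the reference solution.
mbpp = load_dataset("mbpp", split="train")

with open("data/mbpp_manual.jsonl", "w") as f:
    for row in mbpp:
        f.write(json.dumps({"prompt": row["text"], "completion": row["code"]}) + "\n")
```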
LLM Generation
Generate domain-specific training data using LLMs.
With Ollama (Local)
halo-forge data generate \
--prompts prompts.txt \
--backend ollama \
--model deepseek-coder:6.7b \
--output generated.jsonl \
--samples 3
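Conceptually this samples several completions per prompt from a local Ollama server. If you need more control than the CLI flags expose, a rough stand-alone equivalent against Ollama's documented `/api/generate` endpoint looks like this (a sketch, not the backend's actual implementation; the output path is a placeholder):

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "deepseek-coder:6.7b"
SAMPLES_PER_PROMPT = 3

# One prompt per line, matching the prompt file format below.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

with open("generated.jsonl", "w") as out:
    for prompt in prompts:
        for _ in range(SAMPLES_PER_PROMPT):
            resp = requests.post(
                OLLAMA_URL,
                json={"model": MODEL, "prompt": prompt, "stream": False},
                timeout=300,
            )
            resp.raise_for_status()
            completion = resp.json()["response"]
            out.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```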
With the Claude API
export ANTHROPIC_API_KEY=your_key
halo-forge data generate \
--prompts prompts.txt \
--backend anthropic \
--model claude-3-sonnet \
--output generated.jsonl
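The Anthropic backend amounts to one API call per prompt. A sketch with the official `anthropic` Python SDK, reading `ANTHROPIC_API_KEY` from the environment (the dated model ID and `max_tokens` value are illustrative choices, not halo-forge defaults):

```python
import json
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

with open("generated.jsonl", "w") as out:
    for prompt in prompts:
        message = client.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        completion = message.content[0].text
        out.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```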
Prompt File Format
Plain text, one prompt per line:
Write a function to reverse a linked list
Implement a thread-safe queue in C++
Create a binary search tree with insert and delete
Data Quality Tips
- Diversity matters: Include a range of problem types and topics
- Verify examples: Run every completion through your verifier before training (see the sketch after this list)
- Balance difficulty: Mix easy and hard problems rather than clustering at one end
- Clean formatting: Consistent code style in completions gives the model a cleaner target to imitate
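For the verification tip, a filtering pass before training can be as simple as the following. The `verify()` function is a hypothetical placeholder for whatever your verifier actually checks; here it only asks whether a C++ snippet passes `g++ -fsyntax-only`:

```python
import json
import subprocess
import tempfile

def verify(completion: str) -> bool:
    """Hypothetical check: does the C++ snippet at least parse? Swap in your real verifier."""
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(completion)
        path = f.name
    result = subprocess.run(["g++", "-fsyntax-only", path], capture_output=True)
    return result.returncode == 0

# Keep only records whose completions pass the check.
with open("data/all.jsonl") as src, open("data/verified.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if verify(record["completion"]):
            dst.write(line)
```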
Splitting Data
# Create train/test split
halo-forge data split \
--input data/all.jsonl \
--train data/train.jsonl \
--test data/test.jsonl \
--ratio 0.9
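If you want to control the split yourself (for example, to fix the random seed), the same 90/10 split is a few lines of Python; the paths mirror the command above:

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

with open("data/all.jsonl") as f:
    records = [line for line in f if line.strip()]

random.shuffle(records)
cut = int(len(records) * 0.9)  # matches --ratio 0.9

with open("data/train.jsonl", "w") as f:
    f.writelines(records[:cut])
with open("data/test.jsonl", "w") as f:
    f.writelines(records[cut:])
```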