Training Data
The dataset used to teach a machine learning model patterns and relationships, directly shaping the model's capabilities and limitations.
Also known as: Training Dataset, Training Corpus, Training Set
Category: AI
Tags: ai, machine-learning, datasets, fundamentals, training
Explanation
Training data is the collection of examples from which a machine learning model learns. For language models, this primarily consists of text — but the quality, diversity, and composition of that text profoundly influence what the model can do and how it behaves.
**Types of Training Data for LLMs**:
- **Web text**: Crawled websites, forums, and online discussions (e.g., Common Crawl)
- **Books**: Fiction and non-fiction books providing sustained, high-quality writing
- **Code**: Source code from repositories like GitHub
- **Academic papers**: Scientific literature and research
- **Wikipedia**: Encyclopedic knowledge across many languages
- **Curated datasets**: Specifically prepared instruction-following examples, conversations, and task demonstrations
**Data Quality Hierarchy**:
Not all training data is equal. Models benefit most from:
1. High-quality, well-written text (books, edited articles)
2. Diverse domains and perspectives
3. Factually accurate information
4. Clean, well-formatted content
5. Appropriately balanced representation of topics
Low-quality data (spam, SEO content, duplicates) can degrade model performance.
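Quality filtering like this is often done with simple heuristics before heavier model-based filters. The sketch below is illustrative only; the thresholds and rules are assumptions for demonstration, not taken from any production pipeline:

```python
def passes_quality_filters(text: str) -> bool:
    """Toy heuristic quality filter for one training document.

    All thresholds here are illustrative, not from a real pipeline.
    """
    words = text.split()
    if len(words) < 50:                       # too short to be useful
        return False
    # Heavy repetition often signals spam or boilerplate
    if len(set(words)) / len(words) < 0.3:    # low lexical diversity
        return False
    # Mostly-symbol content (e.g. markup debris) is unlikely to be prose
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return alpha_ratio >= 0.8
```

Real pipelines layer many more signals (language identification, perplexity under a reference model, blocklists), but the shape is the same: cheap per-document checks that discard the worst content early.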
**Scale of Modern Training Data**:
- GPT-3 was trained on ~300 billion tokens
- LLaMA was trained on ~1.4 trillion tokens
- Modern frontier models use multi-trillion token datasets
- The total amount of high-quality text available on the internet is estimated at 4–15 trillion tokens
**Data Processing Pipeline**:
1. **Collection**: Crawling, scraping, or curating text sources
2. **Filtering**: Removing low-quality, toxic, or duplicate content
3. **Deduplication**: Eliminating repeated text that would skew learning
4. **Cleaning**: Fixing formatting, removing boilerplate, normalizing text
5. **Tokenization**: Converting cleaned text into tokens
6. **Mixing**: Combining sources in proportions that optimize performance
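The pipeline above can be sketched end to end. This is a minimal toy version: the function names, the per-source "weights", and the whitespace tokenizer are all stand-ins (real systems use near-duplicate detection such as MinHash, and subword tokenizers such as BPE):

```python
import hashlib
import re

def clean(doc: str) -> str:
    """Cleaning: strip leftover HTML tags and normalize whitespace."""
    doc = re.sub(r"<[^>]+>", " ", doc)
    return re.sub(r"\s+", " ", doc).strip()

def dedupe(docs: list[str]) -> list[str]:
    """Deduplication: drop exact repeats by content hash."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def tokenize(doc: str) -> list[str]:
    """Tokenization stand-in: whitespace split instead of a real BPE."""
    return doc.split()

def build_corpus(sources: dict[str, list[str]],
                 weights: dict[str, float]) -> list[list[str]]:
    """Mixing: combine cleaned, deduplicated sources.

    `weights` gives the fraction of each source's documents to keep --
    a crude stand-in for token-level mixing proportions.
    """
    corpus = []
    for name, docs in sources.items():
        docs = dedupe([clean(d) for d in docs])
        keep = max(1, int(len(docs) * weights.get(name, 1.0)))
        corpus.extend(tokenize(d) for d in docs[:keep])
    return corpus
```

Note the ordering mirrors the list above: cleaning and deduplication happen on raw text, and tokenization only runs on what survives, so compute is not wasted on documents that will be discarded.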
**Training Data Challenges**:
- **Bias**: Training data reflects societal biases present in its sources
- **Knowledge cutoff**: The model only knows what was in its training data up to a certain date
- **Copyright and licensing**: Legal questions around using copyrighted text for training
- **Data contamination**: When evaluation benchmarks accidentally appear in training data, inflating performance metrics
- **Data poisoning**: Malicious actors can try to inject harmful content into training data
- **Scaling wall**: High-quality text data is finite — models may soon exhaust available human-written text
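One of the challenges above, data contamination, is commonly checked by looking for verbatim n-gram overlap between training documents and benchmark items. A minimal sketch of that idea follows; the choice of n = 8 and the 0.5 threshold are illustrative assumptions, not a standard:

```python
def ngrams(tokens: list[str], n: int = 8) -> set[tuple]:
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, benchmark_item: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a benchmark item if a large fraction of its n-grams
    appear verbatim in a training document.

    n and threshold are illustrative; real decontamination efforts
    tune both and scan the full corpus, not a single document.
    """
    bench = ngrams(benchmark_item.split(), n)
    if not bench:
        return False
    train = ngrams(train_doc.split(), n)
    return len(bench & train) / len(bench) >= threshold
```

In practice this scan runs over trillions of tokens, so the training-side n-grams are stored in hashed or Bloom-filter form rather than compared document by document.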
**Synthetic Data**:
As natural data becomes scarce, synthetic data (generated by AI models themselves) is increasingly used. However, training primarily on synthetic data risks model collapse — a degradation in quality and diversity over generations.