← Story & Explainers

Concept

Training data, in one diagram

Where models learn from, why it matters, and the licensing questions still unresolved.

A schematic diagram of nodes and arrows on graph paper.
A schematic diagram of nodes and arrows on graph paper.

A modern language model is trained on a mixture of web-crawled text, licensed datasets, and human-written feedback. The exact mix is rarely disclosed in full. The mix shapes what the model knows, what it gets wrong, and whose work it has learned from.

For publishers, the practical question is whether their archive was in the crawl. For readers, the practical question is whether the answers they receive are biased toward the parts of the web that were easiest to scrape.

The licensing questions — who owes whom what, for which use — are unresolved in most jurisdictions. Expect this to be the dominant policy conversation of the next two years.