There’s a theory that a rising tide of LLM-generated nonsense will eventually drown both LLMs themselves and the internet as a whole. The idea goes like this: The first generation of LLMs is trained entirely on “real” material: Project Gutenberg, 4chan, that one article from Thought Catalog a decade ago, and everything in between. But as the output of those LLMs spreads across the internet, it also becomes part of the training data of future LLMs, and much of it is bullshit. As a result, the quality of newer LLMs’ training data is inferior to that of their predecessors, and by extension, so is their output. And as that output accumulates on the internet, it becomes part of future training data, and the cycle continues. With each passing day, the proportion of the internet that’s low-quality LLM-generated bullshit increases, until eventually all that’s left to train LLMs is the gibberish created by their predecessors.…