“AI Inbreeding,” The Phenomenon Threatening Artificial Intelligence

PARIS—For several months now, a strange phenomenon has been puzzling internet users: images generated by artificial intelligence are taking on a yellowish tint. Whether in memes, photos or videos, this visual bias has become so widespread that numerous tutorials on how to “de-yellow” these images are now circulating on the internet. For specialists, this is about more than just aesthetics. This uniformity could be a sign of a deeper crisis threatening AI. The cause? A form of digital “inbreeding” that occurs when AI models are trained on data already produced by other AI systems.

“Training future generations of software models on previous generations ultimately causes the model to collapse,” says Alain Goudey, associate dean in charge of digital technology at Neoma Business School.

This phenomenon has been dubbed “Habsburg AI” by researcher Jathan Sadowski, as it recalls the negative effects of inbreeding within the Habsburg dynasty. And the experiments speak for themselves. In a study published in the scientific journal Nature, British and Canadian researchers trained an AI model to draw handwritten numbers based on a real dataset.

They then repeated the process, reusing the numbers generated by the AI at each stage. After 20 generations, the numbers had become blurred, and after 30, they had converged into a single indistinct form. “This study, published in 2024, shows that in just five generations of training on self-generated data, the system’s biases and flaws are already amplified,” Goudey says. “The variation decreases — that is, the diversity — but so does the accuracy of the responses.”
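The loss of variation Goudey describes can be reproduced in a toy setting. The Python sketch below is not the experiment from the Nature study: it assumes a deliberately simple "model" (a one-dimensional Gaussian fitted to ten data points), and the function name `simulate_collapse` and its parameters are hypothetical, chosen only for illustration. Each generation is fitted solely to samples drawn from the previous generation's fit, and the fitted spread tends to shrink over time.

```python
import random
import statistics


def simulate_collapse(generations=100, sample_size=10, seed=0):
    """Toy illustration: refit a Gaussian to its own samples, again and again.

    Returns the fitted standard deviation at each generation
    (index 0 is the fit to the original "real" data).
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations + 1):
        # "Train" the model: estimate mean and standard deviation from the data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        stds.append(sigma)
        # The next generation sees only samples produced by the fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return stds


if __name__ == "__main__":
    stds = simulate_collapse()
    for gen in (0, 5, 20, 50, 100):
        print(f"generation {gen:3d}: std = {stds[gen]:.6f}")
```

Because every fit is made from a small, finite sample, estimation errors compound from one generation to the next and rare values in the tails are the first to be lost. This is only a rough analogue of what the researchers observed in far larger image and text models.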

As for text, the result is hardly more reassuring: over the course of several iterations, a chatbot tasked with completing the sentence “To cook a turkey for Thanksgiving, you…” began reciting endless lists before getting bogged down in absurd repetitions: “… you need to know what you’re going to do with your life if you don’t know,” ad infinitum. “The system tends to focus on the average, and outliers begin to disappear,” Goudey says. “This is known as early collapse, followed by late collapse, where responses become depleted and sometimes very far from reality.”

This degenerative process stems from a shortage of usable human data. The main models, such as ChatGPT, Gemini and Claude, have already been trained on most of the content available online. To continue growing, companies are turning to synthetic data, which is more abundant, less expensive and royalty-free. But this data is often of lower quality, which amplifies the risk of “inbreeding.” According to Goudey, “simply having 0.01% of contaminated data can lead to a drastic drop in performance, whether in images, text or video.”

Fragile defenses

When it comes to images, the standardization is visible to the naked eye. “We see a recurring yellow filter, which some attribute to the popularity of ‘Ghibli-style’ images that have spread on social media,” he explains. “But ultimately, it’s mainly a symptom of homogenization, loss of creativity, and increased bias” in generative AI models. Training a system also becomes more costly in terms of computing power and energy, while the diversity of results is reduced — a digital “impoverishment” that could undermine confidence in these tools. “It’s a bit like if the ninth edition of a tourist guide to the Île-de-France region only talked about the Arc de Triomphe and the Eiffel Tower,” Goudey says.

To avoid self-impairment, the most obvious solution remains to favor diverse human content. Some companies, such as OpenAI and Mistral AI, are already forming partnerships with image banks and news agencies. Others are focusing on detecting and tagging AI-generated content. “But at present, there is no way to guarantee that content is entirely human-generated,” Goudey warns. Watermarking, which involves inserting recognizable signals to identify AI-generated content, remains weak and easy to circumvent.

Other avenues are being explored: cleaning up the model training data sets by detecting and removing “contaminated” content, or compiling data sets of the highest possible quality from human sources.

“Major AI companies, such as OpenAI working with the Associated Press and Mistral AI with AFP, are seeking to ensure that future generations of models will be trained on authentic data,” Goudey notes. “But it’s a race against time, because the phenomenon is an exponential one.” Without a course correction, AI could well enter a new era: one that is more biased and strangely monotonous.