57% Of The Internet Could Be AI-Generated Content

You’re not imagining things; the quality of search results is indeed declining. Recent research from Amazon Web Services (AWS) suggests that as much as 57% of online content is either generated by artificial intelligence or machine-translated by AI.

The research paper, titled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism,” points to the widespread use of low-cost machine translation (MT) as a major factor. Cheap MT lets publishers reproduce existing content in many languages at once, and the study identifies this practice as a significant contributor to the volume of online material now considered AI-generated. According to the researchers, “Machine-generated, multi-way parallel translations not only dominate the translated content on the web in low-resource languages where MT is prevalent, but they also make up a substantial portion of all web content available in those languages.”

The researchers also uncovered a type of selection bias in the content that is translated into multiple languages versus that which remains in a single language. “Content that is translated into multiple languages tends to be shorter, more predictable, and has a different distribution of topics compared to single-language content,” they noted.

Furthermore, the spread of AI-generated content, together with a growing reliance on AI tools to edit and refine it, could trigger a phenomenon known as model collapse, in which models trained on the output of earlier models progressively degrade. The flood of machine-generated text is already affecting the quality of search results across the web. Advanced AI models such as ChatGPT, Gemini, and Claude rely heavily on vast amounts of training data scraped from the public web, regardless of potential copyright violations. As the internet becomes saturated with AI-generated content, which is frequently erroneous, the performance of these models could deteriorate significantly.

“The speed at which model collapse occurs can be quite surprising, and it often goes unnoticed,” warned Dr. Ilia Shumailov from the University of Oxford in an interview with Windows Central. “Initially, it primarily affects minority data sets—those that are poorly represented. Over time, this issue reduces the diversity of outputs, which might create a false impression of improved performance on majority data while masking deterioration in minority data. Model collapse can lead to significant repercussions.”
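The dynamic Shumailov describes, with poorly represented data disappearing first, can be seen in a toy simulation. The sketch below is only an illustration and not the method used in his research: it assumes a "model" is nothing more than a table of category frequencies, and that each generation is trained solely on a finite sample drawn from the previous generation. The category names, sample size, and function names are all hypothetical.

```python
import random
from collections import Counter


def resample_generation(probs, n_samples, rng):
    """Draw n_samples from probs, then re-estimate the frequencies from the draws.

    A category that receives zero draws is dropped and can never reappear,
    which mirrors how rare data can be lost between model generations.
    """
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {c: counts[c] / n_samples for c in categories if counts[c] > 0}


def simulate_collapse(initial_probs, n_samples=200, generations=30, seed=0):
    """Return the list of frequency tables, one per generation (including the first)."""
    rng = random.Random(seed)  # fixed seed so a run is repeatable
    history = [dict(initial_probs)]
    for _ in range(generations):
        history.append(resample_generation(history[-1], n_samples, rng))
    return history


# Hypothetical starting mix: one dominant category and several rare ones.
initial = {
    "majority": 0.90,
    "minority_a": 0.04,
    "minority_b": 0.03,
    "minority_c": 0.02,
    "minority_d": 0.01,
}

history = simulate_collapse(initial)
support_sizes = [len(p) for p in history]

for generation in (0, 10, 20, 30):
    print(generation, support_sizes[generation], history[generation])
```

Because a category that vanishes from one generation's sample cannot come back, the number of surviving categories can only stay the same or shrink. Real model collapse involves far more complex models, but the one-way loss of rare data is the same basic mechanism.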

The AWS study illustrates this selection bias with an experiment in which professional linguists categorized 10,000 randomly selected English sentences by topic. The findings showed a “dramatic change in topic distribution” when comparing sentences translated into two languages with those translated into eight or more, most notably in the “conversation and opinion” category, whose share rose from 22.5% to 40.1%, an increase of 17.6 percentage points.

This finding underscores the selection bias in the kind of content that gets translated into many languages: such content is “substantially more inclined” to fall under the “conversation and opinion” topic.

Additionally, the study showed that “translations involving more than eight languages are significantly of lower quality (6.2 Comet Quality Estimation points lower) than those involving only two languages.” An audit of 100 of these highly multi-way translations revealed that “the vast majority” originated from content farms, whose articles the researchers judged to be low quality and to require minimal expertise or effort to produce.

This trend helps explain why OpenAI’s CEO, Sam Altman, continues to insist that tools like ChatGPT need unrestricted access to copyrighted material, saying it would be “impossible” to build them without it.