The internet is filling up with machine-produced content, and the problem goes well beyond annoying spam. A recent study by Amazon Web Services (AWS) researchers found that roughly 57% of the sentences in the web corpus they analyzed had been translated into three or more languages, a pattern the authors attribute largely to machine translation. The trend, fueled by cheap and widely available AI translation tools, is degrading the quality of information available online.
The study, titled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism,” pinpoints the problem: machine translation, while convenient, often creates low-quality content that lacks depth and nuance. This is particularly true for languages with fewer resources, where machine translation is often the only option. The researchers found that machine-translated content is shorter, more predictable, and leans towards “conversation and opinion” topics compared to content published in a single language.
But the issue goes beyond low-quality content. The sheer volume of AI-generated content, coupled with the growing reliance on AI for editing and manipulation, threatens the internet’s information ecosystem. It also raises the risk of “model collapse,” in which models trained on AI-generated data progressively degrade, a phenomenon that could seriously harm the performance of advanced AI models like ChatGPT, Gemini, and Claude.
These models learn from massive amounts of data, most of it scraped from the internet. Feeding them a diet of low-quality, machine-translated content is like teaching a child to speak by exposing them only to gibberish: the accuracy and quality of their outputs decline. That is a serious concern, because AI models are increasingly used in fields ranging from education and healthcare to finance and law.
Dr. Ilia Shumailov from the University of Oxford warns about the insidious nature of model collapse: “It is surprising how fast model collapse kicks in and how elusive it can be. At first, it affects minority data—data that is badly represented. It then affects diversity of the outputs and the variance reduces. Sometimes, you observe small improvement for the majority data, that hides away the degradation in performance on minority data. Model collapse can have serious consequences.”
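The loss of variance described in the quote can be illustrated with a toy simulation. The sketch below is only an assumption-laden stand-in, not the setup used in the research: it treats a "model" as a simple bell curve that is refit, generation after generation, to a small sample drawn from its own previous version. The function name `simulate_collapse` and all parameter values are hypothetical and chosen purely for illustration.

```python
import random
import statistics


def simulate_collapse(generations=100, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Toy illustration only: real language models are far more complex.
    Returns the spread (standard deviation) recorded at each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" data
    history = [sigma]
    for _ in range(generations):
        # Each new generation sees only data produced by the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Training" here is just re-estimating the mean and the spread.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"initial spread: {history[0]:.3f}")
print(f"final spread:   {history[-1]:.6f}")
```

Because each generation estimates its spread from a small sample of the previous one, rare values in the tails tend to be lost first and the spread drifts toward zero, which loosely mirrors the reduced diversity Shumailov describes.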
The AWS researchers also looked at which kinds of content end up machine-translated at scale. They analyzed 10,000 English sentences and compared the topic distribution of content translated into just two languages with that of content translated into eight or more. The latter contained a significantly higher proportion of “conversation and opinion” topics, which points to a selection bias in what gets machine-translated. They also found that content translated into many languages was significantly lower in quality than content translated into just two.
This study serves as a stark reminder of the unintended consequences of AI’s rapid advancements. While AI tools can be beneficial, their unchecked proliferation is leading to a deterioration of the internet’s information landscape. As we move forward, it’s crucial to address the ethical concerns surrounding AI and ensure that its development prioritizes quality, accuracy, and diversity over quantity and speed.