TMTPOST--The relentless expansion of AI large language models (LLMs) in line with the Scaling Law—driven by increasing training data, computational power and parameters—may soon hit its ceiling, warns Xiao Yanghua, a professor at Fudan University's School of Computer Science and Technology, and Director of the Shanghai Key Laboratory of Data Science.
"We need to reassess the Scaling Law. Moreover, we must address these issues at their root, to stimulate the core cognitive abilities of large models and enhance their level of rationality," Xiao noted.
At the 2024 Inclusion Conference on the Bund, held from September 5 to 7, the "From Data for AI to AI for Data" forum projected that by 2026 the amount of new data generated by humans will fall short of what model training requires, and that by 2028 AI LLMs will have exhausted the available stock of human-generated data.
This raises concerns that future models, whether trained on high-quality open datasets or on information scraped from the internet, will eventually hit a "bottleneck," making it difficult to achieve artificial general intelligence (AGI) that surpasses human capabilities.
Xiao emphasized that the crux of deploying AI models lies in data engineering. Today's large models, however, consume data in a "crude" and inefficient way, far less effectively than humans process information. He further pointed out that much of the data fed into these massive models amounts to "fluff," suggesting we are already approaching the point where AI LLMs run out of genuinely useful data.
The rapid expansion of LLMs has driven data consumption to ever larger scales. Meta's open-source model Llama 3 was reportedly trained on 15 trillion tokens, more than 200 times the holdings of the Library of Alexandria, estimated at roughly 70 GB of text. OpenAI's GPT-3.5 used 45 terabytes of text data, equivalent to about 4.72 million copies of China's Four Great Classical Novels, and GPT-4 went further still, incorporating multi-modal data on the scale of hundreds of trillions of tokens.
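To give these comparisons some concrete footing, here is a back-of-envelope sketch of the arithmetic. The byte figures for the Library of Alexandria and GPT-3.5 restate numbers from this article; the conversion factor of roughly 4 bytes of text per token is an assumption for illustration, not a figure from Xiao or the forum.

```python
# Back-of-envelope conversion of the corpus sizes quoted above.
# BYTES_PER_TOKEN is an illustrative assumption; the other constants
# restate figures from the article.
LLAMA3_TOKENS = 15e12        # 15 trillion tokens reported for Llama 3
BYTES_PER_TOKEN = 4          # assumed ~4 bytes of UTF-8 text per token
ALEXANDRIA_BYTES = 70e9      # ~70 GB estimate cited for the Library of Alexandria
GPT35_BYTES = 45e12          # 45 TB of text cited for GPT-3.5
NOVEL_SET_COPIES = 4.72e6    # copies of the Four Great Classical Novels cited for GPT-3.5

llama3_bytes = LLAMA3_TOKENS * BYTES_PER_TOKEN
print(f"Llama 3 corpus: ~{llama3_bytes / 1e12:.0f} TB of raw text")
print(f"  = ~{llama3_bytes / ALEXANDRIA_BYTES:,.0f}x the Library of Alexandria estimate")
print(f"GPT-3.5 corpus: ~{GPT35_BYTES / NOVEL_SET_COPIES / 1e6:.1f} MB per copy of the four novels")
```

Under the 4-bytes-per-token assumption, the Llama 3 multiple comes out well above the "200 times" floor quoted above, and the 45 TB figure works out to roughly 9.5 MB per copy of the four novels, which is in the right range for their combined text; both results shift considerably if the assumed bytes per token or the Alexandria estimate changes.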
Despite their impressive capabilities, these models still face significant challenges, including the infamous "hallucinations" and a lack of domain-specific knowledge. OpenAI's GPT-4, for example, has an error rate of over 20%, largely owing to the scarcity of high-quality data.
Xiao highlighted that data quality determines the "intelligence ceiling" of AI LLMs. Yet, around 80% of the data used in large-scale models may be redundant or erroneous, making the refinement of data quality and diversity critical for the future development and application of AI technology.
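Xiao does not describe a specific pipeline, but the data refinement he calls for typically begins with steps such as exact deduplication and crude quality filtering. The sketch below is a minimal, hypothetical illustration of that idea; the `min_chars` threshold and the choice of SHA-256 hashing are assumptions made for the example, not techniques attributed to Xiao.

```python
import hashlib

def refine_corpus(documents, min_chars=200):
    """Toy data-refinement pass: drop exact duplicates and very short documents.

    A minimal illustration of trimming redundant or low-value text before
    training; real pipelines add near-duplicate detection, language ID,
    and learned quality classifiers on top of passes like this.
    """
    seen_hashes = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text) < min_chars:          # crude "fluff" filter
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:          # exact duplicate of a document already kept
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept
```

Passes like this are one practical way the redundant or erroneous share of a raw corpus gets trimmed before training, which is the kind of refinement the figure above points toward.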
Xiao outlined three potential pathways to improve AI LLMs through high-quality data: synthetic data, private data, and personal data.
Xiao also believes that the current reliance on expanding model parameters—often with redundant information—may soon reach its limits. He advocates for a shift towards smaller, more refined models that retain only the most critical data, allowing AI to achieve higher levels of rationality and intelligence.
He argues that the current surge in generative AI models is a bubble that will inevitably burst, because high-quality data is produced relatively slowly. The difficulty of controlling the quality of synthetic data and the inherent limits of deductive reasoning will also cap AI's potential. Even if a model's parameter count were ten or a hundred times that of the human brain, the limits of human cognition may prevent us from fully understanding or making use of such superintelligent systems.
Ultimately, Xiao sees AI as a "mirror" that forces humanity to confront what in society lacks real value and pushes people to focus on what truly matters. He concludes that AI's development will compel industries to return to their core values and drive people toward more meaningful and worthwhile pursuits.
As the AI field continues to evolve, the debate over data quality, scaling limits, and the role of synthetic data will shape the next phase of development. But one thing is certain: the road to AGI will be paved with challenges that extend beyond mere data accumulation.