Artificial general intelligence won't emerge from LLMs alone. These models are better thought of as giant simulations, where the network learns to recognize and produce the next state of a given generative system from its initial parameters. That doesn't mean they aren't game-changing - there is great power in prediction - but more than that is needed to realize artificial general intelligence: you have to actually interact with the environment, not just generalize from existing data.
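To make the "simulator" framing concrete, here's a toy sketch in plain NumPy - nothing to do with any real LLM, and the dynamical system is my own arbitrary choice. A model trained only on next-state prediction can roll a system forward from initial conditions, but it never acts on the system it imitates:

```python
# Toy illustration of prediction-as-simulation: fit a next-state predictor
# on observed transitions, then roll it forward autoregressively.
# Purely illustrative; the "system" and fitting method are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth generative system: a damped 2D rotation, x_{t+1} = A @ x_t.
theta, damping = 0.1, 0.99
A_true = damping * np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]])

# Collect (state, next state) pairs from random initial conditions.
states = rng.normal(size=(1000, 2))
targets = states @ A_true.T

# "Training": least-squares fit of a next-state predictor (stand-in for an LLM).
solution, *_ = np.linalg.lstsq(states, targets, rcond=None)
A_learned = solution.T

# Autoregressive rollout: feed the model's own predictions back in.
x = np.array([1.0, 0.0])          # initial parameters of the "simulation"
for _ in range(50):
    x = A_learned @ x             # next-state prediction, repeated
print("state after 50 predicted steps:", x)
```

The point of the toy: everything the model does is downstream of data it was shown; at no step does it probe or perturb the system it is imitating.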
I don't disagree that we're getting closer, though, and yes, that this is the reason the West has been hell-bent on denying China - and anyone else that isn't a Western vassal - access to the hardware foundations of this technology. But I also don't think geniuses are what will win this; the brute-force approach of current AI research suggests that R&D infrastructure is what will determine the winner. It's not about mathematical theories, because the level of complexity is already beyond what mathematicians can analyze. The ability to rapidly prototype, scale, and iterate on compute infrastructure matters much more.
While the idea that AGI won't emerge from language models alone has merit, recent developments suggest we're closer than many realize. The path forward isn't just about larger models, but about smarter integration of multimodal capabilities.
Take, for instance, the recent release of Qwen2-VL by Alibaba. This model, in a variant with a mere 2 billion parameters, can process and understand 20-minute videos, engaging in meaningful dialogue about their content. What's more, Alibaba is already working on an "Omni" version that will incorporate audio alongside vision and language, with applications ranging from virtual NPCs to physical robots.
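For context, here is roughly what driving that kind of video dialogue looks like today - a minimal sketch assuming the Hugging Face Transformers Qwen2-VL integration and the separate qwen-vl-utils helper package; the model id and video path are placeholders, and the exact API may have shifted since writing:

```python
# Minimal sketch: asking the 2B Qwen2-VL model about a local video via
# Hugging Face Transformers. Assumes transformers with the Qwen2-VL
# integration plus the `qwen-vl-utils` package; model id and video path
# are illustrative assumptions.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},  # placeholder path
        {"type": "text", "text": "Summarize what happens in this video."},
    ],
}]

# Build the chat prompt and extract the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the reply.
reply_ids = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(reply_ids, skip_special_tokens=True)[0])
```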
This rapid progress in multimodal AI suggests that the current attention-based transformer architecture, when extended to take in and produce multiple sensory modalities, may indeed be sufficient to support a rudimentary theory of mind. We're not just predicting next tokens anymore; we're creating systems that can perceive, understand, and interact with complex environments.
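The mechanical point is simple: once every modality is projected into the same token space, one attention stack handles all of them. A toy sketch - the dimensions and modality set here are my own assumptions, not any particular model's layout:

```python
# Minimal "extend the same transformer to more senses" sketch: each modality
# is projected into a shared embedding space and the sequences are
# concatenated before standard self-attention.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.text_proj   = nn.Embedding(32_000, d_model)   # token ids -> embeddings
        self.vision_proj = nn.Linear(768, d_model)          # patch features -> embeddings
        self.audio_proj  = nn.Linear(128, d_model)          # audio frames -> embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_ids, vision_patches, audio_frames):
        tokens = torch.cat([
            self.text_proj(text_ids),
            self.vision_proj(vision_patches),
            self.audio_proj(audio_frames),
        ], dim=1)                      # one interleaved sequence, one attention stack
        return self.encoder(tokens)

model = MultimodalFusion()
out = model(torch.randint(0, 32_000, (1, 16)),   # 16 text tokens
            torch.randn(1, 49, 768),             # 49 vision patches
            torch.randn(1, 20, 128))             # 20 audio frames
print(out.shape)  # torch.Size([1, 85, 256])
```

Real systems add positional and modality-specific encodings, causal masking, and decoder heads per output modality, but the fusion step itself is no more exotic than this.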
I posit that we already have the fundamental technology needed for AGI. What we need now is to apply these technologies in innovative ways, particularly in developing what I call an "Omni-human model." This model would extend current multimodal systems to include proprioception, motor control, and real-time environmental interaction.
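To be concrete about what "Omni-human" means at the interface level, here is a hypothetical skeleton - every name and field shape is my own assumption, offered only to show the shape of the loop:

```python
# Hypothetical interface for the "Omni-human model" idea: one policy that
# consumes vision, language, and proprioception and emits both motor
# commands and speech. All names and shapes are assumptions for illustration.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray             # (H, W, 3) camera frame
    proprioception: np.ndarray  # joint angles and velocities
    text: str                   # latest instruction or dialogue turn

@dataclass
class Action:
    joint_targets: np.ndarray   # motor control output
    utterance: str              # language output

class OmniHumanPolicy:
    def act(self, obs: Observation) -> Action:
        # Placeholder policy: hold the current pose, echo a canned reply.
        return Action(joint_targets=obs.proprioception[:7].copy(),
                      utterance="Acknowledged: " + obs.text)

# One perceive -> act step with dummy inputs; a real system would run this
# loop against a simulator or robot at control frequency.
obs = Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                  proprioception=np.zeros(14),
                  text="Wave to the camera.")
print(OmniHumanPolicy().act(obs).utterance)
```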
Imagine an AI that combines the reasoning capabilities of large language models, the perceptual understanding of vision-language models like Qwen2-VL, and the ability to learn from and interact with its environment in real time. By implementing differentiable inverse kinematics solvers and advanced reinforcement learning frameworks within this multimodal architecture, we could create AI systems that don't just predict, but actively engage with their surroundings.
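The differentiable-IK piece, at least, is not exotic. A minimal sketch for a 2-link planar arm, with the link lengths and target point chosen arbitrarily:

```python
# Differentiable inverse kinematics sketch: forward kinematics for a 2-link
# planar arm written in PyTorch, with joint angles recovered by gradient
# descent on end-effector error. Values are arbitrary assumptions.
import torch

L1, L2 = 1.0, 0.8                     # link lengths
target = torch.tensor([1.2, 0.7])     # desired end-effector position

def forward_kinematics(q):
    x = L1 * torch.cos(q[0]) + L2 * torch.cos(q[0] + q[1])
    y = L1 * torch.sin(q[0]) + L2 * torch.sin(q[0] + q[1])
    return torch.stack([x, y])

q = torch.zeros(2, requires_grad=True)            # joint angles to solve for
optimizer = torch.optim.Adam([q], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    loss = torch.sum((forward_kinematics(q) - target) ** 2)
    loss.backward()                               # gradients flow through FK
    optimizer.step()

print("joint angles:", q.detach(), "reached:", forward_kinematics(q).detach())
```

Because the whole chain is differentiable, the same error signal can in principle be backpropagated into whatever upstream network proposed the motion, which is the hook that lets motor control live inside the same training loop as perception and language.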
This Omni-human model could demonstrate unprecedented behavioral consistency, adaptability, and natural interactions in complex virtual environments. And while my focus is primarily on virtual worlds, the principles could easily extend to physical robots, bridging the gap between digital and physical realms.
The key here isn't just raw computing power or mathematical theories. It's about creating a holistic system that mirrors the human mind-body connection. With the rapid advancements we're seeing in multimodal AI, I believe we're on the cusp of achieving AGI through this integrated approach.
In essence, while traditional language models alone may not lead to AGI, the extension of these models into truly multimodal, embodied systems - as exemplified by Qwen2-VL and the proposed Omni-human model - could be the key to unlocking artificial general intelligence in the near future. We have the tools; now it's time to assemble them in the right way.