Language Models are NOT the future.
We Need More Than Language Models for Artificial General Intelligence

The rapid advancement of Large Language Models (LLMs) has undoubtedly pushed the boundaries of AI. Their ability to generate human-quality text, translate languages, and answer complex questions has fueled claims of "sparks of AGI" and led many experts and members of the public to believe that Artificial General Intelligence (AGI) is within reach. However, despite their impressive linguistic abilities, LLMs face significant challenges that block their path towards true, human-like intelligence. This article explores these challenges, drawing on insights from leading AI researchers and cognitive scientists, and advocates a future for AGI (or human-level AI, as Yann LeCun prefers to call it) that goes beyond LLMs, focusing instead on embodied, predictive models trained on video data, such as Video Joint-Embedding Predictive Architectures (V-JEPAs) or OpenAI's Sora.
The Cracks in the Facade: LLM Shortcomings
LLMs excel in the realm of formal linguistic competence, as evidenced by their performance on benchmarks like BLiMP and SyntaxGym. They demonstrate an impressive grasp of grammatical rules, hierarchical structures, and even abstract linguistic concepts. However, their functional competence – the ability to use language effectively in the real world – often falls short. This is due to several key limitations:
The Illusion of Agency and the Problem of Meaning
One of the core arguments against LLMs achieving AGI centers on the concept of "agency". In his paper "Artificial Intelligence is Algorithmic Mimicry", Johannes Jaeger argues that LLMs lack the ability to set their own goals and act autonomously. While they can process information and generate outputs based on patterns in their training data, they remain bound by the instructions and objectives imposed by their creators. Agency is a fundamental characteristic of living beings, and its absence bars LLMs from true general intelligence.
Furthermore, Jaeger highlights LLMs' lack of "embodiment". Unlike living organisms, which seamlessly integrate their physical and symbolic aspects, LLMs exist in a purely symbolic realm, detached from the physical world. Their understanding is limited to the data they have been trained on, preventing them from interacting with and learning from the physical environment in a truly embodied manner.
The Limitations of a Small World
The disconnect between LLMs and the physical world leads to another crucial distinction: the difference between "large" and "small" worlds. Jaeger draws on the work of Leonard Jimmie Savage to explain that LLMs exist in a "small world", a closed system defined by their code, data, and computational environment. Every aspect of this world is pre-defined and lacks the ambiguity and complexity of the real world. In contrast, humans and animals navigate a "large world", where information is often incomplete, uncertain, and even misleading. This necessitates the ability to define and solve problems that are ill-defined, a skill that LLMs, confined to their small world, do not possess.
Impressive Mimicry, Limited Reasoning
LLMs excel at tasks involving pattern recognition and text generation: they can translate languages, summarize information, and produce many kinds of creative content. Yet, as highlighted in "Large Language Models Still Can’t Plan", they struggle with tasks that require reasoning, planning, and understanding the consequences of actions. "Dissociating Language and Thought in Large Language Models" explores this limitation further, arguing that LLMs' formal linguistic skill does not translate into functional competence, the ability to use language effectively in real-world situations.
This discrepancy stems from LLMs' limited world models and lack of grounding in physical experience. As Yann LeCun points out in his paper, "A Path Towards Autonomous Machine Intelligence", LLMs are primarily trained on text data, resulting in a shallow understanding of the underlying reality. They excel at mimicking human language but lack the ability to reason, plan, and adapt their behavior based on real-world interactions.
He argues that their reliance on reproducing human-generated data, primarily text, without the ability to search, plan, and reason, will inevitably lead to a saturation point in performance: even with massive increases in data and model size, LLMs will plateau at or below human-level competence. Moreover, reaching that saturation point will require far more training examples than humans need, since humans learn efficiently through interaction and experience.

LeCun further critiques the LLM approach as essentially "memorizing lots of problem statements together with recipes on how to solve them". In his view, LLMs attempt to solve new problems by retrieving and applying these memorized recipes without truly understanding the context or reasoning about potential consequences. This "recipe-following" exposes their lack of genuine understanding: they may excel at mimicking human language and at tasks within their training domain, but they struggle to generalize their knowledge to new situations and to adapt their behavior as contexts change.
Embodiment: Grounding AI in the Physical World
A crucial aspect of achieving AGI is "embodiment", as emphasized by both LeCun and Jaeger. Intelligent agents should not be confined to the symbolic realm but should be able to perceive and act upon the world through sensors and actuators. This allows for learning from experience, adapting behavior based on real-time feedback, and developing a deeper understanding of the environment.
The Promise of V-JEPAs: Learning World Models from Video
Yann LeCun, Chief AI Scientist at Meta, argues that simply scaling up LLMs will not lead to AGI. Instead, he proposes a shift towards "Objective-Driven AI" systems that learn world models through self-supervised learning and interaction with the environment. Central to his vision is the Joint Embedding Predictive Architecture (JEPA), which learns abstract representations of the world, enabling prediction, reasoning, and planning even under uncertainty.
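To make the core mechanism concrete, here is a minimal, illustrative JEPA-style training step in PyTorch. It is a sketch of the general principle only, not LeCun's or Meta's implementation: the network sizes, the EMA rate, and the toy data are assumptions for illustration. The defining move is that the loss is computed between predicted and actual representations in embedding space, rather than on raw pixels.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM_IN, DIM_REP = 784, 128

def mlp(dim_in, dim_out):
    # Small stand-in network; real JEPAs use Vision Transformers.
    return nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_out))

context_encoder = mlp(DIM_IN, DIM_REP)            # trained by backprop
target_encoder = copy.deepcopy(context_encoder)   # EMA copy, never backpropped
predictor = mlp(DIM_REP, DIM_REP)                 # maps context rep -> target rep

optimizer = torch.optim.Adam(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def jepa_step(x_context, x_target, ema=0.996):
    s_context = context_encoder(x_context)
    with torch.no_grad():                 # target representations are fixed
        s_target = target_encoder(x_target)
    # The defining JEPA move: predict the target's *representation*,
    # not its raw pixels.
    loss = F.mse_loss(predictor(s_context), s_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Let the target encoder slowly track the context encoder;
    # this asymmetry helps prevent representational collapse.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(),
                            context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1.0 - ema)
    return loss.item()

# Toy usage: a corrupted view must predict the representation of the clean one.
x = torch.randn(32, DIM_IN)
print(jepa_step(x + 0.1 * torch.randn_like(x), x))
```

Because the loss lives in representation space, the model is free to discard unpredictable low-level detail, which is the key difference from generative video models that must reconstruct every pixel.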
Building upon this concept, researchers have developed V-JEPAs, which learn world models from video data. As described in "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (which introduced the image-based I-JEPA) and "Revisiting Feature Prediction for Learning Visual Representations from Video" (which extended the approach to video), V-JEPAs learn to predict the representations of masked target blocks in a video sequence from a single context block. This approach allows them to capture temporal dependencies, understand object interactions, and develop a richer understanding of how the world works than LLMs trained on static text data.
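The block-prediction setup can be sketched in a few lines. In the toy layout below, a clip is divided into a grid of spatiotemporal patches, several target blocks spanning space and time are sampled, and those patches are hidden from the context; the predictor, given the context representations and the target positions, must then output what the target encoder produces for the hidden patches. The grid and block sizes are arbitrary illustrative choices, not the configuration from the papers.

```python
import torch

T, H, W = 16, 14, 14   # clip length and spatial grid, measured in patches
grid = torch.arange(T * H * W).reshape(T, H, W)   # one index per patch

def block(t0, y0, x0, dt, dy, dx):
    """Flat patch indices of a dt x dy x dx spatiotemporal block."""
    return grid[t0:t0 + dt, y0:y0 + dy, x0:x0 + dx].reshape(-1)

# Sample a few target blocks spanning space *and* time, then hide them
# from the context so prediction requires modelling motion, not copying.
targets = torch.unique(torch.cat([block(4, 2, 2, 8, 6, 6),
                                  block(0, 8, 8, 16, 4, 4)]))
visible = torch.ones(T * H * W, dtype=torch.bool)
visible[targets] = False
context = visible.nonzero().squeeze(1)

# Training then mirrors the JEPA step sketched earlier: encode the context
# patches, and predict the target encoder's representations at the target
# positions.
print(f"{context.numel()} context patches, {targets.numel()} target patches")
```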
By learning from video data, V-JEPAs inherently incorporate aspects of embodiment: they capture the dynamic nature of the physical world and develop a richer understanding of object interactions and temporal dependencies. This addresses one of the key limitations of LLMs and makes V-JEPAs better suited to tasks that require planning, reasoning about actions, and understanding the consequences of those actions, holding significant promise for advancing towards human-level AI.
Advantages of V-JEPAs over LLMs
Learning from Dynamic Input: Video data provides a richer source of information about the world compared to static text. V-JEPAs can learn about object movements, interactions, and cause-and-effect relationships, developing a more comprehensive understanding of how the world works.
Incorporating Embodiment: By learning from video, V-JEPAs implicitly capture aspects of embodiment, as they learn to predict and reason about changes in the physical world. This grounding in physical experience is crucial for achieving true intelligence.
Potential for Planning and Reasoning: The world models learned by V-JEPAs can be used for planning and reasoning about actions and their consequences (see the sketch after this list). This addresses a major limitation of LLMs, which struggle with tasks requiring multi-step reasoning and planning.
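As a hedged illustration of that last point, the sketch below shows one simple way a learned world model supports planning: sample candidate action sequences, roll each forward through the model in latent space, and execute the first action of the sequence whose predicted outcome lands closest to the goal (a random-shooting form of model-predictive control). The world_model here is a toy stand-in function, not a trained V-JEPA, and none of the specifics below come from the cited papers.

```python
import torch

DIM_S, DIM_A, HORIZON, N_CANDIDATES = 16, 4, 10, 256

torch.manual_seed(0)
A = 0.1 * torch.randn(DIM_S, DIM_S)   # toy latent dynamics (stand-ins for
B = 0.1 * torch.randn(DIM_A, DIM_S)   # a learned predictor network)

def world_model(s, a):
    """Predict the next latent state given state s and action a."""
    return torch.tanh(s @ A + a @ B)

def plan(s0, goal):
    """Random-shooting planner: simulate candidates, pick the cheapest."""
    actions = torch.randn(N_CANDIDATES, HORIZON, DIM_A)
    s = s0.expand(N_CANDIDATES, DIM_S)
    for t in range(HORIZON):
        s = world_model(s, actions[:, t])
    cost = ((s - goal) ** 2).sum(dim=1)   # distance to goal in latent space
    best = cost.argmin()
    return actions[best, 0], cost[best].item()

s0, goal = torch.randn(1, DIM_S), torch.randn(1, DIM_S)
first_action, predicted_cost = plan(s0, goal)
print("first planned action:", first_action)
print("predicted cost:", predicted_cost)
```

An LLM has no analogue of this loop: it cannot simulate the consequences of candidate actions before committing to one, which is precisely the capability that world-model-based systems aim to provide.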
A Future Beyond LLMs: Towards Embodied, Predictive Intelligence
While LLMs have played a significant role in advancing AI research, they are not the ultimate solution for achieving AGI. Their limitations in reasoning, planning, and understanding the physical world necessitate a shift towards embodied, predictive models that can learn and adapt based on real-world interactions.
V-JEPAs, with their ability to learn world models from video data, offer a promising path forward. By incorporating principles of embodiment, self-supervised learning, and predictive modeling, V-JEPAs can bridge the gap between language and thought, enabling AI systems to interact with the world in a more meaningful and intelligent manner. This approach, combined with further research into areas like core knowledge, intrinsic motivation, and hierarchical planning, holds the potential to unlock the next generation of AI, moving us closer to the realization of true Artificial General Intelligence.
Bibliography
Artificial Intelligence is Algorithmic Mimicry: Why artificial “agents” are not (and won’t be) proper agents by Johannes Jaeger. This paper argues that the likelihood of LLMs achieving agency, and thus AGI, is infinitesimally small due to fundamental differences in organization and embodiment between living and algorithmic systems.
A Path Towards Autonomous Machine Intelligence by Yann LeCun. This paper proposes a novel architecture for intelligent agents that emphasizes the importance of world models, self-supervised learning, and hierarchical planning to achieve AGI.
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture by Mahmoud Assran et al. This paper introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images, as a step towards learning world models.
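Revisiting Feature Prediction for Learning Visual Representations from Video by Adrien Bardes et al. This paper introduces V-JEPA, a family of vision models trained purely by predicting the feature representations of masked spatiotemporal regions of video.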
Large Language Models Still Can’t Plan (A Benchmark for LLMs on Planning and Reasoning about Change) by Karthik Valmeekam et al. This paper presents a benchmark for evaluating the planning and reasoning capabilities of LLMs and shows that they exhibit subpar performance on tasks requiring reasoning about actions and change.
Dissociating Language and Thought in Large Language Models by Kyle Mahowald et al. This preprint distinguishes between formal linguistic competence and functional linguistic competence in LLMs and argues that while they excel at the former, they often fall short on tasks requiring the latter.
Presentation at Harvard University by Yann LeCun (lecun-20240328-harvard.pdf, March 2024). This presentation discusses the limitations of current AI systems, particularly LLMs, and advocates a shift towards Objective-Driven AI systems that learn world models and have the ability to reason and plan.