Why Some AI Can Reason, and Others Can’t

Author: Denis Avetisyan


New research reveals a fundamental architectural limitation of decoder-only models in complex causal reasoning and in adapting to distributional shift.

The study demonstrates that BERT, with its encoder-only architecture, exhibits consistent geometric transformations throughout multi-step reasoning, preserving logical structure, while decoder-only models such as Qwen display increasing curvature drift, suggesting a breakdown in that same logical consistency. These findings corroborate the geometric framework for understanding reasoning processes described by Zhou et al. (2025).

Encoder-based architectures demonstrably outperform decoder-only large language models in tasks demanding compositional depth and reliable logical inference, particularly under distributional shift.

Despite recent advances in large language models, reliable performance on tasks demanding compositional reasoning remains a significant challenge. This is the central question addressed in ‘Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models’, a study investigating architectural biases in causal inference. Our findings demonstrate that encoder-based models consistently outperform decoder-only architectures in scenarios requiring multi-hop logical deduction, particularly when faced with distributional shifts. This raises a critical question: can architectural innovations focused on latent space representation offer a pathway towards more robust and cost-effective artificial general intelligence?


Unraveling Intelligence: Beyond Pattern Recognition

The pursuit of truly robust artificial intelligence extends far beyond the capabilities of simple pattern recognition. While current AI excels at identifying correlations within datasets, genuine intelligence necessitates an understanding of causation – the relationship between cause and effect. An AI limited to pattern recognition may reliably predict that a specific input leads to a certain output, but it cannot explain why, nor can it adapt effectively when faced with novel situations or interventions. Consequently, systems lacking causal reasoning are brittle and prone to failure when confronted with changes in their environment, highlighting the critical need for AI that can not only observe but also understand the underlying mechanisms driving the phenomena it encounters.

Logical deduction serves as the bedrock of causal reasoning, representing a formalized system by which conclusions are reliably drawn from established premises. This process isn’t simply about identifying correlations; rather, it demands a structured approach where inferences follow necessarily from given information. Consider a syllogism – a classic example of deduction – where a general statement, combined with a specific instance, yields a logically certain conclusion. This framework, often expressed using symbolic logic with statements and quantifiers, allows for the unambiguous evaluation of arguments and the elimination of subjective interpretation. By adhering to strict rules of inference – such as modus ponens or modus tollens – deduction provides a robust mechanism for establishing the validity of causal claims, distinguishing them from mere speculation or observed patterns. The power of deduction lies in its guarantee: if the premises are true, the conclusion must also be true, forming a crucial component in building truly intelligent systems.
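To make these inference rules concrete, the toy Python sketch below (illustrative only, not drawn from the paper) encodes modus ponens and modus tollens as simple checks over known-true and known-false propositions.

```python
# Illustrative sketch (not from the paper): modus ponens and modus tollens
# expressed as simple checks over known-true and known-false propositions.

def modus_ponens(true_facts, rule):
    """Rule is (p, q), read as 'if p then q'. If p is known true, conclude q."""
    p, q = rule
    return q if p in true_facts else None

def modus_tollens(false_facts, rule):
    """Rule is (p, q). If q is known false, conclude that p is false."""
    p, q = rule
    return f"not {p}" if q in false_facts else None

rule = ("it_rains", "ground_is_wet")           # "if it rains, the ground is wet"
print(modus_ponens({"it_rains"}, rule))        # -> ground_is_wet
print(modus_tollens({"ground_is_wet"}, rule))  # -> not it_rains
```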

The capacity for complex reasoning isn’t simply about possessing individual logical skills, but rather the ability to connect them sequentially – a process known as multi-hop composition. This involves chaining multiple deductive steps together, where the conclusion of one becomes the premise for the next, and so on, until a final answer is reached. Consider a scenario requiring inference about object permanence; a system must not only understand that an object exists, but also trace its trajectory even when hidden from view. This demands a series of connected inferences – ‘the object was here, it moved to there, even though it’s obscured, it still exists’ – showcasing how reasoning transcends single-step logic. Successfully implementing multi-hop composition is therefore critical for artificial intelligence systems aiming to tackle nuanced problems and exhibit genuinely intelligent behavior, as it allows for the exploration of more intricate relationships and the derivation of conclusions from complex, layered information.
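One way to picture multi-hop composition is as forward chaining over implication rules, where each derived fact becomes a premise for the next step. The sketch below is a minimal illustration under that framing, not the paper's benchmark construction.

```python
# Illustrative sketch: multi-hop composition as forward chaining, where each
# derived fact becomes a premise for the next deductive step.

def forward_chain(facts, rules, goal, max_hops=10):
    """Repeatedly apply (p, q) rules ('if p then q') until the goal is derived
    or no new facts appear."""
    known = set(facts)
    for _ in range(max_hops):
        derived = {q for p, q in rules if p in known and q not in known}
        if not derived:
            break
        known |= derived
        if goal in known:
            return True
    return goal in known

facts = {"object_at_A"}
rules = [("object_at_A", "object_moved_to_B"),
         ("object_moved_to_B", "object_behind_screen"),
         ("object_behind_screen", "object_still_exists")]   # object-permanence chain
print(forward_chain(facts, rules, "object_still_exists"))   # True, via three hops
```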

Our approach involves creating a training dataset alongside out-of-distribution test sets to evaluate zero- and few-shot inference, as well as fine-tuning performance across various model architectures including BERT, BART, and Qwen3-1.7B.

Deconstructing the Architectures of Reason

Encoder-based models, such as BERT, utilize a multi-layer transformer architecture to process input text and generate contextualized embeddings. This process involves projecting the input tokens into a high-dimensional vector space, known as the latent space, where semantic relationships between words are captured. The transformer layers employ self-attention mechanisms, allowing each token to attend to all other tokens in the input sequence, thereby establishing contextual understanding. The resulting latent representation effectively encodes the meaning of the input, facilitating downstream tasks like text classification, named entity recognition, and question answering. The dimensionality of this latent space is a key hyperparameter, typically ranging from several hundred to over a thousand dimensions, influencing the model’s capacity to capture nuanced semantic information.
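As a concrete illustration, the snippet below extracts contextual embeddings from an encoder-only model. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; the paper's exact configuration may differ.

```python
# Minimal sketch of extracting contextual embeddings from an encoder-only model.
# Assumes the Hugging Face `transformers` library and the public
# "bert-base-uncased" checkpoint; the paper's exact setup may differ.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("If it rains, the ground is wet.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# last_hidden_state holds one 768-dimensional contextual vector per input token
print(outputs.last_hidden_state.shape)   # e.g. torch.Size([1, 10, 768])
print(len(outputs.hidden_states))        # 13 = embedding layer + 12 transformer layers
```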

Decoder-only models, like Qwen3-1.7B, operate by predicting the next token in a sequence given the preceding tokens, a process repeated recursively to generate outputs. This autoregressive approach enables sequential reasoning as the model maintains an internal state representing the generated text thus far, influencing subsequent token predictions. Each newly generated token becomes part of the input context for the following prediction, allowing the model to build coherent and contextually relevant responses. The probability of each token is determined by a learned distribution conditioned on the preceding sequence, and sampling strategies, such as top-$p$ or temperature scaling, can be employed to control the diversity and quality of the generated output.
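The sketch below shows a single decoding step with temperature scaling and top-$p$ (nucleus) sampling over a toy logit vector; it illustrates the mechanism described above rather than Qwen's actual decoding code.

```python
# Minimal sketch of one autoregressive decoding step with temperature and
# top-p (nucleus) sampling; illustrative only, not Qwen's decoding implementation.
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature                    # temperature scaling
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    # smallest set of top tokens whose cumulative probability covers top_p
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    nucleus = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=nucleus))

vocab_logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])   # toy distribution over 5 tokens
print(sample_next_token(vocab_logits))                  # usually token 0 or 1
```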

Encoder-decoder models, such as BART, leverage the contextual understanding capabilities of encoders – which project input into a latent space – and the sequential generation abilities of decoders to perform both comprehensive input analysis and nuanced output creation. While these models offer a combined advantage, recent performance benchmarks indicate that finetuned encoder-based models are highly competitive, achieving accuracy rates of 65-76% on the Natural Language dataset. This suggests that, despite the architectural benefits of encoder-decoder structures, optimized encoder-only models can deliver comparable, and in some cases superior, results on specific reasoning tasks.
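For reference, an encoder-decoder model such as BART can be exercised end to end with a few lines. The example assumes the Hugging Face transformers library and the public facebook/bart-base checkpoint rather than the paper's finetuned weights.

```python
# Sketch of an encoder-decoder model combining both roles: the encoder builds a
# latent representation of the input, the decoder generates text from it.
# Assumes the Hugging Face `transformers` library and the public
# "facebook/bart-base" checkpoint, not the paper's finetuned weights.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("If A implies B and A holds, then <mask>.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```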

Decoder-only models demonstrate varying performance across depth-wise metrics when applied to the NL Dataset.

The Logic of ‘And’: Conjunctive Control and Intelligent Systems

Conjunctive control represents a critical reasoning capability wherein a positive result is contingent upon the simultaneous satisfaction of multiple conditions. Unlike disjunctive or inclusive reasoning, where only one or some conditions need be met, conjunctive problems demand all specified criteria are true for a successful outcome. This type of reasoning is prevalent in diverse real-world scenarios, including legal compliance, medical diagnosis, and logistical planning; for example, a loan application might only be approved if the applicant meets requirements for income, credit score, and employment history. The necessity of evaluating multiple predicates simultaneously makes conjunctive control a challenging task for artificial intelligence systems, requiring robust logical inference capabilities to ensure accurate and reliable decision-making.
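The loan-approval example above maps directly onto code: a conjunctive decision is positive only when every predicate holds. The thresholds below are illustrative, not taken from the paper's dataset.

```python
# Toy sketch of conjunctive control: the outcome is positive only when every
# condition is satisfied. Thresholds are illustrative, not from the paper.
def approve_loan(applicant):
    conditions = [
        applicant["income"] >= 40_000,        # income requirement
        applicant["credit_score"] >= 650,     # credit requirement
        applicant["years_employed"] >= 2,     # employment requirement
    ]
    return all(conditions)                    # conjunctive: every predicate must be True

print(approve_loan({"income": 55_000, "credit_score": 700, "years_employed": 3}))  # True
print(approve_loan({"income": 55_000, "credit_score": 600, "years_employed": 3}))  # False
```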

Conjunctive control relies on logical deduction to ascertain the truth of a conclusion based on multiple premises. This process necessitates evaluating whether all conditions within a conjunction are satisfied; if any single condition is false, the entire conjunction is considered false. The validity of inferences within conjunctive control is directly tied to the application of deductive reasoning rules, such as modus ponens and modus tollens, which ensure that conclusions follow logically from the given premises. Consistent application of these rules guarantees the reliability of the system’s output, preventing erroneous conclusions arising from flawed reasoning or incomplete information. Without sound deductive principles, a system attempting conjunctive control cannot accurately assess the combined truth value of multiple conditions.

The integration of conjunctive control into Large Language Models (LLMs) is crucial for developing AI systems perceived as reliable and trustworthy. Current evaluations demonstrate that finetuned encoder-based models achieve an Area Under the Curve (AUC) of 0.60 to 0.76 on tasks requiring conjunctive reasoning, indicating a significant ability to correctly discriminate between valid and invalid inferences. Further analysis reveals that models like BERT exhibit greater and more consistent curvature similarity across increasing reasoning depths; this suggests these models maintain a more stable internal representation of the logical transformations necessary for accurate conjunctive control, unlike models that may degrade in performance with complex chains of reasoning.
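For context, ROC AUC measures how well a model's scores separate valid from invalid inferences, and one plausible proxy for layer-wise 'curvature' is the cosine similarity between successive hidden-state differences; the paper's exact curvature definition may differ from this sketch.

```python
# Sketch of the evaluation metric (ROC AUC over valid/invalid inferences) and one
# plausible proxy for "curvature" between consecutive layer representations; the
# paper's exact curvature definition may differ.
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([1, 0, 1, 1, 0, 0])              # 1 = valid inference, 0 = invalid
scores = np.array([0.9, 0.3, 0.7, 0.6, 0.4, 0.5])  # model confidence in validity
print(roc_auc_score(labels, scores))               # discrimination quality; 0.5 = chance

def step_curvature(h_prev, h_mid, h_next):
    """Cosine similarity between successive hidden-state differences: values near 1
    mean the representation keeps moving in a consistent direction across layers."""
    u, v = h_mid - h_prev, h_next - h_mid
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

h = [np.random.randn(768) for _ in range(3)]       # three consecutive layer states (toy)
print(step_curvature(*h))                          # near 0 for random directions
```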

The study’s findings regarding compositional causal reasoning echo a sentiment shared by the mathematician Carl Friedrich Gauss: “If others would think as hard as I do, they would not have so little to think about.” This isn’t merely about intellectual effort, but about the architecture of thought itself. The paper highlights how decoder-only models struggle with the ‘curvature’ inherent in complex logical problems, a challenge encoder-based systems navigate with greater ease. It suggests that a model’s ability to reliably dissect and reassemble information – much like a rigorous mathematical proof – isn’t simply a matter of scale, but of foundational design. The limitations revealed in decoder-only models aren’t shortcomings to be ‘fixed’, but rather invitations to explore alternative architectures that better mirror the underlying mechanics of logical inference.

What’s Next?

The demonstrated disparity in causal reasoning capabilities isn’t merely a performance gap; it’s an architectural confession. Decoder-only models, impressive as they are in generating continuations, reveal a fundamental limit when pressed to truly understand the relationships governing those continuations. The study highlights that scaling alone won’t solve a problem rooted in how information is initially processed. Future work must move beyond assessing what these models get wrong, to dissecting why the encoder’s initial state-building phase proves so critical for robust inference.

Distributional shifts, predictably, serve as the stress test. A system that hasn’t genuinely internalized causal structure will inevitably falter when the surface statistics change. This begs the question: can a decoder-only architecture be retrofitted with mechanisms to simulate an encoding step, or is the very principle of autoregressive generation inherently biased against compositional reasoning? The pursuit of “causal priors” embedded within model weights feels increasingly urgent – a search for the minimal scaffolding necessary for logical competence.

Ultimately, the best hack is understanding why it worked. The encoder’s advantage isn’t just about better performance; it’s about revealing the inherent trade-offs in these architectures. Every patch, every architectural innovation, is a philosophical confession of imperfection. The next frontier isn’t simply building bigger models, but building models that demonstrably know what they don’t know.


Original article: https://arxiv.org/pdf/2512.10561.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-15 02:32