Author: Denis Avetisyan
New research explores how large language models perform on tasks requiring strict adherence to orthographic rules, revealing surprising weaknesses despite their fluency.

The study investigates alignment between human difficulty and model performance on orthographically constrained generation tasks, highlighting architectural importance over model scale.
While large language models excel at fluent text generation, satisfying strict structural constraints, such as those imposed by word puzzles, remains a notable challenge. This research, titled ‘Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models’, systematically evaluates 28 model configurations across three families, revealing that architectural design matters more than parameter scaling for constraint satisfaction. Notably, models tend to prioritize statistically common letter sequences over valid but unusual orthography, a bias that persists even on puzzles that human solvers resolve easily. Could architectural innovations, beyond simply increasing scale, unlock more robust and human-aligned constraint satisfaction in future language models?
The Illusion of Fluency: Why Precision Falters
While Large Language Models demonstrate a remarkable capacity for generating human-quality text, their strength lies in statistical likelihood rather than absolute precision. This creates a paradox: models proficient at crafting fluent prose often falter when confronted with tasks requiring unwavering adherence to specific rules, such as orthographic constraints or formal grammar. The very mechanism enabling their natural language fluency – predicting the most probable continuation of a sequence – can lead to errors when a definitive, non-probabilistic answer is needed. For instance, a model asked to generate a word following strict letter restrictions may prioritize common word formations over rule compliance, revealing a fundamental tension between stylistic fluidity and rigid correctness. This limitation poses a significant challenge for applications where accuracy is paramount, demanding innovative approaches to bridge the gap between generative power and constrained output.
Large language models, while remarkably adept at producing human-like text, operate fundamentally by predicting the most probable continuation of a given sequence. This probabilistic core, however, introduces a critical limitation when faced with tasks demanding strict adherence to predefined rules. Instead of prioritizing absolute correctness – such as consistently following orthographic constraints or logical rules – these models tend to favor outputs that are statistically common in the training data. Consequently, a response might be grammatically fluent and contextually relevant, yet still violate a specific constraint because a statistically less likely, but correct, option was overshadowed by the model’s tendency to maximize distributional plausibility. This inherent trade-off between fluency and fidelity presents a significant hurdle in applications where precision is paramount, revealing a key area for ongoing research and development.
The ability of Large Language Models to adhere to rigid constraints – beyond simply generating statistically probable text – unlocks a wealth of practical applications currently beyond their reach. Consider code generation, where syntactical correctness is paramount; a single misplaced character renders the entire program unusable. Similarly, constrained writing tasks like creating crosswords, Sudoku puzzles, or even haikus demand absolute fidelity to specific rules. These scenarios aren’t merely about stylistic nuance; they require logical precision. The current limitations in handling such constraints represent a significant gap in LLM capabilities, hindering their potential in domains where correctness, not just fluency, is the primary objective and demonstrating a need for novel approaches to model training and architecture.

A Benchmark of Boundaries: The Spelling Bee Task
The Spelling Bee task is utilized as a benchmark for constraint satisfaction by presenting models with a defined set of letters – typically seven – and requiring the generation of all valid English words that can be formed using those letters, with the mandatory inclusion of a central letter. This task’s rigor stems from its combination of lexical recall – accessing a broad vocabulary – and adherence to strict orthographic constraints, namely valid word formation and letter usage. Performance is measured by the percentage of valid words correctly identified from the total possible solutions, providing a quantifiable metric for evaluating a model’s ability to satisfy multiple, simultaneously active constraints. The task inherently avoids trivial solutions by requiring all generated words to be of a minimum length, usually four letters.
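The validity rules above (a seven-letter set, a mandatory central letter, a minimum length of four, letters reusable) are simple to express in code. The sketch below is a minimal illustration; the letter set, center letter, and candidate words are invented assumptions, not data from the study.

```python
# Minimal sketch of the Spelling Bee validity check described above.
# The letter set and candidates are hypothetical, for illustration only.

def is_valid_spelling_bee_word(word: str, letters: set, center: str) -> bool:
    """A word is valid if it is at least 4 letters long, uses only the
    allowed letters (repetition permitted), and contains the center letter."""
    word = word.lower()
    return len(word) >= 4 and center in word and set(word) <= letters

letters = {"a", "p", "l", "n", "e", "t", "c"}  # hypothetical 7-letter set
center = "a"

candidates = ["plane", "placate", "pelt", "pace", "ant", "apple"]
valid = [w for w in candidates if is_valid_spelling_bee_word(w, letters, center)]
print(valid)  # "pelt" fails the center-letter rule; "ant" is too short
```

Scoring a model against such a checker then reduces to comparing its generated list with the full solution set for the puzzle.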
The Spelling Bee task provides a quantifiable metric for evaluating a language model’s performance by assessing its capacity to integrate two distinct cognitive skills: lexical retrieval and orthographic rule adherence. Successful completion requires not only identifying valid words within a given letter set, but also ensuring those words conform to established spelling conventions. The metric derived from this task differs from simple accuracy measures; it allows for the evaluation of the trade-off between generating a high volume of potential words and maintaining a high degree of orthographic correctness, providing a nuanced understanding of a model’s linguistic competence beyond basic vocabulary knowledge.
Evaluation of models utilizes a zero-shot learning paradigm, meaning the models are not pre-trained or fine-tuned on any data specifically related to the Spelling Bee task or similar word puzzle formats. This approach is critical for isolating a modelās generalized constraint satisfaction abilities, as it eliminates performance inflation potentially caused by exposure to task-specific examples during training. By assessing performance without prior task adaptation, we gain insight into the modelās inherent capacity to apply learned linguistic and orthographic rules to novel problems, providing a more objective measure of its core capabilities and avoiding biases introduced by supervised learning on the target task.

Architectural Echoes: Insights from Model Benchmarking
A comparative benchmarking study was conducted using several Large Language Models (LLMs) to evaluate performance on the Spelling Bee Task. Models included in the assessment were Claude Haiku 4.5, GPT-5-mini, and the Qwen3 family of models. Results indicated a range of success rates across these LLMs, demonstrating variability in their ability to solve the task. The specific performance levels observed for each model are detailed in subsequent sections, alongside quantitative metrics. This initial evaluation established a baseline for comparison and highlighted the potential for performance differentiation among different LLM architectures and training paradigms.
The Qwen3 language model family, and specifically its implementations utilizing a Mixture of Experts (MoE) architecture, has demonstrated notable performance gains on the Spelling Bee Task. MoE models employ multiple ‘expert’ networks, selectively activating only a subset for each input token, which increases model capacity without a proportional increase in computational cost. This architectural approach allows Qwen3 variants to more effectively satisfy the constraints inherent in the Spelling Bee task – identifying valid words from a limited character set – by dedicating specialized network components to different aspects of the problem. Initial benchmarking suggests that this specialization contributes to improved performance compared to dense models of comparable size.
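As a rough illustration of the routing idea, the sketch below implements top-k gating over a handful of toy experts in plain Python. The expert count, gate weights, and k=2 are arbitrary assumptions; real MoE layers operate on learned tensors inside a transformer, but the selection-and-mix structure is the same.

```python
# Toy top-k Mixture-of-Experts routing: score all experts, run only the
# top-k, and mix their outputs by renormalized gate probabilities.
# All weights and "experts" here are fabricated for illustration.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token_vec, experts, gate_weights, k=2):
    """Route one token through the k highest-scoring experts."""
    # Gate logits: one score per expert (dot product with a gating vector).
    logits = [sum(w * x for w, x in zip(gw, token_vec)) for gw in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Only the selected experts run, keeping compute sub-linear in expert count.
    outs = {i: experts[i](token_vec) for i in top}
    return [sum((probs[i] / norm) * outs[i][d] for i in top)
            for d in range(len(token_vec))]

# Four toy "experts": each just scales its input by a different factor.
experts = [lambda v, s=s: [s * x for x in v] for s in (0.5, 1.0, 2.0, 3.0)]
gate_weights = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.4], [0.2, 0.8]]
out = moe_forward([1.0, 0.5], experts, gate_weights, k=2)
print(out)  # a blend of the two highest-scoring experts' outputs
```

The design point the paper's results hinge on is visible even here: capacity (four experts) grows without a matching growth in per-token compute (only two run).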
Model performance on the Spelling Bee Task was quantitatively assessed using precision and recall metrics. Precision measures the proportion of correctly identified valid words among all words predicted by the model, while recall indicates the proportion of actual valid words that were correctly identified. Analysis reveals that proprietary language models consistently outperform open-source models, achieving F1 scores – the harmonic mean of precision and recall – that are 2.0 to 2.2 times higher. This disparity suggests significant differences in the ability of these models to both accurately generate valid words and comprehensively identify all possible solutions within the constraints of the task.
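The precision, recall, and F1 computation described above can be sketched directly over word sets. The gold and predicted lists below are invented examples, not outputs from any benchmarked model.

```python
# Precision/recall/F1 over word sets, as used to score Spelling Bee output.
# The example gold and predicted sets are fabricated for illustration.

def prf1(predicted: set, gold: set):
    """Precision: fraction of predictions that are valid words.
    Recall: fraction of valid words the model found.
    F1: harmonic mean of the two."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = {"plane", "placate", "pace", "apple", "canal"}      # all valid solutions
predicted = {"plane", "pace", "apple", "plant", "panel"}   # model output

p, r, f = prf1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

A 2.0–2.2x F1 gap, as reported between proprietary and open-source models, means the better models are simultaneously generating fewer invalid words and covering more of the solution set.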

The Limits of Attention: Navigating the ‘Thinking Budget’
Constraint satisfaction – the process of finding solutions that adhere to specific rules – is fundamentally limited by the computational resources allocated to reasoning, a concept referred to as the ‘Thinking Budget’. This budget dictates how extensively a system can explore potential solutions and evaluate their validity. A larger Thinking Budget enables more exhaustive searches, allowing for the consideration of a broader range of possibilities and ultimately improving performance on complex tasks. Conversely, a restricted budget forces simplification and may lead to overlooking viable solutions, particularly in scenarios with numerous constraints or intricate relationships. Consequently, the effective allocation and utilization of computational resources represent a critical factor in determining the success of any system engaged in reasoning and problem-solving.
The capacity to effectively reason often hinges on the computational resources available, a principle reflected in the concept of a ‘Thinking Budget’. Models allocated a larger budget – meaning greater processing power and time – demonstrate an enhanced ability to explore a more extensive range of potential solutions during problem-solving. This expanded search space is critical for constraint satisfaction tasks, as it allows the model to systematically evaluate numerous possibilities and ultimately identify valid answers that might be missed with limited resources. Essentially, a generous ‘Thinking Budget’ functions as a broader cognitive toolkit, empowering the model to overcome challenges and achieve higher levels of performance by considering a greater diversity of options before arriving at a conclusion.
Analysis of performance on the Spelling Bee task reveals a notable link between a model’s ability to utilize its computational resources – its ‘thinking budget’ – and its success in finding valid solutions. The extent to which models engaged these resources correlated with task difficulty, but this relationship was markedly stronger in proprietary models, exhibiting correlation coefficients of 0.36 to 0.38, compared to open-source models, which showed a weaker correlation ranging from 0.24 to 0.26. This suggests that proprietary models are not only capable of leveraging more computational power, but also demonstrate a more nuanced and human-like approach to allocating these resources when tackling complex linguistic challenges, potentially mirroring the cognitive effort humans exert on similar tasks.
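Coefficients of this kind are plausibly Pearson correlations between per-puzzle reasoning effort (e.g., tokens spent) and puzzle difficulty. A minimal sketch of that analysis, with data points fabricated purely for illustration:

```python
# Pearson correlation between puzzle difficulty and reasoning effort.
# The difficulty ratings and token counts below are invented, not the
# study's measurements.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

difficulty = [1, 2, 3, 4, 5, 6]               # hypothetical difficulty ratings
tokens_used = [120, 180, 150, 300, 280, 400]  # hypothetical reasoning tokens
print(round(pearson_r(difficulty, tokens_used), 2))
```

A higher coefficient here would mean the model scales its effort with difficulty, which is the human-like allocation pattern the proprietary models are reported to show more strongly.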

Echoes of Cognition: Towards Human-Level Constraint Satisfaction
A comparative analysis of model performance against human difficulty levels in the Spelling Bee task reveals a significant discrepancy, highlighting areas ripe for future development. While current models demonstrate proficiency on simpler iterations of the puzzle, performance sharply declines as word length and complexity increase – a degradation substantially more pronounced than that observed in human players. Specifically, the study indicates that smaller models experienced up to an 82-fold performance reduction when confronted with longer words, contrasted with a mere 1.3-fold decrease in human accuracy. This disparity underscores the need for continued research into more robust architectures and refined reasoning capabilities, suggesting that achieving human-level constraint satisfaction necessitates a deeper understanding of how humans approach these complex linguistic challenges and a corresponding ability to replicate those strategies within artificial systems.
Addressing the disparity between artificial and human constraint satisfaction necessitates a focused effort on both model architecture and computational resource management. Current approaches suggest that simply scaling model size isn’t a panacea; instead, innovative architectural designs – potentially incorporating mechanisms for more efficient knowledge representation and reasoning – hold significant promise. Simultaneously, strategic resource allocation, including optimized memory usage and parallel processing, can dramatically improve performance without necessarily requiring exponentially larger models. The goal is not merely to brute-force solutions, but to create systems capable of intelligent search and deduction, mirroring the human ability to quickly identify valid solutions within complex constraints, even with limited cognitive resources.
Investigations into constraint satisfaction tasks, such as the Spelling Bee game, reveal a significant disparity in performance scaling between artificial intelligence models and human cognition. While larger models demonstrate improved capabilities, smaller models experience a disproportionate performance drop – up to 82 times greater degradation on longer words compared to the 1.3x reduction observed in human players. This suggests that simply increasing model size isn’t a complete solution; instead, future research must prioritize architectural innovations and strategies for efficient resource allocation. Understanding how humans maintain reasoning proficiency across varying task complexity – and replicating those mechanisms in artificial systems – will be crucial for developing AI capable of true human-level constraint satisfaction, rather than merely brute-force computation.
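The degradation factors quoted above are simple ratios of accuracy on easier items to accuracy on harder ones. The toy accuracies below are invented to reproduce the reported 82x and 1.3x figures and are not the study's actual measurements.

```python
# Degradation factor = accuracy on short words / accuracy on long words.
# The accuracy values are hypothetical, chosen only to yield the reported
# ratios for a small model (82x) and for human solvers (1.3x).

def degradation_factor(short_acc: float, long_acc: float) -> float:
    return short_acc / long_acc

small_model = degradation_factor(short_acc=0.41, long_acc=0.005)  # ~82x drop
human = degradation_factor(short_acc=0.91, long_acc=0.70)         # ~1.3x drop
print(round(small_model, 1), round(human, 1))
```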
The study illuminates a critical tendency within large language models: a prioritization of distributional plausibility over structural validity. This echoes a deeper truth about complex systems; they don’t simply solve problems, they adapt to them, often favoring what appears statistically likely over what is logically sound. As Alan Turing observed, ‘The imitation game’ – now known as the Turing Test – isn’t about intelligence, but about indistinguishability. The models, much like the entities within Turing’s game, demonstrate a proficiency in mimicking patterns, even when those patterns mask underlying inconsistencies. The research suggests these systems, when faced with orthographic constraints, aren’t necessarily understanding the rules, but rather predicting likely sequences. It isn’t about correctness, but about convincing resemblance – a prophecy of future failure when true reasoning is required.
The Path Ahead
The pursuit of constrained generation reveals, once again, that scale alone does not confer understanding. These models demonstrate a predictable bias: distributional plausibility, a mimicry of statistical adjacency, consistently outweighs structural validity. It is a comfortable illusion, and one easily exploited. The discrepancies observed between architectural families suggest that the shape of the system – the pathways carved for information – matters more than the sheer volume of data flowing through them. This is not a surprising revelation, merely a confirmation that architecture isn’t structure; it’s a compromise frozen in time.
Future work will undoubtedly explore increasingly complex constraint scenarios, but the fundamental challenge remains: how to imbue these systems with a genuine appreciation for rules, rather than a probabilistic approximation thereof. The focus should shift from achieving constraint satisfaction to modeling the cognitive processes underlying it. Technologies change, dependencies remain. The allure of ever-larger models obscures the enduring need for principled, cognitively-informed design.
Ultimately, this research serves as a quiet reminder: these are not problem-solvers, but pattern-completers. They excel at predicting what should come next, given a vast catalog of precedents. But faced with genuine novelty – a constraint truly outside the distribution – they falter, revealing the limits of a purely statistical worldview. The ecosystem will continue to evolve, but the underlying paradox – the tension between prediction and understanding – will persist.
Original article: https://arxiv.org/pdf/2511.21086.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-11-30 19:30