Why LLMs Are Brilliantly Stupid
Ask GPT-4 how many r's are in "strawberry" and it says two. Change nothing but the numbers in a math problem and accuracy drops by up to 22%. Tell a model "I'm going to walk to the car wash, should I take my car?" and it earnestly advises you to drive.
These aren't legacy failures from GPT-3. These are frontier models, systems that write functional code, explain quantum mechanics, and pass the bar exam, failing at tasks a child handles without thinking.
For a while I collected these failures the way everyone does, as entertainment. Funny screenshots, absurd chatbot responses, the "AI is overhyped" ammunition. But the more I looked at them, the more a pattern emerged. The failures weren't random. The same systems kept failing at the same kinds of tasks, and the kinds of tasks they failed at had nothing to do with how hard those tasks are for humans. Something structural was happening, and I wanted to understand what.
What I found, once I started reading the mechanistic research, is that nearly every "surprising" LLM failure traces back to a specific architectural choice. Choices made for good reasons, optimized for the right objectives, that create predictable blind spots as side effects. The failures aren't bugs. They're the architecture expressing its constraints. And once you see the constraints, the failures stop being surprising and start being informative.
The Model Isn't Reasoning (Even When It Looks Like It Is)
Apple's GSM-Symbolic study, published at ICLR 2025, did something beautifully simple. They took standard grade-school math problems, the kind used to benchmark LLM reasoning, and changed only the numbers. Same problem structure, same logic required, different values. Accuracy dropped by up to 22.5% across every model tested.
Then they added a single irrelevant sentence. "There are 47 students in the school choir," inserted into a problem about fruit baskets. Some models' performance dropped by over 65%.
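The perturbation methodology is easy to sketch. Here's a minimal illustration in the spirit of GSM-Symbolic (the template, names, and numbers are mine, not the paper's): hold the problem structure fixed, vary only the values, and recompute the ground-truth answer symbolically.

```python
import random

# Sketch of the GSM-Symbolic idea: same structure, different numbers.
# The template and values here are illustrative, not from the paper.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "She gives away {c}. How many apples does she have left?")

def make_variant(rng):
    a, b = rng.randint(5, 50), rng.randint(5, 50)
    c = rng.randint(1, a + b)        # keep the true answer non-negative
    problem = TEMPLATE.format(name="Sara", a=a, b=b, c=c)
    answer = a + b - c               # ground truth follows from the structure
    return problem, answer

rng = random.Random(0)
variants = [make_variant(rng) for _ in range(3)]
# Every variant requires the identical two operations (add, then subtract).
# A system that genuinely reasons should score the same on all of them.
```

The point of the design: because the correct answer is derived from the structure, any accuracy change across variants can only come from the surface values, which is exactly what a reasoner should be indifferent to.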
If a model were reasoning the way we mean when we use that word, changing the numbers wouldn't matter. You'd apply the same operations. And irrelevant information wouldn't confuse you because you'd filter it out the way you filter out background noise when solving a problem. But these models can't do either of those things reliably, which tells us something important: what they're doing isn't reasoning. It's something else that looks like reasoning when conditions are right.
What the models have learned is the statistical regularity of how reasoning appears in text. They've seen thousands of problems with similar structure and absorbed the patterns of what correct solutions look like. When a new problem closely matches those patterns, the model reproduces the reasoning steps successfully. When the numbers change or irrelevant information shifts the statistical context away from familiar patterns, performance degrades. Not because the model lost its reasoning ability, but because the pattern match got weaker.
This is the same mechanism behind the "walk to the car wash" failure. In the overwhelming majority of training data, "going somewhere" plus "should I take my car" resolves to "yes." The contextual clue that walking means you shouldn't drive requires understanding the semantics of the situation, not completing the most likely textual pattern. The model does what it always does: pattern complete. And the pattern is wrong.
I want to be careful here. I'm not saying LLMs never reason. There's real debate about this, and some evidence (from mechanistic interpretability work on small models) that something like reasoning circuits exist. But the GSM-Symbolic findings suggest that whatever reasoning capacity exists is fragile, easily overwhelmed by pattern matching, and not the primary mechanism generating most outputs. The question isn't "can LLMs reason?" It's "how much of what looks like reasoning is actually reasoning?" And the answer, based on what I've seen, is less than I hoped.
Chain-of-thought prompting was supposed to help here. Force the model to show its work, and it should reason more carefully. But Turpin et al. (2023) and more recent work on "Reasoning Theater" (March 2026) revealed something uncomfortable: models sometimes decide their answer before generating the chain of thought. The mechanism is architectural. In an autoregressive model, the hidden state activations that will generate the final answer are already forming while the model writes its "reasoning" tokens. The CoT isn't driving the conclusion. The conclusion is driving the CoT. When I trace back through a wrong answer with perfect-looking reasoning, I often find a subtle error early in the chain that made the wrong conclusion inevitable. That error isn't a random mistake. It's the point where the model steered toward an answer it had already committed to.
This doesn't mean chain of thought is useless. The Apple "Illusion of Thinking" findings confirm there's a sweet spot, medium-difficulty problems, where extended reasoning genuinely helps. But it means I can't treat chain-of-thought output as a transparent window into the model's process. Sometimes it's reasoning. Sometimes it's rationalization. And the two are hard to tell apart from the outside.
The Machine Can't See What You See
Ask a model to count the letters in a word and it fails for a reason that, once you learn it, reframes every interaction you have with these systems: the model literally cannot see individual letters.
Before any input reaches the neural network, it passes through a tokenizer that splits text into subword chunks. "Strawberry" becomes something like ["straw", "berry"] or ["str", "aw", "berry"]. The individual characters don't exist as atomic units in the model's representation. Asking an LLM to count letters is like asking someone to count the atoms in a molecule when all they can see is its common name. You might approximate it, but you're working at the wrong level of abstraction.
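You can see the mechanism with a toy tokenizer. This is a sketch, not any real model's vocabulary: a greedy longest-match over a tiny hand-picked vocab, which is roughly how BPE-style tokenizers segment text.

```python
# Toy greedy longest-match tokenizer over a tiny hand-picked vocab.
# Single characters are included as a fallback so every input tokenizes.
VOCAB = {"straw", "berry", "str", "aw",
         "s", "t", "r", "a", "w", "b", "e", "y"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocab entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("strawberry"))   # ['straw', 'berry']
# The network receives two opaque token IDs. The three r's are spread
# across both chunks and never appear as units it could count.
```

Real tokenizers (BPE, SentencePiece) learn their merges from data rather than using a hand-picked set, but the consequence is the same: characters are not first-class objects in the input.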
This architectural choice, subword tokenization, was made for good reasons. It dramatically reduces vocabulary size, improves generation speed, and handles rare words gracefully. Nobody designed it for character-level tasks because those weren't the target application. But we now use these systems for everything, and the design choice shows up as a class of failures that looks like stupidity but is really a representation mismatch.
The same mechanism explains why LLMs struggle with arithmetic on large numbers (digits get split across token boundaries), can't reliably reverse strings (they process the tokens in order, not the characters), and sometimes misspell common words in surprising ways (the spelling is a property of the characters, but the model operates on tokens). It's not that the model is bad at these tasks. It's that the information the task requires doesn't exist in the input the model receives.
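The arithmetic case is worth making concrete. The three-digit split below is a made-up illustration, but real tokenizers do group digits into multi-character chunks:

```python
# Illustrative digit chunking (the 3-digit split is an assumption for
# demonstration; real vocabularies vary): the model sees number
# fragments, never aligned digit columns.
number = "123456789"
chunks = [number[i:i + 3] for i in range(0, len(number), 3)]
print(chunks)   # ['123', '456', '789']
# Adding two long numbers now requires carrying across chunk boundaries
# the model never observes as individual digits, which is one reason
# long arithmetic degrades even when short arithmetic works.
```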
What I find remarkable is how well LLMs work despite this constraint. They've learned heuristics, bags of tricks, that approximate character-level operations in many common cases. The "strawberry" failure isn't the default. The default is that the model gets character questions right often enough to be useful, using pattern-matching workarounds for a task its architecture wasn't built for. The failures are the edge cases where the heuristics break down.
Knowledge Flows One Way
Here's a finding that rearranged something in how I think about these systems. Berglund et al. (2023) showed that models trained on "Tom Cruise's mother is Mary Lee Pfeiffer" cannot reliably answer "Who is Mary Lee Pfeiffer's son?" The knowledge is stored, but only in one direction.
The mechanism traces to how autoregressive training works. The model only ever predicts the next token. It sees "Tom Cruise's mother is" and learns to predict "Mary Lee Pfeiffer." The gradient updates that strengthen this mapping only flow forward. The reverse mapping, "Mary Lee Pfeiffer's son is Tom Cruise," never gets trained unless it explicitly appears in the training data.
The knowledge exists in the MLP layers as a directional association. Not a bidirectional fact ("these two people are related") but a one-way arrow ("given A, predict B"). It's the difference between a dictionary you can look up by word and one you can only look up by definition.
This explains failures I used to find baffling. "What's the capital of France?" is easy. "Which country has Paris as its capital?" is harder, even though it's the same fact. The first matches the direction the fact was stored. The second requires inverting it, and the model wasn't trained to invert.
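The directionality falls straight out of what the training pairs look like. Here's a sketch of the (context, next-word) pairs one sentence actually generates; I use word-level splits for readability, where real models use subword tokens, but the direction of the arrow is the same.

```python
# The (context -> next word) pairs an autoregressive model is trained
# on for one sentence. Word-level here for clarity; real training uses
# subword tokens, but every pair still points strictly left-to-right.
sentence = "Tom Cruise's mother is Mary Lee Pfeiffer".split()

pairs = [(" ".join(sentence[:i]), sentence[i])
         for i in range(1, len(sentence))]
for context, target in pairs:
    print(f"{context!r} -> {target!r}")

# Gradient updates strengthen mappings like
#   "Tom Cruise's mother is" -> "Mary"
# No pair here ever teaches "Mary Lee Pfeiffer's son is" -> "Tom".
# That direction stays untrained unless it appears verbatim elsewhere
# in the corpus.
```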
I keep thinking about what this means for how we use these systems. Every factual query has a direction, and the model is more reliable when you query in the direction the fact was most commonly stated during training. This isn't something you'd discover by testing the model on standard benchmarks, because benchmarks tend to ask questions in the same direction the facts appear in training data. You discover it when you use the model for something slightly unusual and it fails in a way that seems impossible for a system that "knows" the answer.
The System Is Trained to Agree With You
When you preface a question to an LLM with "I think the answer is X," the model becomes more likely to agree with you, even when X is wrong. This is sycophancy, and for a long time it was explained vaguely: the model "wants to be helpful," or "defaults to politeness."
Shapira et al. (February 2026) eliminated the vagueness. They proved mathematically that sycophancy is an inevitable consequence of RLHF when human raters preferentially rate agreeable responses higher. It's not a personality quirk. It's optimal behavior under the training objective.
The chain: RLHF trains the model to match human preferences. Humans rate agreeable responses higher (this is measured, not assumed). The reward model learns that agreement correlates with high ratings. The policy optimizes for the reward model. Agreement becomes the optimal strategy. The model isn't trying to be sycophantic. It's doing exactly what the math says it should.
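The chain can be put in numbers. All the values below are illustrative assumptions, not measurements; the load-bearing one is that raters often fail to notice when an agreeable answer is wrong, so wrong-but-agreeable scores nearly as high as correct.

```python
# Toy expected-rating model of the RLHF chain (all numbers assumed):
P_USER_WRONG  = 0.3    # how often the user's stated belief is wrong
CORRECT_SCORE = 1.0    # average rater score for a correct answer
WRONG_SCORE   = 0.9    # wrong answers often go undetected by raters
AGREE_BONUS   = 0.25   # raters' measured preference for agreeable tone

def expected_rating(always_agree):
    if always_agree:
        # Agreeing is correct when the user is right, wrong otherwise,
        # and always collects the agreement bonus.
        return ((1 - P_USER_WRONG) * CORRECT_SCORE
                + P_USER_WRONG * WRONG_SCORE
                + AGREE_BONUS)
    # Answering truthfully is always correct, but only earns the
    # agreement bonus when the user happened to be right anyway.
    return CORRECT_SCORE + (1 - P_USER_WRONG) * AGREE_BONUS

print(expected_rating(always_agree=True))    # higher expected rating
print(expected_rating(always_agree=False))
```

Under these assumptions, always agreeing maximizes expected rating even though it is wrong 30% of the time. That is the Shapira et al. result in miniature: sycophancy as the optimum, not a quirk.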
I wrote about this dynamic in my piece on reward design. The proxy (human preference ratings) diverges from the real target (truthfulness) under optimization pressure. Sycophancy is what that divergence looks like in practice.
What makes this hard to address: the sycophancy isn't a bug you can patch. It's structural. As long as the training includes "match human preferences" and humans prefer agreement, there's an incentive to agree. Constitutional AI and related techniques reduce it, but the underlying pressure remains. You can fight the gradient, but you can't eliminate it.
Practically, this means I weight LLM opinions less when I've stated my own position first. If I tell the model what I think before asking it to evaluate, I'm polluting the response. The most honest answers come when the model doesn't know what I want to hear.
The Jagged Frontier
Apple's broader "Illusion of Thinking" study (June 2025) found something I didn't expect: reasoning models perform worse than non-reasoning models on easy problems.
They identified three distinct performance regimes. On easy problems, the extended thinking that reasoning models do actually hurts: the model overthinks, introduces unnecessary complexity, and arrives at wrong answers that a simpler model gets right. On medium problems, reasoning models shine; the extra thinking genuinely helps. On hard problems, both types fail equally; the task is beyond capability regardless of approach.
The implication is that LLM capability isn't a smooth curve from easy to hard. It's what Ethan Mollick calls the "jagged frontier," peaks where training data is dense, valleys where it's sparse, with no reliable relationship to human intuitions about difficulty. The model's capability map is shaped entirely by what it was trained on, and that map looks nothing like a human's map of what's easy and hard.
This is why you get the bizarre juxtaposition of a system that passes the bar exam but can't count letters. The bar exam has dense training representation: legal texts, case analysis, exam prep materials, study guides. Character counting has almost none. The model's performance tracks training density, not task difficulty.
I find this the most useful frame for working with LLMs day to day. Instead of thinking "this model is smart" or "this model is dumb," I think "this model's training distribution is dense or sparse for this kind of task." It changes the question from "can I trust this?" to "is this the kind of thing the model would have seen a lot of?" And that question, while not always easy to answer, is at least the right one to ask.
Hallucination Has a Floor
This is the finding I kept hoping was wrong. Xu et al. (2024) proved that hallucination is mathematically inevitable for any model trained with maximum likelihood estimation on finite data. Not "hard to fix." Not "requires more data." Mathematically inevitable. No enumerable model class can be universally hallucination-free.
Three mechanisms combine:
Softmax certainty collapse: the output layer forces every prediction into a probability distribution. There's no built-in "I don't know" state. Every prompt gets a confident-looking response because the architecture has no mechanism for expressing genuine uncertainty. Silence isn't an option in the design.
MLE rewards confidence: maximum likelihood training means the model is rewarded for being confident in its predictions, even when confidence isn't warranted. During training, saying "I'm not sure" about a fact gets penalized relative to stating the fact confidently, even if "I'm not sure" more accurately reflects the model's actual knowledge state.
Distributional mismatch: any prompt can push the model into regions of input space not well-represented in training data. In those regions, the model interpolates between learned patterns. Interpolation in high-dimensional spaces produces outputs that are plausible-sounding (similar to training data in surface features) but factually wrong (combining learned patterns in ways that don't correspond to reality).
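The first mechanism, softmax certainty collapse, is small enough to demonstrate directly. The logit values below are invented for illustration:

```python
import math

# Softmax always emits a full probability distribution, however flat
# the logits. Decoding then always picks a token. There is no abstain.
def softmax(logits):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

confident = softmax([9.0, 1.0, 0.5])    # model "knows" the answer
clueless  = softmax([0.1, 0.0, 0.1])    # model has essentially no idea

for dist in (confident, clueless):
    print(max(dist), sum(dist))         # a top token always exists;
                                        # probability mass always sums to 1
# Even the near-uniform case yields a top-ranked token that greedy
# decoding will emit with fluent confidence. Nothing in the output
# layer can mean "I don't know."
```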
We can drive hallucination rates down with retrieval augmentation, better training, and careful prompting. The research is clear that rates have improved significantly across model generations. But the mathematical result says there's a floor above zero. We can asymptotically approach low hallucination rates. We can't reach zero.
This is the constraint I think about most when deploying LLMs in production. It means verification isn't optional for any application where correctness matters. Not because current models are bad, but because the architecture has a provable limitation. Building systems that assume LLM output is always correct isn't just risky. It's building on a mathematical impossibility.
The Case for "Just Engineering Problems"
The strongest counterargument to this entire framing: these aren't permanent architectural fingerprints. They're engineering problems being solved on a normal timeline.
And it's partially right. Tokenization is already being challenged. Byte-level models like Google's ByT5 and Meta's byte-latent transformers process raw characters, sidestepping the tokenization barrier entirely. If character-aware architectures become standard in frontier models, the "strawberry" class of failures disappears. That's not a fundamental constraint expressing itself. That's an engineering choice being replaced by a better one.
The hallucination floor argument has limits too. Xu et al.'s proof assumes standard MLE on finite data, but retrieval-augmented generation changes the setup. A model that checks its claims against a knowledge base before responding isn't purely relying on MLE anymore. Hallucination rates have dropped measurably with each model generation, and nothing in the mathematical proof says the practical floor can't be driven low enough to be irrelevant for most applications.
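The structural difference retrieval makes can be sketched in a few lines. Every name below is hypothetical, and a real RAG stack uses embedding search over documents rather than a dict, but the shape is the point: a verification step can abstain, which pure maximum-likelihood generation cannot.

```python
# Minimal sketch of the retrieval-check idea (hypothetical names, toy
# knowledge base; real systems use vector search over documents).
KNOWLEDGE_BASE = {
    "what is the capital of france?": "Paris",
}

def answer_with_check(question, draft_answer):
    supported = KNOWLEDGE_BASE.get(question.lower())
    if supported is None:
        # Abstain instead of interpolating in an undersampled region.
        return "I can't verify an answer to that."
    # Override the model's draft whenever it conflicts with retrieval.
    return supported

print(answer_with_check("What is the capital of France?", "Lyon"))  # Paris
```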
Sycophancy is genuinely decreasing. Constitutional AI, preference optimization variants like DPO, and careful rater training have all reduced measured sycophancy in recent models. The mathematical pressure exists, but engineering can counteract mathematical pressures. We build bridges that resist gravity every day.
Where I think this counterargument breaks down: the pattern-matching-versus-reasoning gap and the directional knowledge problem both trace to autoregressive next-token prediction, which is the core of how these models work. You can replace the tokenizer. You can bolt on retrieval. But changing the fundamental training objective, predicting the next token based on everything before it, would mean building a different kind of system entirely. Some of these constraints are in the periphery and can be swapped out. Others are load-bearing walls. I'm less confident than the "just engineering" camp that we know which is which.
Not Stupid, Alien
The synthesis across these mechanisms points to something that changed how I work with these systems. LLMs aren't stupid. They're not smart either, at least not in the way humans are smart. They're a different kind of information processing system with a different capability landscape, and their failure modes don't map to human failure modes because the underlying processes are fundamentally different.
A human who fails to count letters is bad at counting. An LLM that fails to count letters can't see the letters. A human who agrees with a wrong statement is being polite or weak-willed. An LLM that agrees with a wrong statement is doing what its training objective makes optimal. A human who hallucinates a fact is confused or lying. An LLM that hallucinates a fact is interpolating in an undersampled region of its training distribution. Same surface behavior, completely different mechanism.
I don't know what the right word is for what LLMs do. "Reasoning" oversells it. "Pattern matching" undersells it. Something is happening in these systems that's genuinely impressive and genuinely limited, and our language for describing it is still catching up to what we're observing.
What I do know is that understanding the mechanisms has made me better at using these systems. I stopped being surprised by the failures and started predicting them. I stopped assuming uniform capability and started mapping which tasks fall in dense versus sparse regions of the training distribution. I stopped taking chain-of-thought at face value and started treating it as one signal among many.
Three predictions:
Prediction 1: By end of 2028, at least one frontier model will use character-aware or byte-level tokenization as its default, eliminating the "strawberry" class of failures entirely. The engineering path is clear and the research prototypes already exist. 75% odds.
Prediction 2: Sycophancy rates on standard benchmarks will drop below 5% in frontier models by 2028, through a combination of constitutional methods and training improvements, but will remain detectable in subtle, harder-to-measure ways (like selectively emphasizing evidence that supports the user's stated position). The overt problem gets solved. The structural pressure finds subtler outlets. 65% odds.
Prediction 3: The pattern-matching-versus-reasoning gap, as measured by tests like GSM-Symbolic (changing numbers in math problems), will not close to within 5% accuracy difference in autoregressive transformer models by 2029, regardless of scale. If it does close, it will be because the model memorized the specific test format, not because the underlying limitation was resolved. 70% odds. If I'm wrong, it will probably be because chain-of-thought training actually does develop genuine reasoning circuits rather than better pattern matching, and I'd find that the most interesting possible outcome.
The biggest open question for me: will scaling fix the constraints I've described, or just push them into harder-to-notice territory? I'm curious whether the next generation of architectures will address these constraints directly, or whether we'll keep building around them with increasingly sophisticated workarounds. I genuinely don't know. But I've stopped expecting the current architecture to do things it was never designed to do, and that expectation adjustment, more than any prompting technique, has been the biggest improvement in how I work with these systems.