Debugging as a Window into How AI Thinks
I spent an hour and a half watching an AI fail to solve a five-minute problem. That session changed how I think about AI coding tools.
The problem was mundane. Connect a FastAPI backend to a Dockerized Postgres instance. Credentials in a .env file. The kind of thing you bang out and move on from. Except Gemini CLI could not move on.
Over 90 minutes, it tried restarting the container, resetting the password, changing the host to localhost, trying no password at all, changing the user to root, setting authentication to trust, pruning Docker, running diagnostic Python scripts, and cycling through variations on all of these. Each suggestion was reasonable in isolation. None of them worked, because none of them addressed the actual problem.
The error message said role "appuser" does not exist. The answer was right there. My .env file specified appuser, but the Docker container only had postgres as a user. Classic credential mismatch. But Gemini treated that error message as one clue among many, not as the most probable explanation.
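The mismatch itself is mechanical enough to check in a few lines. A hypothetical sketch (the function name and the `.env` contents are illustrative; `POSTGRES_USER` is the variable the standard Postgres Docker image reads at initialization):

```python
# Hypothetical sketch of the check Claude effectively performed:
# does the DB user in .env exist among the users the container was created with?
def find_role_mismatch(env_text, container_users):
    """Return the .env DB user if the container doesn't know it, else None."""
    env = dict(
        line.split("=", 1)
        for line in env_text.strip().splitlines()
        if "=" in line and not line.startswith("#")
    )
    user = env.get("POSTGRES_USER")
    return user if user and user not in container_users else None

env_text = """
POSTGRES_USER=appuser
POSTGRES_PASSWORD=myrootpassword
"""
find_role_mismatch(env_text, {"postgres"})  # flags "appuser"
```

Nothing about the check is clever. What matters is which piece of evidence gets weighed first.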
I switched to Claude. Response: "You're using appuser in your connection string, but your Docker container initialized with postgres as the only user. Update your .env to use postgres:myrootpassword, or create appuser manually."
Done. Five minutes. Moved on.
Same problem. Same information available. Radically different approach. That gap, 90 minutes versus five, is what this essay is about. Not because one tool is better than the other, but because watching that gap reveals something about how these systems approach problems. And once you see the pattern, you can't unsee it.
My First Wrong Frame
My initial reaction was simple: Claude is smarter than Gemini.
I sat with that for a while before realizing it was too easy. Both are large language models. Both have been trained on enormous amounts of code, documentation, and troubleshooting guides. Both can write a perfectly competent Postgres tutorial from scratch. So raw capability wasn't the difference.
I tried a few other explanations. Maybe Gemini had a bad day, but I saw the same pattern in other sessions. Maybe I prompted them differently, but I checked and I gave both the same context: the error message, the connection string, the Docker command. Maybe it was the product, not the model, since Gemini CLI and Claude are different products with different system prompts and UX wrappers. That felt closer, but the behavioral difference was so consistent across different types of problems that it seemed like more than product design.
The frame I landed on, and the one I still find most useful, is that they have different problem-solving modes.
Checklist Mode and Hypothesis Mode
Here's what I noticed watching the two sessions side by side.
Gemini was running through possibilities. It had what felt like an internal list of "things that can go wrong with Postgres connections" and it was executing that list sequentially. Restart the container. Reset credentials. Check the network. Adjust the auth method. This approach is systematic, comprehensive, and slow. It's the approach you'd find in a troubleshooting guide. Step 1, Step 2, Step 3. If none of those work, continue to Steps 4 through 10.
I've started calling this checklist mode. Exhaustive search over a known solution space. Safe. Thorough. Doesn't prioritize by the specific evidence in front of it.
Claude did something different. It looked at the available information, the connection string with appuser, the Docker command with postgres, the error message about a missing role, and went directly to the most probable explanation. It formed a specific hypothesis and tested it with minimal steps.
I call this hypothesis mode. Pattern-match to the most likely cause. Test it. If wrong, update and try again. Fast when the likely cause is correct. Potentially blind to unusual problems that don't match known patterns.
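The two modes can be caricatured in a few lines of code. Everything here is illustrative, including the candidate list; the point is only the difference in ordering:

```python
# Toy formalization of the two modes. The repertoire is the same;
# only the order of attempts differs.
CHECKLIST = ["restart container", "reset password", "try localhost",
             "try no password", "switch user to root", "set auth to trust"]

def checklist_mode(problem, try_fix):
    # Walk a fixed repertoire in its written order, regardless of evidence.
    for step in CHECKLIST:
        if try_fix(step):
            return step
    return None

def hypothesis_mode(problem, try_fix, rank):
    # Weight the same candidates by the evidence; test the most probable first.
    for step in sorted(CHECKLIST, key=lambda s: -rank(problem, s)):
        if try_fix(step):
            return step
    return None
```

If the evidence ranks the true cause highly, hypothesis mode resolves in one attempt where checklist mode may burn through the whole list; if the true cause matches nothing the ranker knows, the two degrade to the same exhaustive walk.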
I don't think I invented this distinction. It probably maps to something well-established in cognitive science or decision theory. But I arrived at it by watching behavior, not by reading about it first, and that felt important. The pattern was visible before I had a name for it.
Neither mode is always right. If you're facing a truly strange edge case, something that doesn't match known patterns, checklist mode might find it where hypothesis mode would miss it entirely. If you're facing a common configuration error (which most debugging sessions involve), hypothesis mode is dramatically faster. The problem is that most people treat AI coding tools as interchangeable: "I'll use my AI assistant." Knowing which mode you're getting is a meta-skill that actually matters.
Testing the Frame Across Sessions
A single debugging session isn't much evidence. I've been watching for this pattern since that Postgres incident, across different problem types and different tools. Some observations that make me more confident the frame points at something real:
CORS configuration. I had a FastAPI CORS issue. Gemini suggested roughly a dozen different CORS configurations, headers, middleware options, essentially an inventory of "things that affect CORS." Claude noticed I was configuring CORS after mounting the routes and pointed out the order dependency. Same dynamic: broad coverage versus targeted diagnosis.
Build failures, API integration, config errors. The specific checklists and hypotheses change depending on the problem domain, but the modes are recognizable. Gemini tends to cover more ground. Claude tends to jump to the most likely cause first.
Prompting can shift the mode, but only partially. When I told Gemini "focus on the most likely cause based on the error message," it became somewhat more targeted. But it still wanted to cover bases. The underlying tendency persisted. It was like asking a thorough person to skip steps. They can do it, but it goes against the grain.
I want to be honest about the limits of this evidence. A few dozen sessions across a few months is better than one session, but it's not a systematic study. The pattern could be shaped by the specific problems I tend to encounter, or by how I tend to describe problems. I'm building a model from observation, and I know observation can fool you.
What Might Cause the Difference
I'm going to speculate here, and I want to flag that this is speculation.
Gemini's checklist behavior could reflect training emphasis on troubleshooting documentation. If you've read a lot of "How to fix Postgres connection errors" articles, the structure is always: check A, check B, check C, check D. Training on that structure would naturally produce systematic, coverage-oriented responses.
Claude's hypothesis behavior could reflect training emphasis on expert problem-solving. When an experienced engineer sees role "appuser" does not exist, they don't run through a checklist. They read the error, match it to a probable cause, and test that cause. If the training data emphasizes that kind of reasoning, the model would naturally produce targeted, evidence-weighted responses.
Another possibility I should take seriously: it might be about the product, not the model. Both tools have system prompts I can't see. Maybe Gemini's system prompt says "be thorough and systematic." Maybe Claude's says "be efficient and direct." The behavioral difference might be a product decision at the application layer, not a model characteristic at the weights layer.
Or maybe it's about context window management. The tools handle conversation history differently. Maybe Gemini's checklist behavior emerges from how it chunks and retrieves context, not from anything about its reasoning process.
I don't have good answers to these questions. What I have is a model that predicts behavior well enough to be useful. That's not the same as understanding the mechanism, and I try not to confuse the two.
The Loop Problem
There's another pattern I've observed that I think connects to something deeper about AI cognition: loops.
Working with Cursor, I asked it to apply only a subset of its proposed changes. Something like: "I want changes A and B, but not C. Apply just A and B."
It couldn't do this cleanly. Instead, it entered a 3-4 iteration loop. It proposed changes including C. I rejected C. It proposed changes including C again. I rejected again. Same thing a third time. Eventually it got it right, but the path to getting there was painful.
I've seen this pattern with other tools and in other contexts. Ask for partial acceptance of a suggestion, and the AI gets confused. Its internal representation of "what the file should look like" conflicts with the actual file state, and it tries to reconcile them by re-proposing the changes you already rejected.
Here is what I think is happening. The AI maintains some representation of a desired end-state based on its full analysis. When you accept only part of the suggestion, that representation doesn't update cleanly. Each round feels somewhat independent, with limited memory of what was specifically rejected. It "knows" what the file should look like. When the file doesn't look like that, it tries again. And again.
What's missing is something you could call a "stuck" signal.
Human debuggers notice when they're looping. There's a feeling of "wait, I already tried this" or "this approach isn't working." We call it frustration, and frustration is genuinely useful information. It tells you to change strategies. To step back. To try something fundamentally different.
AI doesn't have frustration. It doesn't have a meta-level process watching the problem-solving and saying "you've been stuck for 10 minutes, change your approach." It does object-level problem-solving, generating solutions within a context, but it doesn't monitor that problem-solving from above.
This might sound like I'm anthropomorphizing. I don't think I am, or at least I don't think the observation requires anthropomorphism to be useful. Whether we call it meta-cognition, self-monitoring, or something more mechanical, the behavioral fact remains: AI tools don't reliably detect when they're stuck. And that limitation has practical consequences.
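What a purely mechanical stuck signal could look like is not hard to sketch. Here's a hypothetical version (class and parameter names are mine, not any tool's API): fingerprint each proposed fix and raise a flag when a proposal recurs or the attempt budget runs out.

```python
import hashlib

# Hypothetical "stuck" detector: no frustration, just bookkeeping over
# what has already been proposed.
class StuckDetector:
    def __init__(self, max_repeats=1, max_attempts=8):
        self.seen = {}          # fingerprint -> times proposed
        self.attempts = 0
        self.max_repeats = max_repeats
        self.max_attempts = max_attempts

    def record(self, proposal: str) -> bool:
        """Return True when the loop should change strategy."""
        self.attempts += 1
        key = hashlib.sha256(proposal.strip().encode()).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        repeated = self.seen[key] > self.max_repeats
        exhausted = self.attempts > self.max_attempts
        return repeated or exhausted

d = StuckDetector()
d.record("reset password")     # False: first attempt
d.record("restart container")  # False: new attempt
d.record("reset password")     # True: same fix proposed twice
```

Whether bolting such a monitor onto the outside of a model would reproduce what frustration does for humans is exactly the open question; the sketch only shows that the behavioral trigger is cheap to compute.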
The Brittleness Default
There's a pattern I noticed in the same Cursor session that I initially thought was separate, but I've come to think it connects to the same underlying limitation.
I watched Cursor solve a problem by hardcoding logic based on specific phrases. The task involved responding differently based on conversation history. Instead of building something flexible, it wrote something like:
if "cancel my order" in last_message.lower():
    return handle_cancellation()
elif "where is my package" in last_message.lower():
    return handle_tracking()
This is the kind of code that breaks the moment it encounters real users. "I need to cancel" doesn't match. "Track my shipment" doesn't match. It solves the exact test cases visible in the context and fails as soon as inputs vary even slightly.
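For contrast, a hedged sketch of what a slightly more robust version might look like, still far from production-grade: match on keyword sets per intent rather than exact phrases, with an explicit fallback. The intent names and keywords are illustrative, not from the original session.

```python
# Keyword-set matching: still simple, but tolerant of phrasing variation
# and honest about what it can't classify.
INTENT_KEYWORDS = {
    "cancel": {"cancel", "cancellation", "refund"},
    "track": {"track", "tracking", "package", "shipment", "delivery"},
}

def classify(message: str) -> str:
    words = set(message.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:  # any keyword present
            return intent
    return "unknown"

classify("I need to cancel")     # "cancel"
classify("Track my shipment")    # "track"
classify("where is my package")  # "track"
```

The interesting part isn't the keyword sets; it's the `"unknown"` branch. The brittle version has no concept of an input it wasn't shown.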
Brittle logic, not robust abstraction. Why does AI default to this?
A few possible explanations. Training data contains a lot of specific hacks and one-off solutions, because that's what a lot of real code looks like. The model optimizes for the immediate context (what's in front of it right now) rather than future contexts (what inputs might come later). And perhaps most interesting: it can't easily simulate "what happens when this fails?" Writing robust code requires reasoning about situations you're not currently looking at, imagining users you haven't seen, anticipating inputs you haven't received. That's a form of stepping outside the immediate context to evaluate the approach from a broader perspective.
Here's why I think brittleness connects to the loop problem and to checklist mode. All three might be symptoms of the same limitation: weak meta-cognition. The model is good at object-level problem-solving within a given context. It's less good at evaluating whether the context is right, whether the approach is working, or what other contexts might look like.
Checklist mode: solving within a fixed repertoire without evaluating whether the repertoire fits the evidence. Loops: solving without detecting that the solution keeps failing. Brittleness: solving for the visible case without imagining invisible cases. Same limitation, different manifestations.
I could be wrong about this. These might be three genuinely separate phenomena that I'm forcing into a single frame because unified theories are satisfying. I'll flag it as a hypothesis I'm still testing.
The O(n-squared) Problem
One more pattern, from a different context. Using Claude for an ML project, I watched it grab 10+ features without any exploratory data analysis. Just threw everything into the model. No feature selection, no checking distributions, no thinking about what might actually matter. It also wrote O(n^2) logic (nested loops over DataFrames) that was correct but would never scale to production data sizes.
This is worth sitting with. The code worked. It passed every test you could throw at it. But "works" and "works at scale" are different things, and the AI defaulted to the first without considering the second.
I think this connects to something about what these models learn from training data. They're trained on code that works, not code that scales. They learn correctness, not efficiency. They learn patterns, not the trade-offs behind those patterns. An engineer with production experience knows that nested DataFrame loops are a red flag. The model sees that the loops produce the right output and moves on. The judgment about scalability requires imagining a future context (production data volumes) that isn't present in the immediate context.
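The shape of the trap is easy to show in miniature. Here's an illustrative sketch in plain Python standing in for the DataFrame case (the data and field names are made up): both joins produce identical output, but one scans every user per order, O(n*m), while the other builds a lookup table first, O(n+m).

```python
orders = [{"order_id": 1, "user_id": 10}, {"order_id": 2, "user_id": 11}]
users = [{"user_id": 10, "name": "Ada"}, {"user_id": 11, "name": "Lin"}]

def join_quadratic(orders, users):
    # What "works but won't scale" looks like: rescan all users per order.
    out = []
    for o in orders:
        for u in users:
            if o["user_id"] == u["user_id"]:
                out.append({**o, "name": u["name"]})
    return out

def join_linear(orders, users):
    # Hash join: one pass to index users, one pass over orders.
    by_id = {u["user_id"]: u["name"] for u in users}
    return [{**o, "name": by_id[o["user_id"]]} for o in orders]

# Identical output on any test you'd write at this size.
assert join_quadratic(orders, users) == join_linear(orders, users)
```

On two rows, nothing distinguishes them. Every signal available in the immediate context says both are fine; the difference only exists in a context (millions of rows) the model never sees.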
Again, this looks like the meta-cognition gap. Evaluating a solution requires stepping outside the solution to ask "is this good enough for the world it will live in?" The model answers the question in front of it. The question of whether the answer is production-ready is a different question, one that requires reasoning about contexts the model can't currently see.
What This Has Changed in Practice
Understanding these patterns has changed how I work with AI tools day to day. Not in dramatic ways, but in small adjustments that compound.
I watch for mode. When I start a debugging session, I pay attention to whether the AI is in checklist mode or hypothesis mode. If it's covering bases when I need a targeted diagnosis, I'll steer it toward hypotheses. If it's jumping to conclusions when I need thoroughness, I'll ask for systematic coverage. Matching the mode to the problem matters more than which tool I'm using.
I notice loops earlier. When an AI suggests the same change twice, I don't just reject again. I recognize we're in a loop and either rephrase my request, give completely fresh context, or do the edit manually. Fighting the loop is usually slower than stepping around it.
I compensate for brittleness. When AI proposes a solution, I now automatically ask myself "what inputs would break this?" If the answer is "lots of them," I either prompt explicitly for edge cases or handle the robustness myself.
I've adjusted tool selection based on problem type. Common problems where speed matters: lean toward hypothesis-style tools. Unusual problems where I might miss something: use checklist-style approaches, or explicitly prompt for systematic coverage.
None of this is remarkable. It's just paying attention to patterns and adapting. But the compounding effect is real. I'm noticeably faster than I was before I started thinking about these modes, and I waste less time fighting AI tools' tendencies instead of working with them.
A Testable Model
I've been building this model from observation, and models are only useful if they make predictions. Here's what mine predicts:
Checklist-style tools should be slower on common problems but more thorough on unusual ones. You could test this by presenting the same problem set to different tools, with some problems being common (should match patterns) and some unusual (shouldn't match patterns). Checklist tools should underperform on common problems but match or beat hypothesis tools on unusual ones.
You can shift modes with prompting, but there are limits. "Focus on the most likely cause" should make a checklist tool somewhat more hypothesis-driven. "Systematically check all possibilities" should make a hypothesis tool more checklist-driven. But the shift shouldn't be complete. The underlying tendency should persist.
Loop behavior should decrease as state management improves. Tools with better mechanisms for tracking what's been tried and rejected should loop less. This predicts that as AI tools add more sophisticated session management, we should see fewer loops. That seems like a testable prediction as tools evolve.
Brittleness should decrease with specific prompting about edge cases. If you explicitly ask "what edge cases could break this?" or "make this robust to input variations," the brittleness should decrease. The models know how to write robust code. They just don't default to it. The knowledge is there; the meta-cognitive trigger to apply it is what's missing.
I haven't tested these predictions systematically. They're offered as falsifiable claims, not as conclusions. If someone ran the experiments and the predictions failed, I'd need to update the model. That's as it should be.
The Deeper Question I Can't Resolve
Debugging has always been interesting to me because of what it reveals about cognition. When a human debugs, there's a story happening underneath: hypothesis formation, memory retrieval, pattern matching, frustration, strategy shifts. We can't verify this story directly (introspection is famously unreliable), but something structured is clearly going on.
When AI debugs, something is happening too. The model processes context, generates candidate solutions, adjusts based on feedback within a session. There's structure to the behavior. The structure is consistent enough across sessions to categorize and predict.
Is that "thinking"?
I don't have a good answer. The word carries a lot of baggage. It might be a category error to ask whether AI "thinks" the way we do. These models are doing something when they problem-solve, but whether that something shares enough properties with human thinking to deserve the same word, I genuinely don't know.
What I'm more confident about is the practical claim: whatever we call it, the behavior is structured enough to study, consistent enough to predict, and different enough across tools to matter for how you work with them. You don't need to resolve the philosophical question about AI cognition to benefit from understanding AI behavior at the level of patterns.
What would it take for AI to have real meta-cognition? To notice it's looping? To feel that a solution is brittle? I find myself genuinely curious about this, not as a philosophical exercise but as an engineering question. Could you build a meta-cognitive monitor as a separate process that watches the problem-solving and intervenes? Would that be enough, or does meta-cognition need to be integrated into the reasoning itself? Is frustration, the subjective experience of being stuck, actually necessary for the behavioral shift it produces? Or could you get the same behavioral result through a purely mechanical "stuck detector"?
I don't know. I'm not sure anyone does yet. But watching AI debug has made me think about these questions differently. Not as abstractions about consciousness, but as engineering problems about what's missing from current systems, and what it would take to add it.
Where This Leaves Me
Debugging is a window into problem-solving. If you want to understand how an AI approaches problems, watch it debug. Pay attention to the sequence of actions, what it tries first, what evidence it weighs, how it responds to failure, whether it detects when it's stuck.
The patterns will teach you something about the tool. Checklist or hypothesis. Loop-prone or adaptive. Brittle or robust by default. These aren't fixed categories but tendencies that you can observe, predict, and work with.
The patterns might also teach you something about problem-solving itself, about what meta-cognition actually does, about why frustration exists, about the difference between solving a problem and knowing you've solved it well.
I started this by watching an AI fail for 90 minutes. I've ended up thinking about the nature of thinking. I'm not sure I've arrived anywhere definitive, but the questions feel more precise than they did before. And in my experience, precise questions are worth more than vague answers.
If you've noticed similar patterns, or different ones entirely, I'm curious to hear about it. The model I've built here is from one person's observations. It would be stronger, or correctly broken, with more data.
Open Questions
Is "checklist vs. hypothesis" the right frame? Maybe there are more modes I'm missing. Maybe the distinction is wrong entirely. I'm most confident about the behavioral observations and least confident about my taxonomy. What would increase my confidence: seeing the same pattern across many more sessions, ideally with different people observing independently.
How much is model versus product? Gemini CLI and Claude aren't just different models. They're different products with different UX wrappers, system prompts, and context management. I don't know how to separate model behavior from product design with the information available to me. This is one of my biggest uncertainties.
Can you reliably shift an AI's mode? I've had partial success with explicit prompting. But I haven't tested it rigorously, and I don't know where the limits are. This seems like a tractable question someone could answer with careful experimentation.
What happens as models scale? Does hypothesis-style reasoning emerge with scale? Does checklist behavior decrease? I'd love to see data on this, and I don't have any.
Is the meta-cognition gap fixable? Is the lack of "stuck detection" fundamental to current transformer architectures, or is it an engineering problem that better tooling could address? I genuinely don't know enough about the internals to have a strong view, but the question matters a lot for where these tools go next.
What would change my mind: Evidence that the patterns I describe are artifacts of my prompting rather than genuine tool tendencies. Systematic studies comparing debugging behavior across tools, versions, and problem types. Better understanding of what's actually happening inside these models during problem-solving. I'm most uncertain about whether my explanations of why these patterns occur are correct. The behavioral patterns themselves feel more solid than any theory I've built on top of them.