I spent the first three months of building with AI coding assistants learning one thing: how to say no.

Not because the AI was wrong. It was usually correct. The code compiled. The logic was sound. The patterns were clean. But "correct" and "necessary" aren't the same thing, and the gap between them is where something important lives. Something I've been trying to name for months. I've been calling it "engineering taste," though I'm not sure that's the right word. Whatever it is, it's the thing that lets you look at working code and say "yes, but not this."

This essay is my attempt to figure out what that thing actually is, where it comes from, and whether AI can ever develop it. I don't have clean answers. But I have a lot of concrete experiences, and they keep pointing in the same direction.

The Abstraction That Wasn't

Here's a story that crystallized the problem for me.

I was building a lesson plan generator, an app that takes learning objectives and produces structured lesson content using LLMs. I asked Claude to help improve the architecture. Within minutes, it proposed a LessonPlanConfig dataclass:

from dataclasses import dataclass
from typing import List

@dataclass
class LessonPlanConfig:
    subject: str
    grade_level: int
    duration_minutes: int
    learning_objectives: List[str]
    # ... more fields

Textbook-perfect suggestion. Type safety. Clear interface. Easier to extend later. The kind of thing you'd see in a post about clean architecture.

But I was passing three arguments. Three. subject, grade_level, and duration_minutes. The dataclass added a new file, a new import, and a new abstraction layer, all to wrap three parameters that fit comfortably in a function signature. It was solving a problem I didn't have and creating complexity I'd need to maintain.
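For contrast, here's the shape of what I kept. The function name and body are hypothetical stand-ins, but the point is the interface: three parameters, no new file, no new abstraction.

```python
def generate_lesson_plan(subject: str, grade_level: int, duration_minutes: int) -> dict:
    # Placeholder body: the real version prompts an LLM. What matters here
    # is the signature -- three parameters fit fine without a config object.
    return {
        "subject": subject,
        "grade_level": grade_level,
        "duration_minutes": duration_minutes,
        "content": f"A {duration_minutes}-minute {subject} lesson for grade {grade_level}.",
    }
```

If a fourth or fifth parameter ever shows up, the dataclass is a five-minute refactor. Reaching for it now pays that cost before the problem exists.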

So I said no.

Then I noticed something worse.

While Claude was proposing elegant abstractions, it had quietly dropped the database schema changes I actually needed. The feature I'd asked about was an LLM-as-judge evaluation system: use one LLM to score the quality of lesson plans generated by another. Claude had built the evaluation logic beautifully. Structured prompts. Scoring rubrics. Clean output parsing.

But it forgot to persist the scores anywhere. Without storage, the whole feature was useless. You could evaluate a lesson plan, see the score, and then it would vanish. No historical tracking. No comparison across runs. The shiny part was polished. The foundation was missing.
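The missing piece wasn't even big. Here's a minimal sketch of the persistence the feature needed, using sqlite3 for illustration; the table and column names are my invention, not the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("""
    CREATE TABLE IF NOT EXISTS evaluation_scores (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        lesson_plan_id INTEGER NOT NULL,
        score REAL NOT NULL,
        rubric_notes TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def save_score(lesson_plan_id: int, score: float, rubric_notes: str = "") -> None:
    # Without this write, every evaluation vanishes the moment you read it:
    # no historical tracking, no comparison across runs.
    conn.execute(
        "INSERT INTO evaluation_scores (lesson_plan_id, score, rubric_notes) VALUES (?, ?, ?)",
        (lesson_plan_id, score, rubric_notes),
    )
    conn.commit()
```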

The AI had optimized for local elegance while missing global necessity. It polished the visible parts and forgot the piece that made them matter.

When I Didn't Say No

I wish I could say I caught this pattern immediately. I didn't.

Two weeks earlier, on the same project, Claude had suggested a PromptTemplate abstraction, a class hierarchy for managing different types of prompts with inheritance and polymorphism. It looked professional. It felt like "real engineering." I said yes.

Three days later I was debugging why a simple prompt change wasn't taking effect. The answer: the change was in the base class but a subclass override was masking it. I spent an hour tracing through inheritance chains to understand my own code.

The abstraction had maybe 40 lines of code. The debugging session cost more than writing the whole thing from scratch would have.
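Reduced to a toy example, the mechanism was simple. These class names are invented for illustration, not the project's actual hierarchy:

```python
class PromptTemplate:
    def render(self, topic: str) -> str:
        # I edited this base template and expected the output to change...
        return f"Write a lesson about {topic}."

class QuizPrompt(PromptTemplate):
    def render(self, topic: str) -> str:
        # ...but this override was the method actually running, so the
        # base-class change was silently masked.
        return f"Write a quiz about {topic}."

prompt = QuizPrompt().render("photosynthesis")
```

A flat dictionary of template strings would have made the same change a one-line edit with nowhere to hide.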

That's when I started paying attention to what I was accepting without thinking. The pattern was clear once I looked: I was saying yes to things that looked like "good engineering" without asking whether they solved problems I actually had.

This felt familiar, and not in a good way. I'd been the junior engineer who read all the books. The one who knew that three parameters should become a config object, that repeated code should become a function, that related classes should share an inheritance hierarchy. I knew the patterns. I didn't know when to ignore them. The AI was doing the same thing. Just faster.

The Pattern Repeats

Once I started watching for it, the pattern showed up everywhere.

In an ML project predicting patient appointment show rates, Claude grabbed 10+ features without any exploratory analysis. Age, gender, appointment time, days since last visit, insurance type, weather forecast, distance from clinic, previous no-show rate. Everything plausibly relevant got thrown in.

No correlation analysis. No check for multicollinearity. No domain reasoning about which features might carry signal. No consideration of what data we'd have at prediction time versus training time. The model would train. It would probably overfit. And when predictions went wrong, we'd have no idea which features were driving the decisions.
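The check I wanted first is cheap. A sketch with pandas, on toy data with invented column names; the threshold is arbitrary, the habit is the point:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the appointment dataset; column names are invented.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 200),
    "days_since_last_visit": rng.integers(0, 365, 200),
    "distance_km": rng.uniform(0, 50, 200),
})
# "distance_miles" is redundant with "distance_km" -- exactly the kind of
# multicollinearity that grabbing every plausible feature produces.
df["distance_miles"] = df["distance_km"] * 0.621

# Five minutes of exploratory analysis before any model training:
corr = df.corr()
redundant = [
    (a, b)
    for a in corr.columns
    for b in corr.columns
    if a < b and abs(corr.loc[a, b]) > 0.95
]
print(redundant)  # highly correlated pairs worth dropping or combining
```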

Same project, same session: I found nested loops over DataFrames.

for idx, row in df.iterrows():
    for other_idx, other_row in df.iterrows():
        if some_condition(row, other_row):
            ...  # do something with the matching pair

On a 500K row dataset, that's O(n^2), roughly 250 billion iterations. The code was correct. It was readable. It would run until the heat death of the universe.

When I pointed this out, Claude cheerfully rewrote it with vectorized operations. It knew how to write efficient code. It just didn't, until prompted. Same pattern: .apply(lambda x: ...) instead of vectorized ops, recomputing expensive operations inside loops, treating a DataFrame like it was a list of dicts instead of a columnar data structure.
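The fix looks like this in miniature. The condition is invented (flag pairs of rows whose values are within 2 of each other), but the move is the standard one: replace Python-level loops with a single broadcast comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1, 5, 3, 8]})

# Loop version: O(n^2) Python-level iterations.
slow_pairs = []
for i, row in df.iterrows():
    for j, other in df.iterrows():
        if i != j and abs(row["value"] - other["value"]) <= 2:
            slow_pairs.append((i, j))

# Vectorized version: the same pairwise comparison as one broadcast.
vals = df["value"].to_numpy()
diff = np.abs(vals[:, None] - vals[None, :])        # n x n matrix of |v_i - v_j|
mask = (diff <= 2) & ~np.eye(len(vals), dtype=bool)  # drop self-comparisons
fast_pairs = list(zip(*np.nonzero(mask)))

# Caveat: a full n x n matrix also won't fit at 500K rows; there you'd sort
# or merge instead. But the instinct -- think in columns, not rows -- is the same.
```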

This is what I mean by "correct but not necessary" and its cousin, "correct but not sufficient." Working code that misses what actually matters for the system to work in practice.

Checklist Debugging vs. Diagnostic Debugging

The pattern showed up in debugging too, and here it was even more stark.

I spent 1.5 hours with Gemini CLI on what should have been a simple Postgres connection issue. I had a Docker container running Postgres, a FastAPI backend, and a .env file with the connection string. The error was clear: FATAL: role "appuser" does not exist.

Gemini's approach: restart the container. Reset the password. Change the host. Try no password. Change the user. Try trust authentication mode. Prune Docker. Run diagnostic Python scripts. Over and over. Each suggestion was individually reasonable. Together, they amounted to guessing.

Claude solved it in about five minutes. "You're using appuser in your connection string but your container initialized with postgres. Update your .env or create the user."

Done. One observation. One fix.

I want to be careful here because this isn't about Claude vs. Gemini. That comparison is incidental. What matters is the difference in problem-solving style: checklist debugging versus diagnostic debugging.

Checklist debugging tries everything that might be relevant. It covers all bases. It's what you do when you don't understand the problem well enough to form a hypothesis. Diagnostic debugging starts with "what's the simplest explanation?" and tests that first. It requires a model of how the system works, not just a list of things that can go wrong.

I've done both kinds in my career. Early on, I tried random things until something worked. The shift to diagnostic debugging happened gradually, through building enough mental models that I could form hypotheses instead of just guessing. Both approaches eventually converge on the right answer. But diagnostic debugging gives you the fix and a model of why it broke, which helps you avoid similar problems later.

The Junior Dev With All the Books

I started describing this pattern to colleagues as "working with a junior engineer who's read all the right books." The metaphor stuck because it described me five years ago.

I could tell you about SOLID principles, design patterns, clean architecture. I'd read the books and done the tutorials. And I'd built systems that were perfectly architected and impossible to maintain.

What I lacked was calibration. I applied patterns without knowing when the pattern fit. I optimized for "does this follow best practices?" before I could answer "does this matter?"

The books don't tell you when to break the rules. They don't tell you that sometimes three parameters are fine, that the abstraction creates more cognitive load than the "problem" it solves, that consistency matters less than clarity and clarity matters less than shipping.

That knowledge came from experience. From shipping code and watching it break. From building the "elegant" abstraction and then having to explain it to five different teammates. Slowly, the patterns became guidelines instead of rules. I developed something that let me look at a suggestion and just feel whether it fit, before I could articulate why.

AI doesn't have that trajectory. It's stuck at "knows the patterns." And it's stuck there while operating at superhuman speed, producing pattern-compliant code faster than you can review it.

What Is Engineering Taste, Actually?

I keep using the word "taste" but it felt vague when I started thinking about it seriously. I tried other frames.

"Experience" didn't capture it. I know experienced engineers with poor judgment and junior engineers with surprisingly good instincts. Time served isn't the variable.

"Knowledge" wasn't right either. The AI has more knowledge than I ever will. It's seen more code, more patterns, more failure modes documented across millions of Stack Overflow threads and GitHub repositories.

"Intuition" was closer but still fuzzy. What makes the intuition good?

Here's where I landed, and I want to be upfront that this is a working hypothesis, not a conclusion: engineering taste is an implicit model of what matters, applied to decisions faster than conscious reasoning.

When I looked at that LessonPlanConfig suggestion and immediately felt "no," I wasn't running through a checklist. I didn't think "well, the function only has three parameters, and the likelihood of needing more is low..." That reasoning came later, when I had to explain my decision to myself. In the moment, something just said "this adds more than it helps." The judgment came first. The justification came after.

I notice this in other domains too. When I'm cooking and something tastes off, I know it before I can identify what's wrong. Pattern recognition operating below conscious thought. Maybe that's all taste is: your brain matching the current situation against a vast library of past situations and their outcomes, without surfacing the matching process. A compressed model of consequences.

But I'm genuinely uncertain about this framing. Maybe taste is something else entirely. Maybe it's about values, or aesthetics, or some interaction between experience and personality that I haven't identified. I'm treating this as one lens, not the answer.

How Taste Gets Built (Maybe)

If taste is an implicit model, what builds it? I've been examining my own experience and watching other engineers develop, or fail to develop, judgment over time. Here's what I think I see, though I'm extrapolating from a limited sample.

Exposure to consequences. You learn what matters by seeing what happens when you get it wrong. The feature that seemed elegant but broke in production. The abstraction that made sense until requirements changed. The optimization that saved milliseconds but cost weeks of debugging time.

Each failure updates your model of "what actually matters." The key word is failure. Success teaches you that something worked, but not why. Failure teaches you what matters by showing you what happens when you ignore it.

My PromptTemplate debugging session taught me more about abstraction costs than any book on clean code ever did. Not because the lesson was new. I'd read about premature abstraction before. But reading about it and spending an hour trapped inside your own inheritance hierarchy are different experiences. The second one sticks.

Tight feedback loops. Consequences only teach if you see them. An engineer who ships code and then moves to the next project never learns whether their decisions were good. An engineer who ships code and maintains it, fixes the bugs, handles the edge cases, extends the features, learns constantly.

This is why ownership matters for developing judgment. Not ownership in the corporate accountability sense, but ownership in the "you will feel the consequences of your decisions" sense.

Variation across domains. Taste built from one type of project doesn't transfer cleanly to others. I have decent intuition for backend services but poor intuition for frontend performance. When I work on React code, I notice myself reverting to pattern-following mode. The best taste probably comes from varied experience across different domains, constraints, and failure modes.

Reflection, not just repetition. Experience alone isn't enough. You have to actually think about it. I only started developing better judgment when I started asking "why did I think that was a good idea?" after things went wrong. Before that, I was accumulating experience without extracting the signal from it.

There's a question buried here that I haven't resolved: is this process the only way to build taste? Or are there faster paths? I'll come back to this.

Why AI Can't Have Taste (The Strong Claim)

Here's where I want to make a claim that I'm genuinely uncertain about but think is worth stating clearly.

If taste is an implicit model built through consequences, feedback, variation, and reflection, then AI has a structural problem. Not a temporary limitation. A structural one.

AI has massive exposure but no consequences. AI has seen more code than any human ever will. Millions of repositories, billions of lines. But it doesn't experience consequences. When AI suggests an over-engineered abstraction, nothing bad happens to the AI. There's no feedback signal that says "this suggestion was locally correct but globally wrong."

Training isn't the same as consequences. AI is trained on human feedback, but that feedback is about whether the output looks good, not whether it worked in the long run. The human labeler rates the code in isolation. They don't know that six months later, this abstraction became a maintenance nightmare.

AI optimizes for proxies, not outcomes. This is Goodhart's Law applied to code generation. AI is trained to produce outputs that score well on some metric, whether that's human ratings, code quality scores, or test pass rates. These metrics are proxies for "actually useful code." But proxies and reality diverge in exactly the cases that matter most.

A dataclass wrapper might score well on "clean architecture" metrics while being unnecessary for this specific codebase. Comprehensive feature engineering might look thorough to a reviewer while missing the analysis that would reveal which features actually matter. The AI maximizes the proxy. Taste is about knowing when the proxy fails.

AI has no skin in the game. Nassim Taleb uses this phrase to describe the difference between people who bear the consequences of their decisions and people who don't. AI doesn't maintain the code it writes. It doesn't debug production failures at 2am. It doesn't have to explain its architecture to future engineers who will inherit the codebase.

Without skin in the game, you don't develop the visceral sense of what matters. You might know intellectually that simple code is easier to maintain, but you don't feel it until you've spent a weekend untangling someone else's clever abstraction. Humans develop taste because bad decisions hurt. AI doesn't hurt.

That's the strong claim. Now let me try to break it.

But Maybe I'm Wrong

I want to take the counter-argument seriously, because I might be describing current limitations rather than structural ones. There are at least three ways my argument could fail.

What if AI could be trained on consequences? Imagine training data that included not just code, but what happened to that code over time. "This abstraction was introduced in commit X. In commits Y through Z, it was the source of 15 bugs. In commit W, it was removed and replaced with something simpler." With that kind of longitudinal signal, an AI could potentially learn "abstractions like this tend to cause problems" without experiencing the problems directly. The consequences would be encoded in the training data.

What if RLHF could capture long-term outcomes? Current reinforcement learning from human feedback asks "does this code look good?" But you could imagine a system that tracks whether generated code survived in codebases, whether it was refactored quickly, whether it introduced bugs. That's closer to real consequences than a thumbs-up from a labeler.

What if AI could simulate consequences? Before suggesting an abstraction, AI could run something like: "If I add this abstraction, what happens when requirements change? What happens when a new developer tries to understand this code? What bugs become more or less likely?" Sufficient simulation might substitute for direct experience.

I'm skeptical of all three, but I can't dismiss them. The first two require training data and feedback signals that don't exist yet but aren't impossible to build. The third requires a kind of causal reasoning that current architectures seem bad at but future ones might handle differently.

What keeps me leaning toward the structural argument: I've watched models get dramatically better at code generation over the past year, but the "correct but not necessary" pattern hasn't diminished. If anything, better models produce more sophisticated unnecessary abstractions. Scale doesn't seem to fix it. Better code generation doesn't produce better code judgment. That suggests the problem isn't capability but something more fundamental.

But I hold this with moderate confidence. If someone shows me an AI system that consistently declines to add unnecessary abstractions without being prompted to simplify, I'd update my position substantially.

The Shift From Generation to Judgment

Here's where this gets uncomfortable for me personally.

For most of programming history, the bottleneck was generation. Could you write code that worked? Could you implement the algorithm? Could you build the system? The scarce resource was the ability to produce working software.

AI removes that bottleneck. Code generation is becoming free. Not free as in "trivial": you still need good prompts and careful review. But free in the sense that the marginal cost of generating more code approaches zero.

When generation is free, what's scarce?

Judgment. Knowing what to generate. Looking at ten possible approaches and knowing which one matters. Looking at AI output and seeing what's missing, like the database schema that wasn't there.

The job isn't writing code anymore. It's deciding what code should exist.

This has implications I find uncomfortable:

Engineers who can only generate become less valuable. If AI can produce clean, working code faster than you can, your ability to produce clean, working code isn't your competitive advantage anymore. It's table stakes.

Engineers who can judge become more valuable. The ability to look at AI output and say "this is correct but wrong," to notice the missing database schema, the unnecessary abstraction, the O(n^2) logic hiding in plain sight, that remains scarce.

The skill gap might widen. Engineers with good taste will use AI to produce more than they ever could alone. Engineers without taste will drown in AI-generated suggestions they can't evaluate.

And then there's the identity piece. A lot of engineers, myself included, built our identities around being good at writing code. "I write clean code." "I can implement complex algorithms." If AI can do all of that too, what's left?

The reframe I'm trying to internalize: my value was never about the code. It was about the decisions. The code was just the artifact. The decisions are what mattered.

I don't fully believe this yet. Part of me still wants my value to be in implementation, in the craft of writing good code. I'm working on it.

What I'm Trying

If judgment is the skill that matters, how do you develop it faster? I don't have an answer. Taste takes time, and there might be no shortcut around experience. But here's what I'm experimenting with, with honest notes on how it's going.

Asking "why" before accepting. When AI suggests something, I try to pause and ask: Why does this matter? What problem does it solve? What's the cost of not doing it? I fail at this constantly. The suggestions come fast and often look good. I'd estimate I actually pause maybe half the time. The other half, I'm in flow and just accept.

Keeping a decision log. I've started writing down decisions and revisiting them later. "Accepted the config class suggestion on 2025-06-15. Revisit in two weeks." Most of the log is boring. But occasionally I catch something: a decision that seemed fine is now causing friction. Those catches update my internal model, slowly.

Saying no by default. Don't add the abstraction unless I can articulate why it's necessary. Don't accept the feature engineering without checking the correlations first. This slows me down. It feels inefficient. But the PromptTemplate debugging session was one hour wasted because I'd saved thirty seconds by not thinking.

Staying close to consequences. I'm trying to stay closer to my code after shipping, watching what breaks, what gets confusing, what needs to change. And when something takes longer than expected, asking "what went wrong?" rather than just fixing it. The PromptTemplate incident was useful because I paid attention. Most of my mistakes, I probably don't notice. I suspect I'm catching maybe 20% of the signals.

Open Questions

There's a lot I haven't figured out, and I want to be explicit about what's genuinely unresolved for me.

Can taste be taught, or only developed? I've argued it requires experience, consequences, and reflection. But maybe there are ways to accelerate the process. Case studies of decisions that went wrong. Apprenticeship models where juniors watch seniors decide and hear the reasoning. Code review cultures that focus on "should we?" not just "does it work?" I'm skeptical of shortcuts, but if someone has found a way to develop engineering judgment in two years instead of ten, I'd genuinely like to know how.

Is there a floor of human judgment that always matters? Or will AI eventually develop something functionally equivalent to taste? My structural argument says taste requires consequences, and AI doesn't have consequences. But if future AI can simulate or approximate consequences through training signal design, would that be enough? I honestly don't know. My gut says no, but my gut has been wrong about AI capabilities before.

What does AI-native judgment look like? Maybe the answer isn't "humans provide judgment, AI provides generation." Maybe there's a collaborative mode where judgment emerges from the interaction itself. I haven't experienced this yet, but maybe that's a limitation of current tools, not a fundamental constraint.

How do you hire for taste? If judgment is the scarce skill, how do you evaluate it in candidates? Code tests measure generation. System design interviews often reward memorized patterns. Maybe the best signal is watching someone review AI-generated code and seeing what they accept, what they reject, and how they explain the difference.

What happens to engineers who only know how to generate? This is the question that makes me most uncomfortable. If the value shifts from generation to curation, what happens to the people on the wrong side of that shift? I don't have an answer, and I'm wary of anyone who claims they do.

These questions are genuinely open. I don't know.

Where This Leaves Me

Here's what I keep coming back to.

Working with AI coding assistants for the past nine months has taught me that my value isn't in writing code. It's in knowing what code should exist. It's in looking at ten suggestions and knowing which one matters. It's in saying no.

That's uncomfortable because it's not what I trained for. I spent years getting good at implementation. Now implementation is getting cheap. The skill is judgment, and judgment is something I'm still developing.

Maybe you're in the same position. Watching AI generate code faster than you can, wondering what your role is, feeling like the ground is shifting under you.

I think the role is this: you're the one who says no. You're the one with skin in the game. You're the one who knows what this codebase needs, what this product requires, what this user wants.

AI generates. You decide. That's the division of labor, at least for now.

It's not what I expected engineering to become. But I think it's where we're headed. And the engineers who develop taste, who build their implicit models of what matters through experience and reflection, will be the ones doing the most interesting work.

The most useful AI might be the one that gives you more opportunities to practice saying no.


What would change my mind: Evidence that AI systems can develop something functionally equivalent to taste through training approaches I haven't considered, particularly approaches that incorporate long-term code outcomes into training signals. Or a convincing demonstration that the "taste" I'm describing is actually just a form of pattern matching that AI could learn with sufficient examples and the right training objective. I'm most uncertain about whether the structural argument holds, or whether I'm describing current limitations that will look quaint in three years.