Here's a common failure mode with conversational agents: the system works well for the first two turns, then on the third turn, it asks the user a question they've already answered. Users don't just dislike this. They're offended by it. Being asked to repeat yourself by a machine that should, by all appearances, know better feels different from being asked by a human. It feels dismissive.

The natural reaction is to assume the model needs better memory. Some kind of memory system: maybe a vector store, maybe a persistent conversation log. Something that would let the model "remember" what the user already said.

But the model has no memory. It never did. What it has is a context window, and in most deployments, that window is being managed badly.

The Fix That Reframes the Problem

The fix for a forgetful agent isn't a memory system. It's better context engineering. Restructure what gets passed into the model at each turn so it can act as if it remembers. Instead of dumping the raw conversation history into the prompt and hoping for the best, build a structured summary that gets updated every turn. Something like: "user sentiment: frustrated, prior questions asked: [list], issues raised: [list], resolution status: pending."
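A minimal sketch of that per-turn summary, in Python. The field names and update logic are illustrative assumptions, not a prescribed schema; the point is that the rendered note, not the raw transcript, is what gets injected into the prompt each turn:

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Structured summary rebuilt every turn and injected into the prompt."""
    sentiment: str = "neutral"
    questions_asked: list[str] = field(default_factory=list)
    issues_raised: list[str] = field(default_factory=list)
    resolution_status: str = "pending"

    def update(self, sentiment=None, question=None, issue=None, status=None):
        # Merge whatever this turn surfaced; unchanged fields persist.
        if sentiment:
            self.sentiment = sentiment
        if question and question not in self.questions_asked:
            self.questions_asked.append(question)
        if issue and issue not in self.issues_raised:
            self.issues_raised.append(issue)
        if status:
            self.resolution_status = status

    def render(self) -> str:
        # This string is the model's "memory" for the next turn.
        return (
            f"user sentiment: {self.sentiment}\n"
            f"prior questions asked: {self.questions_asked}\n"
            f"issues raised: {self.issues_raised}\n"
            f"resolution status: {self.resolution_status}"
        )

state = ConversationState()
state.update(sentiment="frustrated", question="Why was I double-billed?")
state.update(issue="duplicate charge on March invoice")
print(state.render())
```

The deduplication in `update` is what prevents the turn-three failure: a question already in the list never gets re-asked, because the model can see it was asked.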

The model doesn't remember that the user is frustrated. It reads a structured note that says the user is frustrated. From the outside, the behavior looks identical. The user gets a coherent, context-aware response. But mechanically, what's happening is completely different from remembering.

It's tempting to treat this as an implementation detail. Who cares how the model "knows" something, as long as the output is right? But the more you build agents across different domains, the more this distinction starts to feel like the interesting part.

When we engineer context to simulate memory, are we faking something real, or are we accidentally stumbling into how memory actually works?

What RAG Taught Me About Relevance

Before getting to that question, it's worth examining what RAG actually delivers versus what it promises. The pitch is simple: store information in a vector database, retrieve what's relevant, inject it into context. Your agent now has "memory" that scales beyond the context window. In practice, it's far messier than the pitch suggests.

Retrieval is noisy. Top-k results often return documents that are tangentially related but not actually useful. The model confidently references information from the wrong context, weaving irrelevant facts into plausible-sounding responses. There's no hesitation, no uncertainty signal.

The real problem isn't technical tuning. It's conceptual. Retrieval only works if you can define "relevant" up front, at index time, before you know what queries will come. But what counts as relevant changes with every query, every user, every turn. A piece of information that's irrelevant in turn two might be exactly what's needed in turn seven.

RAG assumes memory works like a filing cabinet: you store things, you look them up, you use them. But the filing cabinet keeps returning the wrong files, not because the search is broken, but because what counts as "the right file" depends on what you're trying to do right now. Relevance is context-dependent, not a fixed property of documents.
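A toy illustration of the point. The index is fixed, but which document is "the right file" flips entirely with the query. Here a bag-of-words cosine stands in for real embeddings, and the document texts are invented:

```python
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (a stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stored once, at index time, before any query exists.
docs = [
    "refund policy for duplicate charges",
    "how to update billing address",
]

def top_doc(query: str) -> str:
    # The storage never changes; the "right file" does.
    return max(docs, key=lambda d: cosine(query, d))

print(top_doc("duplicate charges refund"))      # the refund doc
print(top_doc("change my billing address"))     # the billing doc
```

Nothing about the documents changed between the two calls. Relevance lived in the query, which is exactly what an index-time notion of relevance cannot capture.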

I still use RAG. The point isn't that retrieval is bad. The point is that it reveals a deeper problem: we keep trying to reduce memory to storage plus retrieval, and that reduction keeps falling short in ways that seem structural rather than fixable.

The Human Memory Parallel That Won't Leave Me Alone

I picked up a book on memory science (Elizabeth Loftus's work on false memories, mainly) and was struck by something I'd vaguely known but never really internalized: human memory isn't a database either.

We don't store experiences and retrieve them faithfully. We reconstruct. Every time you "remember" something, your brain is generating a plausible version of the past based on fragments, associations, and your current context. The memory you have of your tenth birthday isn't a recording. It's a creative act, shaped by everything that's happened to you since, by what someone asked you about it, by your mood when you recalled it.

This is why eyewitness testimony is so unreliable. It's not that witnesses are lying. It's that memory genuinely works this way. The act of remembering is reconstruction, not replay.

And the moment I sat with that idea, I couldn't shake the parallel.

An LLM with a well-engineered context window is doing something structurally similar. It's not "remembering" previous turns of the conversation. It's constructing a plausible continuation given whatever context it's been fed. The fidelity of its "memory" depends entirely on what context you provide, just like human memory depends on environmental cues, emotional state, and what questions get asked.

This is the idea that crystallized for me over several months of building agents: memory is better understood as a process than a thing. It's not a data store you read from. It's an active process of reconstruction, and the quality of the output depends on the quality of the available context. A human in a richly cued environment (familiar place, specific smells, a conversation that triggers associations) will "remember" more than the same human in a sterile room. An agent with well-structured context will behave more coherently than the same agent with a raw conversation dump.

But how seriously should I take this analogy? I keep going back and forth. Part of me thinks I'm pattern-matching too aggressively, finding similarity where the mechanisms are completely different. And when I look at the mechanisms honestly, the differences are significant.

Human memory reconstruction involves emotional weighting: the amygdala tags experiences with salience, so a frightening encounter gets encoded differently than a mundane one. LLMs have no analog. Every token in the context window has equal standing unless an engineer explicitly structures it otherwise.

Human memory has temporal decay and consolidation. You forget most of what happened yesterday, but sleep-replay strengthens important memories into long-term storage. Repetition and emotional intensity determine what persists. LLMs have flat context with no decay. A message from turn two and a message from turn twenty carry the same weight unless you build explicit recency heuristics.
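The recency heuristic mentioned above is usually something crude like exponential decay. A sketch, where the half-life is a tunable assumption rather than a principled constant:

```python
def recency_weight(turn: int, current_turn: int, half_life: float = 5.0) -> float:
    """Exponential decay as a crude analog of temporal forgetting.

    half_life is the number of turns after which a message's weight
    halves -- an arbitrary knob, not a biological constant.
    """
    age = current_turn - turn
    return 0.5 ** (age / half_life)

# Scored at turn 20: without this, both messages weigh the same.
print(round(recency_weight(2, 20), 3))   # → 0.082 (turn two, heavily discounted)
print(round(recency_weight(20, 20), 3))  # → 1.0 (current turn, full weight)
```

The contrast with biology is the point: here the decay curve is fixed by an engineer, while human forgetting curves shift with context and use.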

Human reconstruction is shaped by the body. Embodied cognition research suggests that physical states (posture, heartbeat, gut feelings) influence what and how we remember. LLMs are entirely disembodied.

So why do I still think the analogy is useful? Because despite these mechanistic differences, both systems face the same information-theoretic constraint: you can't store everything, so you reconstruct from lossy compression. Both approximate the past rather than replaying it, fill gaps with pattern-completion, and produce outputs shaped as much by the current recall context as by the original experience. The convergence might be superficial. But it might point to something fundamental about what memory has to be when storage is finite and reconstruction is the only option.

The "Context Is the Product" Realization

The model isn't the product. The context is the product.

This plays out concretely in practice. A less capable model with excellent context engineering (stable system prompts, well-organized tool schemas, structured memory summaries) can consistently outperform a frontier model with sloppy context management. Same API, same pricing tier. The bottleneck isn't intelligence. It's information availability: the right information, structured well, at the right moment.

As I wrote in my earlier context engineering essay, "the agent is in the context." But now I'd extend it: the agent's memory is in the context too. Its competence, its personality, its apparent expertise. All properties of the context, not of the model. If that's true, the people who create the most value aren't necessarily building better models. They're building better scaffolding: the infrastructure of artificial memory.

Where Context-as-Memory Breaks Down

I want to be honest about the limits of this framing, because I think there are real ones.

The most obvious: context windows are finite. No matter how well you engineer context, you can only fit so much information into a prompt. Human memory, for all its reconstructive imperfection, draws on a lifetime of encoded experience. An agent's "memory" starts fresh with each session unless you've built elaborate external storage systems.

There's also a scalability problem I haven't solved. As the amount of external state grows, the challenge of deciding what to include becomes harder, not easier. You end up building a system that needs to "remember" what's worth remembering, a meta-memory problem that feels suspiciously circular.

But the deepest limit is about weight updates. Almost everything we call "AI memory" today (RAG, conversation logs, structured summaries, tool state) changes what the model sees, not the model itself. Fine-tuning is the exception: it actually updates weights, making it closer to learning. But fine-tuning is expensive, slow, and carries the risk of catastrophic forgetting.

This creates an interesting tension. Children have poor working memory but excellent long-term learning: each experience changes them. LLMs have vast compressed knowledge but no individual learning: the model stays fixed while the context changes. Agent systems are attempting a third path: keep the model frozen, but build enough external scaffolding that the overall system behaves as if it learns.

Context engineering gives you deliberate, structured recall: the kind of memory where you consciously look something up. What it can't give you is automatic, below-conscious-awareness memory, the kind where you just "know" things without being able to explain why. Think about how an experienced doctor walks into a room and immediately senses something is wrong before consciously processing any symptoms. That recognition is built from thousands of weight-updating encounters. Context engineering can give an agent the doctor's notes, but not the doctor's instincts.

The Counter-Argument I Take Seriously

The strongest pushback to what I just wrote comes from builders of systems like MemGPT (now Letta) and similar hierarchical memory architectures. Their position: with sufficient scaffolding (vector stores, structured state, episodic logs, meta-memory systems), there is no ceiling to context-as-memory. You can replicate anything weight updates do.

Their best argument is specific and technical. Hierarchical memory management can replicate consolidation, moving important information from working memory to long-term storage. Recency-weighted retrieval can replicate temporal decay. Importance scoring functions can replicate emotional salience. If you build the right external architecture, you get the functional equivalent of biological memory processes without needing to touch the weights.
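The shape of that argument can be made concrete. Systems in this family typically rank memories by a product of recency, importance, and relevance; the sketch below is my own illustrative composite (the weights, half-life, and example items are all assumptions), not any particular system's scoring function:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    turn: int          # when it was recorded (recency / decay analog)
    importance: float  # 0..1, assigned at write time (salience analog)

def score(item: MemoryItem, current_turn: int, relevance: float,
          half_life: float = 10.0) -> float:
    """Combine the three engineered analogs of biological memory:
    recency (decay), importance (salience), relevance (query match)."""
    recency = 0.5 ** ((current_turn - item.turn) / half_life)
    return recency * item.importance * relevance

memories = [
    MemoryItem("user prefers email over phone", turn=3, importance=0.9),
    MemoryItem("user mentioned the weather", turn=19, importance=0.1),
]

# With equal query relevance, the salient old memory outranks the
# trivial recent one: importance scoring overrides pure recency.
ranked = sorted(memories,
                key=lambda m: score(m, current_turn=20, relevance=1.0),
                reverse=True)
print(ranked[0].text)
```

This is the functional-equivalence claim in miniature: no weights were touched, yet the retrieval behavior mimics salience-weighted recall.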

I find this more compelling than I want to. But I'm not fully convinced. The biological processes they're replicating (consolidation, decay, salience-tagging) are themselves adaptive. The human brain doesn't just consolidate memories during sleep, it adjusts how it consolidates based on what's been useful. The salience-tagging system recalibrates. The decay curves shift with context. These meta-level adaptations are emergent properties of a system that updates its own processing through experience.

Engineered memory scaffolding handles the cases its designer anticipated. When it encounters a genuinely novel memory challenge, it falls back on fixed heuristics. It doesn't adapt its own memory strategy. That meta-adaptiveness is what I think is missing, and I'm not sure external scaffolding can replicate it without becoming so complex that it's effectively a second learning system layered on top of the first.

I could be wrong. If I am, we should see it in the benchmarks within a few years.

What Would True Memory Look Like?

If I try to imagine an LLM-based system with genuine memory, not context-as-memory but something that actually updates its processing based on experience, it faces three hard problems.

First, continual learning. Integrating new information without losing old capabilities remains unsolved in any general way, despite interesting work (C-Flat creating flatter loss landscapes, VERSE preserving past knowledge through virtual gradients). Models either forget old things when learning new ones, or they become rigid and resist updates.

Second, consolidation. What's worth encoding permanently versus discarding? Humans solve this through sleep and repetition. I wonder if there's an analog for AI systems, some automated process that reviews recent context and decides what should influence the model's weights.
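One naive version of that automated review, purely as a thought experiment: keep what repeats or what was tagged salient, discard the rest. The thresholds and example items are arbitrary assumptions, and deciding what feeds the salience dict is the actual unsolved problem:

```python
from collections import Counter

def consolidation_candidates(recent_items: list[str],
                             salience: dict[str, float],
                             min_repeats: int = 2,
                             min_salience: float = 0.8) -> list[str]:
    """Toy consolidation pass: like sleep replay, promote what repeats
    or what was tagged highly salient; let everything else decay."""
    counts = Counter(recent_items)
    keep = []
    for item, n in counts.items():
        if n >= min_repeats or salience.get(item, 0.0) >= min_salience:
            keep.append(item)
    return keep

recent = ["prefers dark mode", "asked about pricing",
          "asked about pricing", "one-off typo correction"]
salience = {"prefers dark mode": 0.9}
print(consolidation_candidates(recent, salience))
# → ['prefers dark mode', 'asked about pricing']
```

Repetition and salience each earn promotion here, mirroring the two routes the essay attributes to human consolidation; the one-off item is simply dropped.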

Third, safety. This worries me most. A system that continuously updates based on experience could drift unpredictably. If a customer support agent genuinely "learned" from every interaction, it might learn manipulation tactics from hostile users, or develop biases from non-representative samples. The static nature of current LLMs is actually a safety feature, even if we don't usually frame it that way.

The Questions I Can't Resolve

I feel fairly confident that memory is reconstruction, not replay, and that the context, not the model, is where most of the value currently lives. I'm far less sure whether context-as-memory has a hard ceiling or better scaffolding can always close the gap, and whether the human memory analogy is genuinely illuminating or just aggressive pattern-matching.

Here's how I'd make this less abstract, in three predictions I'm willing to be wrong about:

If context-as-memory has a hard ceiling, I'd expect to see it show up first in tasks requiring cross-session learning, situations where an agent needs to integrate patterns across 50 or more independent interactions and use those patterns to change its behavior without being explicitly told to. If I'm wrong and context engineering can replicate weight updates indefinitely, we should see agents with purely external memory matching fine-tuned models on personalization benchmarks (things like user preference prediction or adaptive tutoring) within the next two years. My bet is that we'll hit the ceiling, but later and higher than most people expect.

If memory-is-reconstruction is the right frame, agents with better-structured context summaries should outperform agents with larger context windows on multi-turn coherence benchmarks. Specifically: a 32k-context agent with structured state management should beat a 128k-context agent with raw conversation dumps on 20+ turn conversations. If that doesn't hold, it suggests raw capacity matters more than structure, and the reconstruction analogy is less useful than I think.

If the weight-update ceiling is real, fine-tuned personal assistants should plateau above RAG-based ones on implicit user preference prediction within two years. The gap should be largest on preferences the user never explicitly stated but consistently acted on. If RAG-based systems match fine-tuned ones even on implicit preferences, it suggests context engineering can substitute for learning more fully than I expect, and I'll need to revise my intuition about where the ceiling sits.

And there's one question that keeps nagging at me: are we asking the right question at all? When I ask "how do we give AI memory?", I'm assuming memory is something a system either has or doesn't have. But maybe memory is more like a spectrum. Maybe what I've been calling "context-as-memory" isn't fake memory or a workaround. Maybe it's a real form of memory, just a different kind than what biological systems have. The context window is to an LLM what sensory input is to a brain: the raw material from which experience is constructed.

If that's right, then context engineering isn't just optimization. It's building the substrate of artificial cognition. Not faking memory, but discovering a different kind of it, one that lives outside the model rather than inside, that persists in scaffolding rather than in synapses. The oldest forms of human memory (oral traditions, written records, institutional knowledge) work the same way: external to any single brain, reconstructed each time they're accessed, shaped by the context of their retrieval. We've been doing this longer than we think.