The AI you used on Tuesday does not remember you on Thursday. There is no hidden archive connecting the sessions. There is no learning happening between them. Each conversation starts from what is loaded into it — instructions, files, notes — and the model responds as if it has never seen you before. Because it has not.

What we call "persistence" in current AI systems is really scheduled context loading. Someone — a user, a system, a tool — decides what gets loaded into the model's window at the start of each session. When that loading is done well, and the external state it draws from has been maintained carefully, the model behaves as if it has continuity. It picks up where things left off. It respects constraints from last week. It produces work that builds on what came before. But none of that is memory. It is an illusion sustained by good information management outside the model.
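
To make the mechanism concrete, here is a minimal sketch of what that scheduled loading can look like in practice. The file names (PROJECT_BRIEF.md, constraints.md, CHANGELOG.md) are placeholders, not a prescribed layout; the point is that the continuity lives in files on disk, and a loader decides what the model sees when a session opens.

    from pathlib import Path

    def load_session_context(project_dir: str) -> str:
        """Assemble the opening context for a brand-new session from external files."""
        root = Path(project_dir)
        parts = []
        for name in ("PROJECT_BRIEF.md", "constraints.md"):
            f = root / name
            if f.exists():
                parts.append(f"## {name}\n{f.read_text()}")
        changelog = root / "CHANGELOG.md"
        if changelog.exists():
            # Only the most recent entries; older history stays on disk.
            recent = changelog.read_text().splitlines()[-50:]
            parts.append("## Recent changelog\n" + "\n".join(recent))
        return "\n\n".join(parts)

    # This string becomes the first thing the model sees. Nothing else carries over.
    opening_context = load_session_context("./my-project")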

This is perhaps the most important thing to understand about the large language models most people interact with today. They do not grow. They do not learn from your sessions. They do not get better at working with you over time — unless you are getting better at managing what they see. Every improvement in "the AI's performance" is really an improvement in the system around it: better files, better instructions, better structure, better decisions about what to carry forward and what to leave behind.

That realization changes how you use AI more than any prompt trick ever will. And it starts with understanding what the model actually sees.

What context actually is

Context is not just "the last thing you typed into ChatGPT." It is the full working situation the model sees on any given turn: the instructions, the files, the session history, the order in which information appears, the tools available, and the notes, changelogs, and runbooks that exist outside the chat.

That is why the same model can feel brilliant in one setup and flaky in another. People interpret this as personality or quality variance — "today it is dumb," "this model is smarter," "that one understands me better." Sometimes something really has changed upstream: the model developer quietly adjusts how the model behaves without changing the version number — tuning it to be more cautious, or less verbose, or to refuse differently. But even then, those changes were made to the model before it reached you, not in response to anything you did. The model does not adapt to you. Your context adapts to it — or fails to. And more often than not, when the output quality shifts between sessions, the model did not change at all. The context did.

Context has hard limits

Every piece of context — every instruction, file, and prior message — is broken into tokens. Tokens are the unit of processing. They determine how long a response takes and how fast your subscription usage is consumed, and they count against the hard ceiling on how much the model can consider at once: the context window.
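
Here is a rough sketch of what budgeting against that ceiling looks like. Real tokenizers are model-specific; the four-characters-per-token estimate below is a common rule of thumb used purely for illustration, as are the window and reserve numbers.

    def estimate_tokens(text: str) -> int:
        # Crude approximation: roughly four characters per token.
        return max(1, len(text) // 4)

    def fits_in_window(pieces: list[str], window: int = 200_000, reply_reserve: int = 8_000) -> bool:
        # Everything the model will see, plus room for its answer, must fit under the ceiling.
        used = sum(estimate_tokens(p) for p in pieces)
        return used + reply_reserve <= window

    context = ["system instructions ...", "project brief ...", "conversation so far ..."]
    print(fits_in_window(context))  # True here; the interesting case is when it stops being true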

In 2023, most models worked with roughly 4,000 to 8,000 tokens — a few pages of text. Today, some accept over a million. But bigger windows do not eliminate the problem. They move the wall further away. The wall is still there.

Hardware does not have the brain's elegant selectivity. It compensates with capacity — and capacity has physical and economic costs. Larger context windows demand more memory, more compute per token, and more energy per query. Hardware will get more efficient and energy will eventually get cheaper, but those gains compete with two countervailing pressures. The work people want AI to do scales with what is possible — every increase in capacity invites uses that consume the new headroom faster than it appears. And the economics of frontier models push providers to monetize gains rather than give them away. In March 2026, OpenAI released GPT-5.4 with a one-million-token context window — matching the context already offered by Anthropic's flagship Claude models.1 A month later, Anthropic blocked Claude subscribers from running OpenClaw against their flat-rate plans, citing capacity management; affected users reportedly faced cost increases of up to fifty times their previous monthly outlay to keep the same workflow under the new metered pricing.2 The two announcements are not contradictions. They are the same trajectory: as the technology matures, the gap between what is technically possible and what is affordably available to an individual is widening, not closing. Barring a fundamental hardware or energy breakthrough — the kind of shift practical quantum computing or radically cheaper electricity would represent — context will remain a budgeted resource. And the risk that matters is not running out of budget; it is spending more on a workflow than the workflow returns, at which point AI's marginal advantage evaporates regardless of how much budget remains.

When a conversation outgrows the window, the system compacts — silently summarizing or dropping earlier material to make room. The user sees a continuous thread, if they are lucky. The model sees a lossy compression of one. Instructions from the beginning may be gone. Constraints established early may vanish. The system may contradict itself without knowing it, because the contradicted material is no longer on the table.
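
A toy version of that compaction step makes the failure mode visible. The budget and messages are invented; the behavior to notice is which material falls out first.

    def compact(messages: list[str], budget_tokens: int) -> list[str]:
        def cost(msgs: list[str]) -> int:
            return sum(len(m) // 4 for m in msgs)
        kept = list(messages)
        while kept and cost(kept) > budget_tokens:
            kept.pop(0)  # the earliest material is dropped first, silently
        return kept

    history = [
        "SYSTEM: never change the API schema",  # established early
        "turn 1 ...",
        "turn 2 ...",
        "turn 3 " + "x" * 3000,                 # one long recent turn
    ]
    print(compact(history, budget_tokens=760))  # the schema constraint is gone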

On subscription plans, the cost of bloated context is hidden behind a monthly fee and usage caps. On pay-per-use plans, you pay per token — and every token of context you carry forward is billed again on every turn. Systems like OpenClaw make this tradeoff visible. OpenClaw maintains what looks like a persistent agent session — with files, tools, and accumulated project state — but under the hood, that "persistence" is exactly the scheduled context loading described above. Each turn, the system decides what to load back in. A productive session can consume tokens at a rate that would shock a user coming from a simple chatbot. Jensen Huang, the CEO of NVIDIA — the company whose chips most of the current AI industry runs on — put the order of magnitude bluntly at the Morgan Stanley TMT Conference in March 2026: agentic tasks consume on the order of a thousand times more tokens than a typical generative AI prompt.3 But the lesson is not that context is expensive. The lesson is that unmanaged context is expensive. The same budget spent on a well-structured session — the right files loaded, the irrelevant material excluded — will produce better work than a longer session where everything is dumped in and the model sorts it out. Context cost is not just a billing problem. It is a quality signal.
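
A back-of-the-envelope sketch shows how that billing compounds. The price below is illustrative, not any provider's actual rate; the shape of the curve is the point: every token you carry forward is paid for again on every subsequent turn.

    PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # assumed rate, in dollars, for illustration only

    def session_input_cost(tokens_added_per_turn: list[int]) -> float:
        billed = 0
        carried = 0
        for new_tokens in tokens_added_per_turn:
            carried += new_tokens  # this turn's additions join the carried context
            billed += carried      # the whole context is sent, and billed, again
        return billed / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

    # Ten turns that each add 20k tokens bill 1.1M input tokens, not 200k.
    print(session_input_cost([20_000] * 10))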

Part of the excitement around AI agents that run on your own computer — including but not limited to OpenClaw — is that the contents of your hard drive become readily available as context inputs. Project files, notes, code, documents, email — anything the agent can read, it can use. That is genuinely powerful. It is also genuinely dangerous. If an agent has broad access to your filesystem and you store sensitive information on it — credentials, client data, financial records — that information can end up in a prompt sent to a remote model, summarized in a log, or surfaced in output that gets shared. This happens every day. The same principle applies: context management is not just about quality and cost. It is about controlling what the system sees, including what it should never see.
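
One way to keep that boundary explicit is to scope what the agent may read before anything reaches a prompt. A minimal sketch, with illustrative paths: only files under an explicit allowlist are eligible as context, and everything else is refused by default.

    from pathlib import Path

    ALLOWED_ROOTS = [Path("~/projects/report-q3").expanduser().resolve()]

    def readable_by_agent(path: str) -> bool:
        # Resolve symlinks and "..", then require the path to sit under an allowed root.
        p = Path(path).expanduser().resolve()
        return any(p.is_relative_to(root) for root in ALLOWED_ROOTS)

    print(readable_by_agent("~/projects/report-q3/notes.md"))     # True
    print(readable_by_agent("~/Documents/tax-returns-2025.pdf"))  # False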

The anthropomorphism problem

People say models remember, understand, decide. Sometimes those words are harmless shorthand. Sometimes they cause real damage — because they lead people to interact with the system as if it were a person who can be corrected, coached, or scolded into better behavior.

Consider: a Penn State study4 found that rude prompts produced slightly more accurate responses than polite ones. The headlines wrote themselves — "yelling at AI works." But what actually happened? A blunter prompt changed the context. The tone was incidental. The framing was functional. That is a context effect mistaken for a relationship effect, and it is a useful illustration of how anthropomorphic thinking obscures the mechanism.

The same confusion plays out in everyday use. A user reprimands a chatbot for a mistake. The next response improves — not because the system "learned" from the correction, but because the correction added new context that happened to steer the output. Close the session, reopen it tomorrow, and the "lesson" is gone. It was never a lesson. It was a temporary change to what was on the table.

This becomes more interesting in multi-agent systems. In ClawSuite Relay, I work with agents that exchange messages and modify each other's working context through the exchange itself. A drafter and a reviewer passing work back and forth are not "teaching" each other — they are filling each other's context windows with relevant material that reduces drift and improves the next pass. The value is real. But it comes from structured context exchange, not from agents developing judgment.
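
A sketch of what that structured exchange reduces to, with call_model standing in for whichever client is actually in use (it is a placeholder, not a real API). Each pass simply loads the other role's output into this role's context; nothing persists beyond what is passed.

    def call_model(system: str, context: str) -> str:
        # Placeholder: wire this to your model client of choice.
        raise NotImplementedError

    def review_cycle(draft: str, criteria: str, rounds: int = 2) -> str:
        for _ in range(rounds):
            review = call_model(
                system="You are a reviewer. Evaluate strictly against the criteria.",
                context=f"CRITERIA:\n{criteria}\n\nDRAFT:\n{draft}",
            )
            draft = call_model(
                system="You are a drafter. Revise the draft to address the review.",
                context=f"DRAFT:\n{draft}\n\nREVIEW:\n{review}",
            )
        return draft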

Once you see this clearly, the solutions become obvious: put critical instructions in files that persist outside the window, structure the conversation so the most important material stays visible, and design the workflow so compaction cannot silently destroy it. These are engineering decisions, not relationship management.
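
One of those decisions, sketched: keep the critical instructions in a file and rebuild every turn with them at the top, so that whatever compaction does to the conversation history, the pinned material is always back on the table. The file name and history cutoff are illustrative.

    from pathlib import Path

    def build_turn(history: list[str], user_message: str) -> str:
        pinned = Path("PINNED_CONSTRAINTS.md").read_text()  # lives outside the window, re-read every turn
        recent = history[-20:]                              # the history may be trimmed or summarized...
        return "\n\n".join([pinned, *recent, user_message]) # ...but the pinned instructions never are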

How the prompt era missed this

In January 2023, Andrej Karpathy tweeted that "the hottest new programming language is English."5 Almost four million people saw it. Within months, Anthropic was advertising prompt engineer roles at $335,000 a year.6 LinkedIn filled with humanities majors turned AI whisperers. By early 2024, Indeed searches for "prompt engineer" had collapsed to a fifth of their peak.7

The prompt engineering wave was not wrong, exactly. It was incomplete. Prompting is the visible surface. But the quality of the output was always controlled by something deeper: what was on the table when the model responded. The prompt engineers who lasted were the ones who figured that out and became context engineers instead.

The actual mechanisms of persistence

If the model does not remember, how does useful work survive across sessions? Not through the model. Through the system around it.

A changelog is not only for developers. A runbook is not only for operations teams. A project folder is not only for coders. These are the actual mechanisms of persistence — not the model's memory, but yours, made legible enough for the model to use. When the external state is messy or stale, the illusion of continuity breaks — and the model produces work that looks continuous but is built on the wrong foundation.
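
The other half of that mechanism is writing state back out, because the next session will "remember" exactly what this one recorded and nothing more. A minimal sketch, with an invented entry format:

    from datetime import date
    from pathlib import Path

    def record_session(changelog_path: str, decisions: list[str]) -> None:
        # Append today's decisions so the next session can load them as context.
        entry = [f"## {date.today().isoformat()}"] + [f"- {d}" for d in decisions]
        with Path(changelog_path).open("a") as f:
            f.write("\n".join(entry) + "\n\n")

    record_session("CHANGELOG.md", ["Dropped the per-user cache", "Kept the v2 schema"])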

Context design, not prompting

Here is a failure mode that only makes sense through the context lens. I was running adversarial reviews of a large technical document — multiple AI reviewers with different personas evaluating drafts across dozens of iterations. The framework was structured: specific review criteria, a synthesis pass across reviewers, iterative revision. After too many turns to count, I noticed the scores were drifting upward even as the substance plateaued. The reviewers were not evaluating quality anymore. They were rewarding density — more points made, more times repeated, longer treatment of each issue. The context had shifted from "evaluate this document against these criteria" to "the pattern in this conversation is that longer and more detailed scores higher."

That was not a prompting failure. The prompts were fine. It was a context failure. The accumulated weight of dozens of review-revision cycles had polluted the reviewers' windows with a pattern that no longer served the goal. The fix was not a better instruction. It was resetting the context — starting fresh sessions with clean criteria and the current draft, without the accumulated history that had taught the reviewers the wrong lesson.

Users often dread starting a new session, because it feels like losing everything. But the pain of a reset is inversely proportional to how well you have built your external state. If your criteria, your current draft, and your review standards live in files outside the session, a reset costs you nothing but accumulated noise. If they only exist inside the conversation, a reset is devastating — which is exactly why they should not only exist inside the conversation. Even frontier vendors are now surfacing this directly. Claude Code's interface prompts the user with messages like "new task? /clear to save 113.1k tokens" — an admission, baked into the product itself, that periodic context flushing is the practical default and that token weight is something the user is meant to manage, not ignore.

This is the kind of problem that the "just prompt better" frame cannot see. The prompt did not change. The context around it did — gradually, silently, in a direction that made the output feel like progress while the actual quality stalled.

The question that matters is not "what should I say to the AI?" It is "what should the AI see when it responds?" That question will outlast every model on the market today.
