The Agent That Rewrites Its Own Memory Between Sessions

Most agent failures are not model failures. The model is capable enough. The problem is that each new session starts cold. The agent has done this codebase, this client, this filing format a hundred times - and has no idea.

Anthropic shipped a partial answer to that problem on May 6 at Code with Claude in San Francisco. They called it Dreaming.

Dreaming is a scheduled process that reviews past agent sessions, surfaces patterns, and curates memory so agents improve between runs. Recurring mistakes, shared workflows, and team preferences get pulled into a more useful memory store.

That is the mechanism. The significance is in what it replaces: manual prompt engineering, hardcoded context files, and the frustrated engineer who keeps adding "remember to check for null values in the response schema" to the system prompt every time the agent forgets.

Crucially, Dreaming does not modify the underlying model weights. "We're not changing the model itself through dreaming - it's not doing updates to the weights or anything like that." Instead, the agent writes learnings as plain-text notes and structured "playbooks" that future sessions can reference, making the entire process observable and auditable by humans.

That distinction matters more than it might look. Plain-text notes a human can read and delete are different from a black-box fine-tune. You can inspect what the agent decided to remember. You can correct it. You can see when something went wrong in a prior session and made it into the playbook.

The early results are real, if self-reported. Legal AI company Harvey saw task completion rates increase roughly 6x after implementing Dreaming. Medical document review company Wisedocs cut document review time by 50% using Outcomes, the companion grading feature. And Netflix is now processing logs from hundreds of builds simultaneously using multi-agent orchestration.

Harvey's 6x number is the one getting attention, but read it carefully. The 6x claim from Harvey is the headline number, but it is also the one Anthropic is publishing without an external benchmark to back it up. Practitioners will want to know whether Harvey's tests are representative of broader agent workflows or specific to the long-form legal-drafting tasks where the company already had a clear pre-dreaming failure mode.

That is not a reason to dismiss the feature. It is a reason to pilot it on your own failure mode rather than assume the number transfers.

Where does Dreaming actually earn its keep? The behavior matters most for what Anthropic calls long-running work: jobs that span many sessions, days, or operators, where the same agent or fleet of agents keeps returning to the same problem space. The classic failure mode for an LLM agent in that setting is that every session starts fresh, so every session relearns the same recurring quirks.

Think code review agents that work across hundreds of PRs in a monorepo. Think support triage that handles the same category of tickets daily. Think internal tools that process recurring financial reports. In all of these, the patterns are stable and the agent keeps forgetting them. Dreaming is the right shape of solution for that class of problem.

It is a worse fit for one-off tasks and low-frequency workflows. If an agent runs once a month, it has almost nothing to learn from. The memory store stays sparse, and the overhead of the scheduled review process adds more than it gives back.

A lead agent can delegate to specialist subagents working in parallel on a shared filesystem, each with its own model, prompt, and tools. The whole flow is traceable in the Claude Console. This is the part that should interest engineering leads building multi-agent pipelines. The observability story - being able to see which agent did what and why - is where most teams currently have the biggest blind spot. A teammate like Beagle can surface that audit trail into Slack, but the data has to exist somewhere first. Managed Agents now produces it by default.

The security concern Anthropic's documentation quietly flags deserves more attention than it gets in the launch coverage. A concern raised by some security researchers is that giving agents structured persistent memory expands the attack surface for prompt-injection and memory-poisoning attacks. If a malicious input can convince an agent that the wrong instruction is the right one, Dreaming may consolidate that wrong instruction into the agent's long-term memory store, where it will be applied to future sessions automatically. Anthropic's documentation flags the issue and recommends human review on memory updates for high-stakes workflows.

The governance question sitting underneath all of this: who owns the agent's memory? Is it the platform team that deployed the agent? The team whose codebase it works in? The engineer who wrote the original system prompt? Anthropic's tooling lets you inspect and edit the memory store, but it does not answer who is responsible for reviewing it. That is a process question, not a product question. Teams deploying Managed Agents with Dreaming enabled should answer it before the first session runs.

The five features Anthropic shipped - Dreaming, Outcomes, multi-agent orchestration, Claude Finance with 10 pre-built agents, and Add-ins - are a more honest map of where the real competition in AI has moved. The frontier model race has quieted relative to the harness race. Codex versus Claude Code is a more meaningful contest right now than GPT versus Opus.

That framing is right. The teams getting the most out of agentic coding tools right now are not the ones who switched models last month. They are the ones who have spent the past quarter building memory management, output grading, and failure-pattern tracking by hand. Dreaming, Outcomes, and the rest of the Managed Agents stack are Anthropic productizing what those teams figured out on their own.

The feature is in research preview. The governance tooling is early. But the underlying idea - that an agent running the same class of work repeatedly should get measurably better at it over time without someone rewriting the prompt - is sound, and this is the first implementation of it that is observable enough to trust in production.

Keep reading

Publish Your ai-catalog.json Before Agents Hard-Code Around You

Agent Code Execution Runs in a Kernel Nobody Shares

Cursor in Slack Now Plans Before It Codes