How AI Agent Memory Actually Works Under the Hood

Most agents forget you the moment the conversation ends. The session closes, the context window clears, and the next time you type a message you're a stranger again. This isn't a design failure; it's the default behavior of every large language model. The model has no state. What looks like "memory" in a polished product is usually a small amount of careful engineering sitting on top of that stateless core.

Understanding how that engineering actually works changes how you build with it and what you trust agents to do.

The context window is short-term memory, full stop

AI agent memory refers to a system's ability to retain, recall, and use information from past interactions. It combines short-term memory (the context window, holding recent conversation turns) with long-term memory: persistent storage of facts, preferences, or learned behaviors outside the model.

Within a single session, the context window does all the work. It serves as working memory, the agent's scratchpad for manipulating information during a task: the mental workspace where an agent holds relevant details while solving a problem. When the window fills up or the session ends, that scratchpad is gone.

This is fine for a one-shot task. It breaks immediately for anything that spans time. If you ask an agent to help with a project on Monday, then follow up on Thursday, the agent doesn't know about Monday unless something has explicitly saved that context somewhere and retrieved it before Thursday's session begins. Nothing about the model does this automatically.

What "long-term memory" actually means in production

Agents need three types of long-term memory: semantic, episodic, and procedural. Each serves a distinct cognitive function.

Semantic memory stores what an agent knows about the world or the user: facts, preferences, constraints, and domain knowledge that hold across time. A CRM agent that remembers "Budget cap $50K" doesn't need the user to repeat themselves every session. When new information contradicts the old ("Budget raised to $75K"), the entry is updated rather than duplicated.
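A minimal sketch of that update-rather-than-duplicate behavior, assuming facts can be addressed by a stable key. The `SemanticStore` class and the `budget_cap` key are hypothetical names for illustration; in practice, deciding that two statements refer to the same fact is usually the hard part and is often delegated to the LLM.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SemanticStore:
    """Keyed facts about a user; one current value per key (illustrative only)."""
    facts: dict = field(default_factory=dict)

    def upsert(self, key: str, value: str) -> str:
        """Store a fact, overwriting any earlier value for the same key."""
        action = "UPDATE" if key in self.facts else "ADD"
        self.facts[key] = {"value": value, "updated_at": datetime.now(timezone.utc)}
        return action


store = SemanticStore()
first = store.upsert("budget_cap", "$50K")   # first mention is added
second = store.upsert("budget_cap", "$75K")  # contradiction replaces the old entry
print(first, second, store.facts["budget_cap"]["value"])
```

The store ends up with one entry for the budget, not two, which is what keeps the agent from quoting the old cap later.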

Episodic memory is the history of what happened: specific interactions logged with enough context to be useful later. A support agent with episodic memory knows this is the third time a user has opened a ticket about the same integration. That changes the response.
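The support-ticket case can be sketched as an append-only log that the agent queries before replying. Everything here (`Episode`, `EpisodicLog`, the `u-17` user id, the ticket summaries) is invented for illustration, not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Episode:
    user_id: str
    topic: str
    summary: str
    occurred_at: datetime


class EpisodicLog:
    """Append-only record of past interactions (hypothetical helper)."""

    def __init__(self) -> None:
        self._episodes: list[Episode] = []

    def record(self, user_id: str, topic: str, summary: str) -> None:
        self._episodes.append(
            Episode(user_id, topic, summary, datetime.now(timezone.utc))
        )

    def count(self, user_id: str, topic: str) -> int:
        """Number of logged episodes for this user on this topic."""
        return sum(
            1 for e in self._episodes
            if e.user_id == user_id and e.topic == topic
        )


log = EpisodicLog()
for summary in ("webhook 401", "webhook 401 again", "webhook still failing"):
    log.record("u-17", "integration", summary)

prior = log.count("u-17", "integration")
if prior >= 3:
    print(f"Repeat issue: ticket number {prior} on this integration")
```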

Procedural memory is how to do things: learned strategies, formatting preferences, debugging patterns. A coding assistant, for example, needs procedural memory to retain the debugging strategies it has learned.

Memory ≠ vector DB. A sound architecture layers several kinds of memory: working memory, summaries, artifacts, and long-term preferences. Most teams start by reaching for a vector database and calling that "memory." It covers the semantic and episodic cases reasonably well, but procedural memory usually needs structured storage with different access patterns, and none of these systems automatically decide what's worth keeping.

How memory gets in and out: the tool call loop

Here's the part that surprises people who haven't built agents before: an agent doesn't write to memory by magic. It does it with a tool call.

When you send a tool schema to an LLM, you're not "registering" a function. You're injecting a JSON blob into the model's context window, typically serialized into the system portion of the prompt. The model has been fine-tuned to emit a special token sequence that signals a tool call. This is why the schema counts against your token budget: it's literally part of the prompt.
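To make the token cost concrete, here is a rough sketch that serializes a tool definition and estimates its size. The `save_memory_tool` definition is hypothetical, written in the JSON Schema style that common function-calling APIs use, and the four-characters-per-token figure is only a rule of thumb for English text, not a real tokenizer.

```python
import json

# Hypothetical tool definition; the name and fields are illustrative.
save_memory_tool = {
    "name": "save_memory",
    "description": "Persist one durable fact about the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "content": {"type": "string", "description": "The fact to store."}
        },
        "required": ["content"],
    },
}

# This string (or something close to it) travels with every request.
serialized = json.dumps(save_memory_tool)

# Rough heuristic only: about four characters per token for English text.
approx_tokens = len(serialized) // 4
print(f"{len(serialized)} characters, roughly {approx_tokens} tokens per request")
```

Multiply that by every tool you expose and every turn in the conversation, and the schema becomes a line item worth trimming.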

So the flow for a memory-enabled agent looks like this: the model reads the conversation, decides a fact is worth saving, and emits a structured JSON object, something like {"tool": "save_memory", "args": {"content": "user prefers async standup format"}}. The LLM itself does not execute the function. It only identifies the appropriate function, gathers the required parameters, and returns them as structured JSON. Your program then deserializes that JSON into a function call and executes it in its own runtime environment.

Your application code catches that output, calls the actual memory service (Pinecone, pgvector, Redis, whatever you're using), and stores the fact. On the next session, the application queries the memory store before sending anything to the model, retrieves relevant records, and injects them at the top of the prompt. The model "remembers" because your code loaded the memory in.
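A minimal sketch of the application side of that handoff, assuming the model's output arrives as a JSON string in the shape quoted above. The `TOOLS` registry and `dispatch` function are hypothetical names; real provider SDKs wrap tool calls in their own response objects.

```python
import json

saved: list[str] = []


def save_memory(content: str) -> str:
    """Stand-in for a call to a real memory service."""
    saved.append(content)
    return "ok"


# Registry mapping tool names the model may emit to local functions.
TOOLS = {"save_memory": save_memory}


def dispatch(model_output: str) -> str:
    """Deserialize a tool call emitted by the model and run it locally."""
    call = json.loads(model_output)
    handler = TOOLS.get(call.get("tool"))
    if handler is None:
        # Never execute a name the model invented.
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    return handler(**call.get("args", {}))


raw = '{"tool": "save_memory", "args": {"content": "user prefers async standup format"}}'
result = dispatch(raw)
print(result, saved)
```

Looking the name up in an explicit registry, rather than resolving it dynamically, keeps the model from reaching any function you did not deliberately expose.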
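The read side can be sketched the same way. This example assumes a naive keyword-overlap ranking in place of the embedding search a real memory service would provide; `retrieve`, `build_prompt`, and the stored facts are all illustrative.

```python
def retrieve(store: list[str], query: str, limit: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would use embeddings."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(m.lower().split())), m) for m in store]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [m for _, m in ranked[:limit]]


def build_prompt(store: list[str], user_message: str) -> str:
    """Prepend retrieved memories so the stateless model can 'remember'."""
    memories = retrieve(store, user_message)
    if not memories:
        return user_message
    header = "Known facts about this user:\n" + "\n".join(f"- {m}" for m in memories)
    return f"{header}\n\n{user_message}"


memory_store = [
    "user prefers async standup format",
    "budget cap is $75K",
]
prompt = build_prompt(memory_store, "Draft the standup update for Thursday")
print(prompt)
```

Only the standup preference is injected here; the budget fact shares no words with the request and stays out of the prompt.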

The extract-then-update cycle has two core stages: extraction and update. The system uses the LLM with context to extract key facts from new conversations. Then, through a tool call, it executes ADD, UPDATE, DELETE, or NOOP instructions to dynamically maintain memory consistency.
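The update stage can be sketched as a small interpreter over those four instructions. The operation format below (`op`, `id`, `text`) is an assumption made for illustration; in a real system the list would come from the LLM's extraction step rather than being written by hand.

```python
def apply_operations(memory: dict[str, str], operations: list[dict]) -> dict[str, str]:
    """Apply extraction results to a memory store and return the new state."""
    updated = dict(memory)  # leave the caller's dict untouched
    for op in operations:
        kind = op["op"]
        if kind in ("ADD", "UPDATE"):
            updated[op["id"]] = op["text"]
        elif kind == "DELETE":
            updated.pop(op["id"], None)
        elif kind == "NOOP":
            continue
        else:
            raise ValueError(f"unsupported operation: {kind}")
    return updated


memory = {"m1": "Budget cap $50K", "m2": "Uses the legacy billing API"}

# Hand-written here; normally produced by the extraction prompt.
operations = [
    {"op": "UPDATE", "id": "m1", "text": "Budget cap $75K"},
    {"op": "DELETE", "id": "m2"},
    {"op": "ADD", "id": "m3", "text": "Prefers async standups"},
    {"op": "NOOP"},
]
memory = apply_operations(memory, operations)
print(memory)
```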

The write side has a latency cost worth knowing about: a memory write that blocks the reply makes every turn slower. One practical approach exposes two async tool functions, addMemories and retrieveMemories, that the agent calls through a function-calling system. Because the writes run asynchronously, they don't add to response latency.
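A minimal sketch of a non-blocking write using Python's asyncio. The `add_memories` function is a stand-in named after the addMemories tool mentioned above, with a short sleep simulating network latency; it is not the API of any specific library.

```python
import asyncio

stored: list[str] = []


async def add_memories(facts: list[str]) -> None:
    """Stand-in for a slow network write to a memory service."""
    await asyncio.sleep(0.05)
    stored.extend(facts)


async def handle_turn(user_message: str) -> tuple[str, asyncio.Task]:
    """Schedule the write in the background and reply without awaiting it."""
    write = asyncio.create_task(add_memories([user_message]))
    reply = f"Noted: {user_message}"
    return reply, write


async def main() -> tuple[str, bool]:
    reply, write = await handle_turn("user prefers async standup format")
    pending_at_reply = not write.done()  # the reply is ready before the write lands
    await write  # keep a reference and await it before shutting down
    return reply, pending_at_reply


reply, pending_at_reply = asyncio.run(main())
print(reply, pending_at_reply, stored)
```

The trade-off is that a write can fail after the user has already seen the reply, so background writes need their own error handling and retries.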

The hard problem: deciding what to remember

The architecture above works. The harder question is what goes in.

The JSON schema you define is not just documentation; it's the only thing the model sees of your tool. If your descriptions are vague, the model will hallucinate arguments. A poorly described save_memory tool will either save everything (noise) or miss things that matter. This is why teams investing seriously in agent memory spend a lot of time on their tool descriptions and extraction prompts, not on their choice of vector database.
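For contrast, here are two hypothetical definitions of the same save_memory tool, plus a small check that flags parameters the model would have to guess at. The wording of the descriptions is an example, not a recommended standard.

```python
# Vague: the model has no guidance on what qualifies as worth saving.
VAGUE_TOOL = {
    "name": "save_memory",
    "description": "Saves memory.",
    "parameters": {
        "type": "object",
        "properties": {"content": {"type": "string"}},
    },
}

# Specific: states when to call the tool and what a good argument looks like.
SPECIFIC_TOOL = {
    "name": "save_memory",
    "description": (
        "Save one durable fact about the user that will still matter in "
        "future sessions, such as a preference or constraint. Do not save "
        "small talk or details that only apply to the current task."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "A single self-contained sentence stating the fact.",
            }
        },
        "required": ["content"],
    },
}


def undocumented_fields(tool: dict) -> list[str]:
    """Return parameter names that lack a description the model could read."""
    props = tool["parameters"].get("properties", {})
    return [name for name, spec in props.items() if not spec.get("description")]


print(undocumented_fields(VAGUE_TOOL))
print(undocumented_fields(SPECIFIC_TOOL))
```

A check like this catches missing descriptions but not bad ones; whether the wording actually steers the model still has to be tested against real conversations.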

Temporal reasoning is the open frontier. A 15-point gap between architectures on temporal queries reflects a genuine divide. Tools built on pure vector similarity are structurally limited in answering "what did the agent know last Tuesday?" without additional infrastructure. Timestamped graph approaches close this gap but add operational complexity.
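One simple form of that "additional infrastructure" is to keep every version of a fact with a timestamp instead of overwriting it. This sketch uses a flat versioned list rather than a timestamped graph, and the `FactVersion` type, dates, and values are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    key: str
    value: str
    valid_from: datetime


def value_as_of(history: list[FactVersion], key: str, when: datetime) -> Optional[str]:
    """Return the latest value of `key` recorded at or before `when`."""
    candidates = [f for f in history if f.key == key and f.valid_from <= when]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.valid_from).value


history = [
    FactVersion("budget_cap", "$50K", datetime(2025, 1, 6)),
    FactVersion("budget_cap", "$75K", datetime(2025, 3, 3)),
]

print(value_as_of(history, "budget_cap", datetime(2025, 2, 1)))  # $50K
print(value_as_of(history, "budget_cap", datetime(2025, 4, 1)))  # $75K
```

A pure similarity search over the latest embeddings cannot answer the February question, because the $50K value has already been replaced.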

Context pollution, where irrelevant information degrades reasoning quality, means you need strategies to compress and organize memories. Retrieval is not neutral. Pulling in too much saved context crowds the context window and can actively hurt the response quality you were trying to improve. A teammate like Beagle, which operates inside Slack channels where conversation history is dense and varied, has to be careful about what it surfaces: not everything a channel has discussed is relevant to the current question.
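One way to limit context pollution is to filter retrieved memories by a relevance threshold and a size budget before they reach the prompt. The scores below are invented stand-ins for similarity scores from a retriever, and the threshold and character budget are arbitrary values chosen for the example.

```python
def select_memories(
    scored: list[tuple[float, str]], min_score: float, max_chars: int
) -> list[str]:
    """Keep the highest-scoring memories that clear a threshold and fit a budget."""
    chosen: list[str] = []
    used = 0
    for score, text in sorted(scored, key=lambda item: item[0], reverse=True):
        if score < min_score:
            break  # everything after this is less relevant still
        if used + len(text) > max_chars:
            continue  # skip items that would overflow the budget
        chosen.append(text)
        used += len(text)
    return chosen


candidates = [
    (0.91, "Deploys are frozen on Fridays"),
    (0.42, "Team lunch moved to Wednesday"),
    (0.78, "Staging uses the eu-west cluster"),
]
context = select_memories(candidates, min_score=0.6, max_chars=80)
print(context)
```

The lunch update is true and was said in the same channel, but it scores below the threshold and never reaches the model.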

There's also the forgetting problem, and it is underaddressed. As one production benchmark notes: "None of these systems solve the fundamental challenge: deciding what to remember and what to forget." Most systems accumulate. They add and update but rarely prune. Over time, stale facts linger, and a confident agent starts referencing a preference the user changed six months ago.

The honest engineering position right now: memory architecture for agents is genuinely unsolved at the edge cases. The core read/write loop via tool calls is well understood. The judgment layer (what matters, when to forget it, how to answer "what did you know last week?") is still more craft than science.
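As a crude illustration of pruning, the sketch below drops any memory that has not been confirmed within a fixed window. Age-based expiry is only one possible policy and a blunt one; the `last_confirmed` field, the 180-day window, and the sample data are all assumptions for the example.

```python
from datetime import datetime, timedelta


def prune_stale(memories: list[dict], now: datetime, max_age: timedelta) -> list[dict]:
    """Drop memories not confirmed within `max_age`; keep the rest unchanged."""
    return [m for m in memories if now - m["last_confirmed"] <= max_age]


now = datetime(2025, 7, 1)
memories = [
    {"text": "Prefers weekly email digests", "last_confirmed": datetime(2024, 12, 1)},
    {"text": "Prefers async standups", "last_confirmed": datetime(2025, 6, 20)},
]
fresh = prune_stale(memories, now, max_age=timedelta(days=180))
print([m["text"] for m in fresh])
```

A fixed window will also discard old facts that are still true, which is exactly the judgment call no current system makes well.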

Understanding that split is the most useful thing you can carry into building with this stack. The plumbing is reliable. The curation is where the work lives.
