Memory for AI Agents, Short Term, Long Term, and What to Store Where

May 17, 2024 · 7 min read · by Muhammad Amal

TL;DR — Three tiers. A short-term buffer in the context window, an episodic store for prior runs, and a semantic store for distilled facts. Most projects build only one and wonder why their agent forgets.

The first agent I built had a context window and nothing else. Every conversation started from zero. Customers found this charming for about a day, then quietly stopped using it. The fix was memory, but “give the agent memory” turns out to be three different problems wearing one trench coat. Conflating them is the most common architectural mistake I see.

Memory for agents breaks down by lifetime and structure. Short-term memory is what’s in the context window right now, message-shaped, ephemeral. Episodic memory is the record of prior runs, transcript-shaped, retrievable. Semantic memory is the distilled facts the agent should “know,” knowledge-shaped, queryable. Each lives in a different store, has a different cost profile, and answers a different question.

This post is the memory architecture I’d ship for a customer-facing agent in May 2024. Postgres for the durable parts, pgvector for embeddings, and a deliberate strategy for what goes where. No vector database religion, no “just shove everything in a vector store” hand-waving.

Short-term, the context window

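The simplest tier and the one frameworks help with most. LangChain’s ConversationTokenBufferMemory and equivalents handle the basics: keep recent messages, prepend them to each new call, drop the oldest when you hit a token cap. (Plain ConversationBufferMemory keeps everything and never truncates, so it is only safe for short sessions.)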

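from langchain.memory import ConversationTokenBufferMemory
from langchain_openai import ChatOpenAI

# The LLM is passed in so the memory can count tokens with the model's tokenizer.
llm = ChatOpenAI(model="gpt-4-turbo")

memory = ConversationTokenBufferMemory(
    llm=llm,
    max_token_limit=4000,   # oldest messages are dropped once the buffer exceeds this
    return_messages=True,   # return message objects instead of one concatenated string
)

# Each save_context call records one user/assistant exchange.
memory.save_context({"input": "My account ID is 9981."}, {"output": "Got it, 9981."})
memory.save_context({"input": "What was my account ID again?"}, {"output": "9981."})
# Returns the buffered messages under the "history" key.
print(memory.load_memory_variables({}))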

For longer sessions, plain buffer-with-truncation throws away useful context. The fix is summarization. ConversationSummaryBufferMemory keeps recent messages verbatim and summarizes older ones in place. It’s the right default for any agent expected to handle sessions longer than a few minutes.

The tradeoff with summarization is that it costs an LLM call to maintain. Run it on a separate, cheaper model (claude-3-haiku, gpt-3.5-turbo) and only when the buffer crosses your token threshold. Don’t summarize on every turn.
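
The "only when the buffer crosses your token threshold" rule is easy to get wrong, so here is a minimal sketch of it outside any framework. Everything in it is an assumption for illustration: `compact`, `rough_tokens`, the `keep_recent` parameter, and the injected `summarize` callable (which would wrap your cheaper model) are hypothetical names, and the 4-characters-per-token estimate is a stand-in for a real tokenizer.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def rough_tokens(text: str) -> int:
    # Crude estimate (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def compact(
    summary: str,
    messages: list[Message],
    summarize: Callable[[str, list[Message]], str],
    max_tokens: int = 4000,
    keep_recent: int = 6,
) -> tuple[str, list[Message]]:
    """Fold older messages into the summary only when the buffer is over budget."""
    total = rough_tokens(summary) + sum(rough_tokens(m["content"]) for m in messages)
    if total <= max_tokens or len(messages) <= keep_recent:
        return summary, messages  # under threshold: no LLM call this turn

    # Keep the last `keep_recent` messages verbatim, summarize everything older.
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return summarize(summary, older), recent
```

Call `compact` after each turn; on most turns it returns immediately, and the summarizer only runs when the buffer has actually grown past the cap.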

Episodic, the prior-runs store

This is the tier most projects skip and then regret. When a user comes back tomorrow and says “remember when we talked about the refund last week?”, the short-term buffer is empty. You need the transcript of last week’s conversation, retrievable by user and topic.

The shape that works is a Postgres table with the conversation transcript, metadata, and an embedding of a summary. Don’t embed the raw transcript; embed a one-paragraph summary the agent generates at the end of each session.

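-- pgvector must be enabled before the vector type and ivfflat index are available.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE agent_sessions (
    id BIGSERIAL PRIMARY KEY,
    user_id TEXT NOT NULL,
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    ended_at TIMESTAMPTZ,
    transcript JSONB NOT NULL,
    summary TEXT NOT NULL,
    summary_embedding vector(1536) NOT NULL,
    tags TEXT[] DEFAULT '{}'
);
-- Approximate nearest-neighbor index for cosine distance on the summary embedding.
CREATE INDEX ON agent_sessions USING ivfflat (summary_embedding vector_cosine_ops);
-- B-tree for "last N sessions for this user".
CREATE INDEX ON agent_sessions (user_id, started_at DESC);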

At the start of a new session, you fetch the last N sessions for the user (cheap, B-tree on user_id, started_at) and the top K by cosine similarity to whatever the user just said (pgvector with the ivfflat index). You feed the summaries, not the transcripts, into the context. The transcripts are there for the rare case where the agent needs to drill in.

import os

import psycopg
from openai import OpenAI
from psycopg.rows import dict_row

client = OpenAI()
# Assumption: the connection string comes from a DATABASE_URL environment variable.
DSN = os.environ["DATABASE_URL"]

def embed(text: str) -> list[float]:
    # text-embedding-3-small returns 1536 dimensions by default, matching vector(1536).
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def fetch_relevant_sessions(user_id: str, query: str, k: int = 3) -> list[dict]:
    q_emb = embed(query)
    with psycopg.connect(DSN) as conn, conn.cursor(row_factory=dict_row) as cur:
        # <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
        cur.execute(
            """
            SELECT id, summary, started_at,
                   1 - (summary_embedding <=> %s::vector) AS similarity
            FROM agent_sessions
            WHERE user_id = %s AND ended_at IS NOT NULL
            ORDER BY summary_embedding <=> %s::vector
            LIMIT %s
            """,
            (q_emb, user_id, q_emb, k),
        )
        return cur.fetchall()
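
The retrieval function covers the similarity lookup; the "last N sessions" lookup and the step that combines the two are left implicit. A rough sketch of both follows. `RECENT_SQL` uses only columns from the schema in this post, while `merge_sessions`, its `limit` parameter, and the choice to rank similarity hits ahead of recency hits are my assumptions, not a fixed rule.

```python
# Recency lookup: served by the (user_id, started_at DESC) B-tree index.
RECENT_SQL = """
    SELECT id, summary, started_at
    FROM agent_sessions
    WHERE user_id = %s AND ended_at IS NOT NULL
    ORDER BY started_at DESC
    LIMIT %s
"""


def merge_sessions(recent: list[dict], similar: list[dict], limit: int = 4) -> list[dict]:
    """Combine both result sets, dropping duplicates by id, similarity hits first."""
    seen: set[int] = set()
    merged: list[dict] = []
    for row in [*similar, *recent]:
        if row["id"] in seen:
            continue  # the same session often shows up in both lists
        seen.add(row["id"])
        merged.append(row)
    return merged[:limit]
```

Deduplicating matters more than it looks: for an active user, the most recent session is frequently also the most similar one, and you don't want its summary in the prompt twice.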

The detail to get right is the summary itself. It needs to be written from the agent’s perspective and include both what happened and what the user wanted. “User asked about refund for order ord_abc1234567, approved by agent on May 12 for $42, reason: damaged item.” That’s a useful summary. “Discussed refund.” is a useless one.
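
One way to make that summary reliable is to build it in a single session-close step that refuses to finish without one. This is a sketch under assumptions: `SUMMARY_INSTRUCTIONS`, `build_summary_prompt`, and `close_session` are hypothetical names, and `summarize` and `embed` are injected callables standing in for your LLM call and your embedding call.

```python
import json
from typing import Callable

# Hypothetical instruction text; tune the wording for your own domain.
SUMMARY_INSTRUCTIONS = (
    "Summarize this support session in one paragraph, from the agent's perspective. "
    "State what the user wanted, what was done, and every concrete identifier "
    "(order IDs, amounts, dates, reasons)."
)


def build_summary_prompt(transcript: list[dict]) -> str:
    lines = [f"{m['role']}: {m['content']}" for m in transcript]
    return SUMMARY_INSTRUCTIONS + "\n\n" + "\n".join(lines)


def close_session(
    transcript: list[dict],
    summarize: Callable[[str], str],
    embed: Callable[[str], list[float]],
) -> dict:
    """Produce the values needed for an agent_sessions row at session end."""
    summary = summarize(build_summary_prompt(transcript)).strip()
    if not summary:
        # No summary means no episodic memory later, so fail loudly here.
        raise ValueError("empty summary; refusing to close session without one")
    return {
        "transcript": json.dumps(transcript),
        "summary": summary,
        "summary_embedding": embed(summary),  # embed the summary, not the transcript
    }
```

The returned dict maps onto the `transcript`, `summary`, and `summary_embedding` columns; the insert itself is left out to keep the sketch database-free.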

Semantic, the facts store

Semantic memory is what the agent “knows” about a user, an account, a product. It’s not a conversation; it’s structured (or semi-structured) data the agent retrieves on demand. User preferences, account flags, recent activity, product specs.

The mistake I see here is putting this into the vector store. Don’t. Vector retrieval is for fuzzy semantic matches. Structured facts belong in a relational table with proper keys.

CREATE TABLE user_facts (
    user_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value JSONB NOT NULL,
    source TEXT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, key)
);

The agent doesn’t query this with embeddings. It queries with a tool call. “Look up fact timezone for user 9981.” That’s a function call, not a vector search. The model is good at deciding when to call that tool; you don’t need to retrieve facts proactively for every turn.
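
A fact-lookup tool can be sketched roughly as below. The tool definition follows the OpenAI function-calling layout; everything else is an assumption for illustration: the `lookup_fact` and `dispatch_tool_call` names, the in-memory `FACTS` dict standing in for the `user_facts` table, and the example timezone value. One deliberate choice here is that the model supplies only the `key`, while `user_id` is bound from the session on the server side, so the model cannot ask for another user's facts.

```python
import json

# Stand-in for the user_facts table, keyed like its primary key (user_id, key).
# The row below is made-up example data.
FACTS = {("9981", "timezone"): {"value": "Europe/Berlin", "source": "signup_form"}}

# Tool definition in the OpenAI function-calling layout.
LOOKUP_FACT_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_fact",
        "description": "Look up one stored fact about the current user, e.g. timezone.",
        "parameters": {
            "type": "object",
            "properties": {"key": {"type": "string", "description": "Fact name"}},
            "required": ["key"],
        },
    },
}


def lookup_fact(user_id: str, key: str) -> str:
    # In production this would be a primary-key SELECT against user_facts.
    row = FACTS.get((user_id, key))
    if row is None:
        return json.dumps({"found": False, "key": key})
    return json.dumps({"found": True, "key": key, **row})


def dispatch_tool_call(user_id: str, name: str, arguments: str) -> str:
    """Route a model tool call; `arguments` is the JSON string the model produced."""
    if name != "lookup_fact":
        return json.dumps({"error": f"unknown tool {name}"})
    args = json.loads(arguments)
    return lookup_fact(user_id, args["key"])
```

Returning an explicit `"found": false` rather than an empty string gives the model something concrete to act on, such as asking the user for the missing value.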

The exception is a small “user profile” blob that gets loaded into the system prompt unconditionally. Name, account tier, timezone, language preference. Cheap, useful, always relevant. Anything beyond that should be retrieved on demand.

Putting the tiers together

A request flow that uses all three.

# fetch_profile, load_short_term, save_short_term, and call_agent are assumed to be
# defined elsewhere; fetch_relevant_sessions is the episodic lookup shown earlier.
def handle_message(user_id: str, message: str) -> str:
    profile = fetch_profile(user_id)                            # semantic: always-on profile blob
    sessions = fetch_relevant_sessions(user_id, message, k=2)   # episodic: prior-session summaries
    system = build_system_prompt(profile, sessions)
    short_term = load_short_term(user_id)                       # short-term: recent messages
    messages = [{"role": "system", "content": system}] + short_term + [
        {"role": "user", "content": message}
    ]
    response = call_agent(messages)
    save_short_term(user_id, message, response)
    return response

def build_system_prompt(profile: dict, sessions: list[dict]) -> str:
    parts = [
        "You are a customer support agent.",
        f"User profile: {profile['name']}, tier {profile['tier']}, timezone {profile['timezone']}.",
    ]
    if sessions:
        parts.append("Relevant prior sessions:")
        # Summaries only; full transcripts stay in Postgres.
        parts.extend(f"- {s['started_at']:%Y-%m-%d}: {s['summary']}" for s in sessions)
    return "\n".join(parts)

The full system prompt is around 200 tokens of profile and 300 tokens of prior session summaries. Affordable, useful, doesn’t bloat the context window. The agent has access to lookup tools for anything else.
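
The 300-token figure for session summaries only holds if something enforces it. A minimal sketch of a budget check, assuming the summaries arrive ordered most relevant first; `fit_to_budget` and `rough_tokens` are hypothetical helpers, and the 4-characters-per-token estimate should be replaced with your model's tokenizer.

```python
def rough_tokens(text: str) -> int:
    # Crude estimate (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def fit_to_budget(summaries: list[str], budget: int = 300) -> list[str]:
    """Keep summaries in order until the next one would exceed the token budget."""
    kept: list[str] = []
    used = 0
    for summary in summaries:
        cost = rough_tokens(summary)
        if used + cost > budget:
            break  # stop rather than skip, so ordering by relevance is preserved
        kept.append(summary)
        used += cost
    return kept
```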

For more on how memory interacts with the agent loop itself, see /blog/react-reflexion-planner-executor-agent-loops/. The pgvector project’s README covers indexing tradeoffs in more depth if you’re new to it; ivfflat is fine for the volumes most agent projects see.

Common Pitfalls

The patterns that look right until they’re under load.

Embedding raw transcripts. Wastes storage, makes retrieval noisy. Embed summaries.
Treating the vector store as the system of record. It’s an index. Your durable data lives in Postgres rows.
Loading prior sessions into context unconditionally. You’ll blow your context budget for sessions that aren’t relevant. Always filter by similarity threshold.
Forgetting to write the summary at session end. No summary, no episodic memory tomorrow. Make it part of the session-close flow, not optional.
Using a small embedding dimension to save cost. Hosted embedding APIs such as OpenAI’s bill per input token, not per output dimension, so a smaller vector saves storage but not API spend. Use a model that gives you 1536 or higher; the recall difference is real.
Letting facts go stale. A user_facts row from six months ago about a discontinued product is worse than no fact. Add a TTL or a refresh strategy.
Storing PII in embeddings. You can’t easily delete a specific user’s data from an ivfflat index in place. Plan for deletion from day one.

Wrapping Up

The three-tier model isn’t novel and isn’t mine. It’s the loose consensus that emerged from agent teams in 2023 and that the better frameworks now reflect. The reason it works is that it maps to the actual shape of agent needs. Recent messages now. Prior conversations sometimes. Structured facts on demand. Each tier has a different cost, different access pattern, different consistency requirement.

The mistake I keep flagging in code reviews is teams picking one tier and trying to stretch it across all three responsibilities. A buffer can’t do episodic. A vector store can’t do structured facts. A facts table can’t do conversation history. Build all three or be honest that your agent doesn’t really remember anything.

The good news is that the storage technologies for all three are boring and well-understood. Postgres for the durable bits. Redis or in-memory for the buffer. Embedding model of your choice. Nothing here requires a managed vector platform or an exotic database. The infrastructure is the easy part; deciding what to write where is the work.
