Code Walkthrough: AI Agent Memory Patterns

Python implementation walkthrough: All four memory strategies are implemented in Python with the same Azure Cosmos DB partition design and metrics interface, covering Sliding Window, Hierarchical 3-Tier, Entity Graph, and a Direct LLM baseline.
Identity and numeric anchoring: Sliding Window uses regex pattern matching to extract critical facts like identity, budgets, and dates, then stores them separately so they are never lost when older messages get compressed into a rolling summary.
Hierarchical storage tiers: Hierarchical 3-tier memory stores recent messages verbatim, mid-range history as compressed summaries, and the oldest content as distilled facts, each in a separately queryable tier in Azure Cosmos DB so performance stays consistent as sessions grow.
Entity extraction and merging: Entity Graph runs LLM-based structured JSON extraction plus regex hints on each user message, merges new facts with existing entity documents in Cosmos DB, and regenerates embeddings only when the searchText changes.
Adaptive retrieval modes: The Entity Graph retrieval engine supports vector-only, full-text-only, and Hybrid RRF modes in Azure Cosmos DB for NoSQL with adaptive top-K and a CONTAINS-based lexical fallback, providing reliable recall across all query types.
Production extraction reliability: A four-attempt JSON extraction fallback chain doubles token budgets on retry and switches to a fallback model on failure, ensuring reliable entity extraction for production AI agent deployments.
Benchmark companion: This implementation walkthrough is the companion to AI Agent Memory Patterns, which benchmarks all four strategies on recall, token cost, and latency against a live Cazton seed dataset.

A guided walkthrough of the four memory strategy implementations. Each section covers the actual Python behind one strategy: the Direct LLM baseline, Sliding Window, Hierarchical 3-Tier, and Entity Graph retrieval, all built on Azure Cosmos DB for NoSQL. This article is the implementation companion to AI Agent Memory Patterns, which benchmarks all four strategies on recall, token cost, and latency against a live Cazton seed dataset.

Before we look at any strategy, we need to understand how data moves in and out of Cosmos DB. All four strategies share the same Pydantic models; the three stateful strategies share the same message-document layer, while direct_llm.py only persists session metadata.

Every chat request specifies a strategy. Every response returns standardized metrics: token counts, latency, and how many turns are stored versus sent. That shared shape is what lets us compare strategies apples to apples.

Python - models.py

class Strategy(str, Enum):
    direct_llm     = "direct_llm"
    sliding_window = "sliding_window"
    hierarchical   = "hierarchical"
    entity_graph   = "entity_graph"

class Metrics(BaseModel):
    prompt_tokens:      int
    completion_tokens:  int
    total_tokens:       int
    latency_ms:         float
    memory_turns_stored: int
    context_turns_sent:  int

Why this matters: Every strategy returns the same Metrics object. The memory_turns_stored vs. context_turns_sent gap reveals compression efficiency: Entity Graph might store 60 turns but only send the last 6 recent messages plus targeted system context. The demo UI surfaces these numbers after every message.
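To make the stored-versus-sent gap concrete, here is a minimal sketch. It uses a plain dataclass as a stand-in for the Pydantic Metrics model, and context_ratio is a hypothetical helper that does not appear in the walkthrough's code. The token and latency values are zero placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class MetricsSketch:
    """Stand-in for the Pydantic Metrics model (same field names)."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    latency_ms: float
    memory_turns_stored: int
    context_turns_sent: int


def context_ratio(m: MetricsSketch) -> float:
    """Fraction of stored turns that were sent verbatim (hypothetical helper)."""
    if m.memory_turns_stored == 0:
        return 0.0  # nothing stored, so there is nothing to compress
    return m.context_turns_sent / m.memory_turns_stored


# The example from the text: 60 turns stored, 6 recent messages sent.
m = MetricsSketch(
    prompt_tokens=0, completion_tokens=0, total_tokens=0, latency_ms=0.0,
    memory_turns_stored=60, context_turns_sent=6,
)
ratio = context_ratio(m)  # one tenth of the stored turns travel verbatim
```

A lower ratio means more of the history is being carried as summaries, facts, or retrieved context rather than as raw messages.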

The three stateful strategies store messages as individual documents with sequence numbers. This allows efficient range queries such as "give me messages 20 through 30", which is critical for Sliding Window, Hierarchical, and Entity Graph.

Python - cosmos_messages.py

async def upsert_message(container, *, session_id, seq, role, content, ts, ttl_s=None):
    # One document per message; id is unique within the session_id partition.
    doc = {
        "id":         f"msg:{seq}",
        "session_id": session_id,
        "doc_type":   "msg",
        "seq":        int(seq),
        "role":       role,
        "content":    content,
        "ts":         ts,            # persist the timestamp that was passed in
    }
    if ttl_s is not None:
        doc["ttl"] = int(ttl_s)      # Cosmos auto-deletes after this many seconds
    await container.upsert_item(doc)

Design Decision: TTL on Messages. Each message can carry a ttl field. Cosmos DB natively deletes expired documents without requiring cron jobs or cleanup scripts. Sliding Window uses a 1-hour TTL; the Hierarchical and Entity Graph strategies use 6 hours. This keeps storage lean without application-level garbage collection.
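The per-strategy TTLs above can be sketched as a small lookup. The TTL_SECONDS table and build_message_doc helper are illustrative names, not functions from the walkthrough; only the 1-hour and 6-hour values come from the text.

```python
from typing import Optional

# TTLs stated in the text: 1 hour for Sliding Window, 6 hours for the other two.
TTL_SECONDS = {
    "sliding_window": 1 * 60 * 60,
    "hierarchical": 6 * 60 * 60,
    "entity_graph": 6 * 60 * 60,
}


def build_message_doc(session_id: str, seq: int, role: str, content: str,
                      strategy: str) -> dict:
    """Build a message document, attaching ttl only when the strategy defines one."""
    doc = {
        "id": f"msg:{seq}",
        "session_id": session_id,
        "doc_type": "msg",
        "seq": int(seq),
        "role": role,
        "content": content,
    }
    ttl_s: Optional[int] = TTL_SECONDS.get(strategy)
    if ttl_s is not None:
        doc["ttl"] = ttl_s  # seconds until Cosmos DB expires the document
    return doc


doc = build_message_doc("s1", 3, "user", "hello", "sliding_window")
```

Note that a per-item ttl only takes effect when time to live is enabled on the container itself; otherwise Cosmos DB ignores the field.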

Reading messages back uses Cosmos DB SQL queries scoped to a single partition (session_id). There are two access patterns:

Python - cosmos_messages.py

# Pattern 1: "Give me the most recent N messages"
async def read_recent_messages(container, *, session_id, limit):
    query = (
        # int() keeps a non-numeric limit from being interpolated into the query
        f"SELECT TOP {int(limit)} c.seq, c.role, c.content FROM c "
        "WHERE c.session_id = @sid AND c.doc_type = 'msg' "
        "ORDER BY c.seq DESC"          # newest first, then reverse
    )
    ...

# Pattern 2: "Give me messages with start_seq <= seq < end_seq" (end is exclusive)
async def read_messages_by_seq_range(container, *, session_id, start_seq, end_seq):
    query = (
        "SELECT c.seq, c.role, c.content FROM c "
        "WHERE c.session_id = @sid AND c.doc_type = 'msg' "
        "AND c.seq >= @start AND c.seq < @end "
        "ORDER BY c.seq ASC"
    )

Partition key = session_id. Every hot-path query is scoped to a single logical partition, which avoids cross-partition scans. The three stateful strategies share this message layer; direct_llm.py keeps only session metadata.
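The two access patterns can be emulated in memory to show what each query returns. This is only a sketch of the query semantics over a list of dicts: the function names are hypothetical and no Cosmos DB SDK call is involved.

```python
def recent_messages(docs, session_id, limit):
    """Emulate SELECT TOP n ... ORDER BY c.seq DESC, then reverse to chronological."""
    msgs = [d for d in docs
            if d["session_id"] == session_id and d["doc_type"] == "msg"]
    newest_first = sorted(msgs, key=lambda d: d["seq"], reverse=True)[:limit]
    return list(reversed(newest_first))


def messages_by_seq_range(docs, session_id, start_seq, end_seq):
    """Emulate c.seq >= @start AND c.seq < @end ORDER BY c.seq ASC."""
    msgs = [d for d in docs
            if d["session_id"] == session_id and d["doc_type"] == "msg"
            and start_seq <= d["seq"] < end_seq]  # end_seq is exclusive
    return sorted(msgs, key=lambda d: d["seq"])


# Forty messages in session s1, plus one in s2 that must never leak across.
docs = [
    {"session_id": "s1", "doc_type": "msg", "seq": i, "role": "user",
     "content": f"m{i}"}
    for i in range(40)
]
docs.append({"session_id": "s2", "doc_type": "msg", "seq": 25, "role": "user",
             "content": "other session"})

last_three = [d["seq"] for d in recent_messages(docs, "s1", 3)]
window = [d["seq"] for d in messages_by_seq_range(docs, "s1", 20, 30)]
```

Because the upper bound is exclusive, a caller that wants message 30 included would pass end_seq=31.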

The simplest strategy: no memory at all. Each turn sends only the system prompt and the current user message. This is the control group we measure everything else against.

Python - direct_llm.py

async def chat(sessions_container, openai_client, session_id, user_message, ...):
    session = await read_or_default(sessions_container, session_id, default={...})
    session["turn_count"] += 1

    messages = [
        {"role": "system",  "content": settings.system_prompt},
        {"role": "user",    "content": user_message},    # That's it: only the current user turn, no history.
    ]

    reply, usage, latency_ms = await create_chat_completion(openai_client, model=..., messages=messages)
    await upsert_item(sessions_container, session)     # Save turn_count only
    return reply, Metrics(context_turns_sent=1, ...)

Captured Response Excerpt

Q: "What are the three headline offerings?"
A: "I don't have access to the Cazton homepage title in this chat, so I can't extract
   the three headline offerings from it."

93 prompt tokens in this captured run | no prior-turn recall

The file is 83 lines long, with no history and no retrieval. The LLM cannot answer recall questions because it never sees prior messages. In this captured Cazton round, prompt_tokens was 93; everything a memory strategy adds above that baseline is the cost of memory.
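The "cost of memory" arithmetic can be sketched as follows. The helper name is hypothetical, and the 1,000-token figure is an illustrative input rather than a measured result; only the 93-token baseline comes from the captured run, and it applies to that specific question.

```python
BASELINE_PROMPT_TOKENS = 93  # Direct LLM prompt_tokens in the captured run


def memory_overhead(prompt_tokens: int,
                    baseline: int = BASELINE_PROMPT_TOKENS) -> int:
    """Prompt tokens spent on memory context beyond the no-memory baseline."""
    return max(prompt_tokens - baseline, 0)


# Illustrative only: a stateful strategy that reports 1,000 prompt tokens
# for the same question is spending 907 of them on memory.
overhead = memory_overhead(1000)
```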

Keep the last 30 messages verbatim. When older messages fall off the window, summarize them into a rolling snapshot. Anchor critical facts (identity, budgets, tech stack) via regex so they survive summarization.