AI Agent Memory Patterns

  • Entity Graph recall: In this captured Cazton run, Entity Graph achieved 100% recall (10/10) by storing structured entity documents with vector embeddings in Azure Cosmos DB for NoSQL. Sliding Window reached only 60% recall, at roughly one-third fewer prompt tokens.
  • Four strategies benchmarked: Direct LLM, Sliding Window, Hierarchical 3-tier, and Entity Graph were measured on recall rate, average prompt tokens, and latency using a 60-message Cazton seed dataset.
  • Summary compression risk: Summary-based strategies lose specific identifiers such as URLs and numeric values under compression; Entity Graph preserves them as queryable structured documents that survive any conversation length.
  • Memory reduces latency: Memory-backed strategies responded faster than Direct LLM in this run because structured context produces concise, confident answers instead of lengthy elaborations on missing information.
  • Zero-migration upgrade path: All four strategies share the same Azure Cosmos DB partition key and OpenAI integration model, so upgrading from Sliding Window to Entity Graph is an application-layer change with no infrastructure migration required.
  • Advanced Cosmos DB capabilities: Azure Cosmos DB for NoSQL offers DiskANN vector indexes, float16 storage, filtered vector search, Hybrid RRF, and hierarchical partition keys, each of which can improve retrieval quality or reduce cost at scale.
  • Implementation companion: For the complete Python and .NET code behind every strategy in this benchmark, see the companion article Code Walkthrough: AI Agent Memory Patterns.
  • Top clients: We help Fortune 500, large, mid-size, and startup companies with AI development, consulting, and hands-on training services. Our clients include Microsoft, Google, Broadcom, Thomson Reuters, Bank of America, Macquarie, Dell, and more.
 

Executive Summary

A practical comparison of four state-management patterns for multi-turn AI workflows, ranging from zero memory to structured entity graphs, using one captured run on a Cazton seed dataset.

  • 100% - Entity Graph Recall (10/10 questions)
  • 60% - Sliding Window Recall (6/10 questions)
  • 1,100 - Lowest Average Prompt Tokens (memory-enabled strategy)

In this captured run, summary-based strategies dropped some specific identifiers (URLs, names, numbers) in longer conversations. Treat this as directional evidence, not a universal rule.

  • Highest recall in this run: Entity Graph - 10/10 passes in the Cazton run log shown below.
  • Lowest average prompt cost (memory-enabled): Sliding Window - about 33% fewer prompt tokens than Entity Graph in this run, with 6/10 passes.
  • Recommendation: Start with Sliding Window for short conversations (under 30 turns). Switch to Entity Graph when high identifier-level recall matters; in this implementation, the upgrade is an application-layer change, not an infrastructure migration.
 

Introduction

Large Language Models are stateless by design. Every API call starts with a blank slate. Without persistent memory, your AI assistant forgets everything the moment a conversation exceeds its context window, or even between turns.

This article presents a scoped comparison of four memory architectures for multi-turn AI agents, using Azure Cosmos DB for NoSQL and OpenAI gpt-5.4. We use a Cazton seed dataset (60 factual messages) plus a single captured run log. Results show how each strategy behaved in this environment; validate against your own workload before standardizing. For the complete Python and .NET implementation of every strategy covered here, see the companion article Code Walkthrough: AI Agent Memory Patterns.

The Core Question

If you get the same correct answer from three different memory strategies, does it matter which one you use? Absolutely. The difference lies in token cost, latency, scalability, and what actually happens as conversations grow beyond 30, 50, or 100 turns.

  • 4 Memory Strategies Compared
  • 60 Cazton Seed Messages
  • 10 Recall Questions Tested
  • 100% Entity Graph Recall Rate (this run)
 

The Four Strategies

Each strategy represents a different philosophy for managing conversational state. They range from "no memory at all" to "structured knowledge graph with vector retrieval."

Strategy 0 - Direct LLM (No Memory)

The baseline. Each turn sends only the system prompt and the current user message. Zero history, zero recall.

  • Stored in Cosmos: Session metadata only (turn count, timestamp)
  • Context sent: System prompt + 1 message
  • Prompt tokens: ~90 per turn
  • Recall ability: None
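As a quick sketch of what this baseline actually sends, assuming an OpenAI-style chat message list (the function name is illustrative, not taken from the benchmark code):

```python
def build_direct_llm_request(system_prompt: str, user_message: str) -> list[dict]:
    """Strategy 0: no history. Every call carries only the system prompt and
    the current user turn, so nothing said earlier can ever be recalled."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = build_direct_llm_request(
    "You are a helpful assistant.",
    "What are Cazton's three headline offerings?",
)
```

Because the payload never grows, prompt usage stays near the ~90 tokens per turn noted above.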

Strategy 1 - Sliding Window

Keeps the last 30 messages verbatim plus a rolling summary of older messages. Extracts identity and numeric anchors via regex.

  • Window: 30 messages + rolling summary
  • TTL: 1 hour
  • Summarizer: gpt-5.4
  • Weakness: Summary drift loses specific details (URLs, exact names)
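A minimal sketch of the two mechanics above, with an in-memory message list standing in for the Cosmos DB container; the regexes and function names are illustrative assumptions, not the benchmark's actual code:

```python
import re

WINDOW = 30  # verbatim messages kept, per the strategy description

URL_RE = re.compile(r"https?://\S+|/[a-z][a-z0-9-]+")
NUM_RE = re.compile(r"\b\d[\d,.]*\b")

def extract_anchors(text: str) -> list[str]:
    """Regex pass that pulls URLs/paths and numbers out of a message so the
    rolling summary can retain them even when prose is compressed."""
    return URL_RE.findall(text) + NUM_RE.findall(text)

def assemble_context(messages: list[str], rolling_summary: str) -> list[str]:
    """Context = rolling summary of older turns + last WINDOW messages verbatim."""
    context = []
    if rolling_summary:
        context.append(f"Summary of earlier conversation: {rolling_summary}")
    context.extend(messages[-WINDOW:])
    return context
```

The anchors help, but anything the summarizer drops that the regexes did not capture is gone, which is exactly the failure mode shown later in this article.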

Strategy 2 - Hierarchical (3-Tier)

Three memory tiers: 10 recent messages (hot), up to 4 summary blocks (warm), and a persistent facts document (cold).

  • Tier 1: Last 10 verbatim messages
  • Tier 2: Summary blocks (every 10 messages, max 4)
  • Tier 3: Extracted bullet-point facts (persistent)
  • Weakness: Still summarizes, which can lose some precision
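The three tiers can be sketched as one context-assembly function, again with plain Python lists standing in for Cosmos DB documents (constants and names are illustrative):

```python
HOT_WINDOW = 10     # Tier 1: verbatim recent messages
MAX_BLOCKS = 4      # Tier 2: cap on retained summary blocks

def assemble_hierarchical_context(
    messages: list[str],
    summaries: list[str],   # oldest-first Tier 2 summary blocks
    facts: list[str],       # Tier 3 persistent bullet-point facts
) -> str:
    parts = []
    if facts:
        # Cold tier: extracted facts persist for the life of the session.
        parts.append("Known facts:\n" + "\n".join(f"- {f}" for f in facts))
    for s in summaries[-MAX_BLOCKS:]:     # warm tier: at most 4 summaries
        parts.append(f"Earlier: {s}")
    parts.extend(messages[-HOT_WINDOW:])  # hot tier: verbatim recency
    return "\n\n".join(parts)
```

The persistent facts tier is why Hierarchical outperforms Sliding Window in the fallback rounds below, while the remaining reliance on summaries explains its residual failures.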

Strategy 3 - Entity Graph

Stores structured entity documents with facts, relationships, and vector embeddings. Retrieves relevant entities per query using Cosmos DB vector search.

  • Recent: Last 3 turn pairs (6 messages)
  • Entities: Normalized docs with facts + embeddings
  • Retrieval: Vector + lexical fallback (full-text search and Hybrid RRF were not enabled in this run)
  • Strength: High precision for specific facts, URLs, and relationships in this run
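A sketch of the per-query retrieval step, with client-side cosine similarity standing in for Cosmos DB's VectorDistance and a substring check standing in for the lexical CONTAINS fallback (the structure is assumed from the description above, not copied from the benchmark code):

```python
import math

TOP_K = 6  # entities retrieved per query in this run

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_entities(query_vec: list[float], query_text: str, entities: list[dict]) -> list[dict]:
    """Vector top-K over entity embeddings, with a lexical fallback when
    vector search returns nothing."""
    scored = sorted(
        (e for e in entities if e.get("embedding")),
        key=lambda e: cosine(query_vec, e["embedding"]),
        reverse=True,
    )[:TOP_K]
    if scored:
        return scored
    q = query_text.lower()
    return [e for e in entities if q in e.get("searchText", "").lower()][:TOP_K]
```

The retrieved entities' fact bullets are then prepended to the last six messages, which is why specific identifiers survive regardless of how long the conversation runs.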

How They Work - Architecture Comparison

The following table summarizes how each strategy stores data in Azure Cosmos DB for NoSQL and what it sends to the LLM as context:

Strategy Context Sent to LLM Storage in Cosmos Avg Prompt Tokens Recall (This Run)
Direct LLM System prompt + 1 message Session metadata only ~90 0%
Sliding Window Summary + last 30 messages Messages + rolling summary ~1,100 60%
Hierarchical 3 tiers + anchors Tier 1 messages + Tier 2 summaries + Tier 3 facts ~1,370 80%
Entity Graph User identity + entity bullets + last 6 messages Entity docs + embeddings ~1,660 100%

Each strategy stores progressively richer data in Cosmos DB, trading token cost for recall precision.

Diagram: Memory Architecture Flow comparing Direct LLM, Sliding Window, Hierarchical, and Entity Graph storage and recall

 

Same Result, Different Cost

In the first five core recall questions of this run, all three memory strategies (Sliding Window, Hierarchical, Entity Graph) returned the expected answer. Cost still varied materially.

Example 1 - What Are the Three Headline Offerings?

After ingesting 60 Cazton fact messages, we ask a basic recall question. All three memory strategies returned the expected answer, with different token and latency profiles:

Strategy Status Prompt Tokens Total Tokens Latency Answer
Direct LLM BASELINE 93 149 1,743 ms "I don't have access..."
Sliding Window PASS 1,335 1,360 826 ms Consulting, Training, Recruiting
Hierarchical PASS 1,307 1,351 1,012 ms Consulting, Training, Recruiting
Entity Graph PASS 1,773 1,817 1,058 ms Consulting, Training, Recruiting

Key Insight: For this question, Sliding Window used 25% fewer prompt tokens than Entity Graph and returned 22% faster. When the answer is in recent context, simpler context assembly can be cheaper.

All Five Core Rounds - Cazton Dataset

Here is the complete picture across all five core recall questions:

Question Direct LLM Sliding Window Hierarchical Entity Graph
Three headline offerings? BASELINE PASS PASS PASS
Leadership profile URL? BASELINE PASS PASS PASS
Cosmos video & ebook slugs? BASELINE PASS PASS PASS
Contact & workshops paths? BASELINE PASS PASS PASS
Canonical URL & OG image? BASELINE PASS PASS PASS

Average Prompt Tokens per Question (Core Rounds)

Strategy Avg Prompt Tokens (Core Rounds)
Direct LLM 90
Sliding Window 1,140
Hierarchical 1,362
Entity Graph 1,786

Bar chart: Average Prompt Tokens per Question by strategy

Average Latency per Question (Core Rounds)

Strategy Avg Latency (ms)
Direct LLM 1,718
Sliding Window 1,068
Hierarchical 1,243
Entity Graph 1,116

Bar chart: Average Latency per Question by strategy

Surprising Finding: Direct LLM has the highest latency despite sending the fewest tokens. Why? Because it sends no context, the model spends more time generating a longer "I don't know" response with suggestions. Memory-backed strategies produce concise, confident answers faster.

 

When Sliding Window Fails

The sliding window strategy compresses older messages into a rolling summary. When specific details such as URLs, exact names, and niche facts fall outside the 30-message window, they get summarized away.

Example - List Related URLs for the Leadership Profile

We ask for URLs related to https://cazton.com/about/chander-dhall. The consulting URL (/consulting) was mentioned early in the conversation and has since been summarized out of the sliding window.

Strategy Result Response Notes
Sliding Window FAIL "Related Cazton URLs we captured (in the same 'About/Contact' area) are: about-us, contact-us" Missing: /consulting URL lost during summary compression
Hierarchical PASS "About Us, Homepage, Contact, consulting"  
Entity Graph PASS "homepage, about-us, consulting, trainings, workshops, products, ebooks, presentations, videos, contact-us" 10 URLs recalled - most comprehensive answer

Example - Profile-Linked URLs Plus Videos and Presentations

Strategy Status Prompt Tokens Latency Missing Detail
Sliding Window FAIL 1,115 1,471 ms Missing /presentations URL
Hierarchical PASS 1,448 1,478 ms  
Entity Graph PASS 1,455 1,306 ms  

Why Sliding Window Fails: The rolling summary is designed to be concise. When the summarizer compresses 30+ messages into a paragraph, it prioritizes the gist over specific URLs like /presentations. The information existed but was lost in compression.

 

When Both Sliding Window and Hierarchical Fail

This is where Entity Graph truly shines. Some questions require precise recall of specific entity relationships, exactly the kind of detail that tends to be lost in summarization.

Example - Return Profile-Linked URLs Including Products and Non-Page Asset

This question requires three specific pieces: (1) the leadership profile URL, (2) the products page URL, and (3) a non-page asset URL like the OG image. Both Sliding Window and Hierarchical fail because they cannot reliably surface the /products URL from their compressed memories.

Strategy Status Prompt Tokens Latency What Failed
Direct LLM BASELINE 96-101 2,996-3,481 ms No memory at all
Sliding Window FAIL 1,072-1,159 1,232-1,445 ms Missing /products + graph context
Hierarchical FAIL 1,394-1,513 1,205-1,683 ms Missing /products + graph context
Entity Graph PASS 1,409-1,505 1,111-1,229 ms  

Entity Graph Response (Correct)

// From our stored graph context:
1) Leadership profile URL: https://cazton.com/about/chander-dhall
2) Products URL: https://cazton.com/products
3) Non-page asset URL: https://cazton.com/images/common/cazton-cover.webp

The Default Dataset Tells the Same Story

Using the FakeCompany/Placeholder fictional dataset (60-turn conversation), the divergence is even more dramatic:

Question Sliding Window Hierarchical Entity Graph
Project repository URL? FAIL FAIL PASS
Project deadline? FAIL PASS PASS
Team lead + preferred language? FAIL PASS PASS
Deadline + sponsor name? FAIL PASS PASS
Four main frontend views? FAIL FAIL PASS

Entity Graph - Highest Recall in This Run: In the 10 Cazton rounds shown here, Entity Graph passed all recall checks. Because it stores structured entities instead of relying only on compressed summaries, it reduced summary-loss errors in this run.

 

Comprehensive Metrics Dashboard

Here is every metric across all 10 Cazton rounds.

Complete Scorecard - All 10 Rounds

Pass/fail results across all 10 Cazton rounds (R1-R5 = core recall questions; FB1-FB5 = fallback/harder rounds):

  • Direct LLM: 0/10 - no memory, all rounds scored as baseline
  • Sliding Window: 6/10 - passed R1, R2, R3, R4, R5, FB3; failed FB1, FB2, FB4, FB5
  • Hierarchical: 8/10 - passed R1, R2, R3, R4, R5, FB1, FB3, FB4; failed FB2, FB5
  • Entity Graph: 10/10 - passed all rounds

Pass/fail scorecard across 10 Cazton rounds for all four strategies

Strategy Comparison Summary

Metric Direct LLM Sliding Window Hierarchical Entity Graph
Recall Rate (Cazton) 0% 60% 80% 100%
Avg Prompt Tokens 92 1,100 1,397 1,656
Avg Latency (ms) 2,505 1,108 1,343 1,258
Context Window 1 message 30 messages 10 + summaries 6 + entities
Message TTL None 1 hour 6 hours 6 hours
LLM Calls per Turn 1 1-2 1-2 2-3
Cosmos Containers 1 1 1 2
Vector Search No No No Yes
 

Cosmos DB Architecture

All four strategies share a common foundation: Azure Cosmos DB for NoSQL with partition key /session_id. The difference is in what they store and how they query.

Container Strategy Document Types Special Indexes
direct_llm_sessions Direct LLM session None
sliding_window_sessions Sliding Window session, msg None
hierarchical_sessions Hierarchical session, msg, tier2_summary, tier3_facts None
entity_graph_sessions Entity Graph session, msg None
entity_graph_entities Entity Graph entity (with embedding) DiskANN vector index on /embedding
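The DiskANN index on entity_graph_entities implies that the container was created with a vector embedding policy and a matching vector index entry. The shapes below follow Azure Cosmos DB for NoSQL's vector search policies; treat the exact values as a sketch rather than the benchmark's actual configuration:

```python
# Vector embedding policy: tells Cosmos DB the shape of the /embedding path.
vector_embedding_policy = {
    "vectorEmbeddings": [{
        "path": "/embedding",
        "dataType": "float32",
        "dimensions": 1536,          # text-embedding-3-small output size
        "distanceFunction": "cosine",
    }]
}

# Indexing policy: DiskANN vector index on /embedding, with the large vector
# array excluded from the default range index to keep writes cheap.
indexing_policy = {
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    "excludedPaths": [{"path": "/embedding/*"}],
}
```

Both policies are passed at container creation time; excluding /embedding/* avoids range-indexing 1,536 floats on every entity write.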

Entity Document Structure

{
  "id": "session123::cazton.com",
  "session_id": "session123",
  "name": "cazton.com",
  "type": "organization",
  "facts": [
    "Consulting, Training, Recruiting",
    "One stop shop for AI and custom software",
    "Homepage: https://cazton.com/"
  ],
  "related_to": ["Chander Dhall", "Austin", "Dallas"],
  "searchText": "cazton.com organization consulting training...",
  "embedding": [0.0123, -0.0456, ...] // 1536 dimensions
}

Retrieval Modes

  • Vector Only: ORDER BY VectorDistance(c.embedding, @queryVector)
  • Full-Text Only: FULLTEXTCONTAINS(c.searchText, @phrase)
  • Hybrid (RRF): ORDER BY RANK RRF(FULLTEXTSCORE(...), VectorDistance(...))
  • Lexical Fallback: CONTAINS(c.searchText, @keyword, true) - always runs as a safety net
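Putting the vector-only mode and the lexical fallback together, the two queries used in this run's configuration would look roughly like the following; the parameter names are assumptions:

```python
# Vector-only retrieval (the mode used in this run): top-6 entities for a
# session, ordered by distance to the query embedding.
VECTOR_QUERY = """
SELECT TOP 6 c.name, c.facts, VectorDistance(c.embedding, @queryVector) AS score
FROM c
WHERE c.session_id = @sessionId
ORDER BY VectorDistance(c.embedding, @queryVector)
"""

# Lexical safety net, issued only when the vector query returns no rows.
FALLBACK_QUERY = """
SELECT TOP 6 c.name, c.facts
FROM c
WHERE c.session_id = @sessionId AND CONTAINS(c.searchText, @keyword, true)
"""
```

Filtering on session_id keeps both queries scoped to a single logical partition, which matters for cost once entity counts grow.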
 

Methodology

Full transparency on how these benchmarks were run:

  • LLM: OpenAI gpt-5.4 (chat, summarization & entity extraction)
  • Embeddings: text-embedding-3-small, 1536 dimensions
  • Temperature: 0 (deterministic outputs)
  • Database: Azure Cosmos DB for NoSQL (account: cazton2026)
  • Retrieval mode: Vector-only (DiskANN) + lexical CONTAINS fallback. Full-text search (FULLTEXTCONTAINS) and Hybrid RRF were not enabled for these benchmarks.
  • Top-K retrieval: Entity Graph retrieves top-6 entities by cosine similarity (VectorDistance), with lexical CONTAINS fallback when vector results are empty.
  • Prompt templates: Identical system prompt across all strategies; only the context-assembly method differs.
  • Seed data: 60 Cazton fact messages, loaded identically into all 4 strategies
  • Measurement: Single run per question (not averaged); latency = client-side wall-clock time for OpenAI API call (warm, not cold start)
  • Token counting: Reported by OpenAI usage object (prompt_tokens, completion_tokens)
  • Pass/fail criteria: Automated anchor keyword matching (e.g., must contain "/consulting" or "cazton.com/consulting")
  • Captured: March 4, 2026 at 21:05:51 UTC
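The pass/fail criterion above can be sketched as a one-function grader. Whether a round requires any anchor or all anchors is an assumption here; the example criterion quoted above reads as any-of:

```python
def grade(answer: str, anchors: list[str]) -> bool:
    """Pass if the answer contains at least one anchor keyword,
    matched case-insensitively."""
    text = answer.lower()
    return any(anchor.lower() in text for anchor in anchors)
```

Keyword matching keeps grading deterministic and cheap, at the cost of missing semantically correct answers that phrase an identifier differently.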
 

Complete Evidence - All 10 Cazton Rounds

Every question, every strategy, every metric. The full run log (40 rows).

Round Question Strategy Status Prompt Tok. Total Tok. Latency Fail Reason
R1 Three headline offerings? Direct LLM BASELINE 93 149 1,743ms  
Sliding Window PASS 1,335 1,360 826ms  
Hierarchical PASS 1,307 1,351 1,012ms  
Entity Graph PASS 1,773 1,817 1,058ms  
R2 Leadership profile URL? Direct LLM BASELINE 92 139 1,639ms  
Sliding Window PASS 1,334 1,358 812ms  
Hierarchical PASS 1,343 1,373 920ms  
Entity Graph PASS 1,885 1,915 926ms  
R3 Cosmos video & ebook slugs? Direct LLM BASELINE 94 148 1,314ms  
Sliding Window PASS 981 1,049 2,003ms  
Hierarchical PASS 1,368 1,403 1,219ms  
Entity Graph PASS 1,479 1,517 977ms  
R4 Contact & workshops paths? Direct LLM BASELINE 89 153 1,770ms  
Sliding Window PASS 1,026 1,050 928ms  
Hierarchical PASS 1,388 1,412 2,202ms  
Entity Graph PASS 1,885 1,908 1,821ms  
R5 Canonical URL & OG image? Direct LLM BASELINE 90 186 2,122ms  
Sliding Window PASS 1,026 1,054 770ms  
Hierarchical PASS 1,404 1,432 864ms  
Entity Graph PASS 1,908 1,936 719ms  
FB1 Related URLs for profile? Direct LLM BASELINE 93 376 4,925ms  
Sliding Window FAIL 1,042 1,087 1,505ms Missing /consulting
Hierarchical PASS 1,345 1,406 1,464ms  
Entity Graph PASS 1,290 1,419 2,189ms  
FB2 Profile + products + asset? Direct LLM BASELINE 96 262 2,996ms  
Sliding Window FAIL 1,072 1,134 1,445ms Missing /products
Hierarchical FAIL 1,394 1,479 1,683ms Missing /products
Entity Graph PASS 1,505 1,565 1,229ms  
FB3 Training URL? Direct LLM BASELINE 83 158 1,590ms  
Sliding Window PASS 1,109 1,131 1,056ms  
Hierarchical PASS 1,456 1,469 1,586ms  
Entity Graph PASS 1,966 1,988 1,107ms  
FB4 Profile + videos + presentations? Direct LLM BASELINE 90 257 3,931ms  
Sliding Window FAIL 1,115 1,164 1,471ms Missing /presentations
Hierarchical PASS 1,448 1,512 1,478ms  
Entity Graph PASS 1,455 1,498 1,306ms  
FB5 Profile + products + asset (graph)? Direct LLM BASELINE 101 256 3,481ms  
Sliding Window FAIL 1,159 1,214 1,232ms Missing graph context
Hierarchical FAIL 1,513 1,585 1,205ms Missing graph context
Entity Graph PASS 1,409 1,469 1,111ms  
 

Cost vs. Recall - Executive View

The following table shows the four strategies plotted by average prompt token cost against recall rate in this run. Bubble size in the original diagram corresponds to average latency.

Strategy Avg Prompt Tokens Recall Rate Avg Latency (ms)
Direct LLM 92 0% 2,505
Sliding Window 1,100 60% 1,108
Hierarchical 1,397 80% 1,343
Entity Graph 1,656 100% 1,258

In this run, Entity Graph reached 100% recall at ~1,656 avg prompt tokens. Sliding Window used ~1,100 tokens with 60% recall. Higher-recall strategies are clustered in the upper-right region of the cost/recall space.

Scatter plot: Cost vs. Recall for all four memory strategies

 

Conclusion

There is no single best strategy. The right choice depends on your requirements for recall precision, token budget, and implementation complexity. Match your memory architecture to your application's needs.

Direct LLM - Stateless Queries

One-shot Q&A, code generation, and translation are tasks where prior context is irrelevant. Lowest cost, zero infrastructure.

Sliding Window - Short Conversations

Best suited for customer support chats and quick troubleshooting sessions where conversations stay under 30 turns and recent context matters most. Lowest prompt-token use among memory-enabled options in this run.

Hierarchical - Medium-Length Workflows

Best suited for project planning and multi-session consultations where key facts need to persist across 30 to 100 turns. Good balance of recall and cost.

Entity Graph - Long-Running Agents

Best suited for enterprise assistants, knowledge workers, and CRM-integrated bots where high identifier-level recall matters and conversations span many turns. Highest recall in this run, with more implementation overhead.

The Bottom Line

Dimension Winner Why
Lowest Token Cost Sliding Window ~1,100 prompt tokens avg - 33% less than Entity Graph
Lowest Latency Sliding Window ~1,068 ms avg on core rounds - fastest for simple recall
Best Recall Entity Graph 100% recall in this run (10/10); do not assume universal performance without local validation
Best Balance Hierarchical 80% recall with moderate token cost
Simplest Implementation Direct LLM ~60 lines of code, no memory management

Final Thought: Memory is not just a feature; it is a spectrum. Start with the simplest strategy that meets your recall requirements, and upgrade when your users' conversations demand it. In this reference implementation, all four strategies share the same partition key (/session_id), SDK, and operational model. Moving from Sliding Window to Entity Graph can be handled as an application-layer change instead of an infrastructure migration.
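One way to make that upgrade path concrete: if every strategy implements the same small context-assembly interface over the same /session_id-partitioned data, swapping strategies is a code change rather than a data migration. The interface and class below are a hypothetical sketch, not the reference implementation:

```python
from typing import Protocol

class MemoryStrategy(Protocol):
    """Hypothetical app-layer interface. All strategies read and write
    documents partitioned by /session_id, so swapping implementations
    requires no infrastructure change."""
    def build_context(self, session_id: str, user_message: str) -> list[dict]: ...
    def record_turn(self, session_id: str, user_message: str, reply: str) -> None: ...

class SlidingWindowMemory:
    def __init__(self):
        self.store: dict[str, list[dict]] = {}  # stand-in for a Cosmos container

    def build_context(self, session_id: str, user_message: str) -> list[dict]:
        history = self.store.get(session_id, [])[-30:]
        return history + [{"role": "user", "content": user_message}]

    def record_turn(self, session_id: str, user_message: str, reply: str) -> None:
        self.store.setdefault(session_id, []).extend([
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": reply},
        ])

def answer(strategy: MemoryStrategy, session_id: str, msg: str) -> list[dict]:
    # The caller never changes when the strategy implementation is swapped.
    return strategy.build_context(session_id, msg)
```

An EntityGraphMemory class satisfying the same protocol could then replace SlidingWindowMemory with no change to callers or to the stored session documents.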

Cazton is composed of technical professionals with expertise gained all over the world and across every field of the tech industry, and we put this expertise to work for you. We serve all industries, including banking, finance, legal services, life sciences & healthcare, technology, media, and the public sector.

Cazton has expanded into a global company, serving clients not only across the United States but also in Oslo, Norway; Stockholm, Sweden; London, England; Berlin and Frankfurt, Germany; Paris, France; Amsterdam, Netherlands; Brussels, Belgium; Rome, Italy; Sydney and Melbourne, Australia; and, across Canada, in Quebec City, Toronto, Vancouver, Montreal, Ottawa, Calgary, Edmonton, Victoria, and Winnipeg. In the United States, we provide consulting and training services in cities including Austin, Dallas, Houston, New York, New Jersey, Irvine, Los Angeles, Denver, Boulder, Charlotte, Atlanta, Orlando, Miami, San Antonio, San Diego, San Francisco, San Jose, and Stamford. Contact us today to learn more about what our experts can do for you.