AI Agent Memory Patterns

  • Entity Graph recall: In this captured Cazton run, Entity Graph achieved 100% recall (10/10) by storing structured entity documents with vector embeddings in Azure Cosmos DB for NoSQL. Sliding Window reached only 60% recall, at roughly one-third fewer prompt tokens.
  • Four strategies benchmarked: Direct LLM, Sliding Window, Hierarchical 3-tier, and Entity Graph were measured on recall rate, average prompt tokens, and latency using a 60-message Cazton seed dataset.
  • Summary compression risk: Summary-based strategies lose specific identifiers such as URLs and numeric values under compression; Entity Graph preserves them as queryable structured documents that survive any conversation length.
  • Memory reduces latency: Memory-backed strategies responded faster than Direct LLM in this run because structured context produces concise, confident answers instead of lengthy elaborations on missing information.
  • Zero-migration upgrade path: All four strategies share the same Azure Cosmos DB partition key and OpenAI integration model, so upgrading from Sliding Window to Entity Graph is an application-layer change with no infrastructure migration required.
  • Advanced Cosmos DB capabilities: Azure Cosmos DB for NoSQL offers DiskANN vector indexes, float16 storage, filtered vector search, Hybrid RRF, and hierarchical partition keys, each of which can improve retrieval quality or reduce cost at scale.
  • Implementation companion: For the complete Python and .NET code behind every strategy in this benchmark, see the companion article Code Walkthrough: AI Agent Memory Patterns.
  • Top clients: We help Fortune 500, large, mid-size, and startup companies with AI development, consulting, and hands-on training services. Our clients include Microsoft, Google, Broadcom, Thomson Reuters, Bank of America, Macquarie, Dell, and more.
 

Executive Summary

A practical comparison of four state-management patterns for multi-turn AI workflows, ranging from zero memory to structured entity graphs, using one captured run on a Cazton seed dataset.

  • 100% - Entity Graph Recall (10/10 questions)
  • 60% - Sliding Window Recall (6/10 questions)
  • 1,100 - Lowest Average Prompt Tokens (memory-enabled strategy)

In this captured run, summary-based strategies dropped some specific identifiers (URLs, names, numbers) in longer conversations. Treat this as directional evidence, not a universal rule.

  • Highest recall in this run: Entity Graph - 10/10 passes in the Cazton run log shown below.
  • Lowest average prompt cost (memory-enabled): Sliding Window - about 33% fewer prompt tokens than Entity Graph in this run, with 6/10 passes.
  • Recommendation: Start with Sliding Window for short conversations (under 30 turns). Switch to Entity Graph when high identifier-level recall matters; in this implementation, the upgrade is an application-layer change, not an infrastructure migration.
 

Introduction

Large Language Models are stateless by design. Every API call starts with a blank slate. Without persistent memory, your AI assistant forgets everything the moment a conversation exceeds its context window, or even between turns.

This article presents a scoped comparison of four memory architectures for multi-turn AI agents, using Azure Cosmos DB for NoSQL and OpenAI gpt-5.4. We use a Cazton seed dataset (60 factual messages) plus a single captured run log. Results show how each strategy behaved in this environment; validate against your own workload before standardizing. For the complete Python and .NET implementation of every strategy covered here, see the companion article Code Walkthrough: AI Agent Memory Patterns.

The Core Question

If you get the same correct answer from three different memory strategies, does it matter which one you use? Absolutely. The difference lies in token cost, latency, scalability, and what actually happens as conversations grow beyond 30, 50, or 100 turns.

  • 4 Memory Strategies Compared
  • 60 Cazton Seed Messages
  • 10 Recall Questions Tested
  • 100% Entity Graph Recall Rate (this run)
 

The Four Strategies

Each strategy represents a different philosophy for managing conversational state. They range from "no memory at all" to "structured knowledge graph with vector retrieval."

Strategy 0 - Direct LLM (No Memory)

The baseline. Each turn sends only the system prompt and the current user message. Zero history, zero recall.

  • Stored in Cosmos: Session metadata only (turn count, timestamp)
  • Context sent: System prompt + 1 message
  • Prompt tokens: ~90 per turn
  • Recall ability: None
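As a quick sketch of what this baseline actually sends, assuming an OpenAI-style chat message list (the function name is illustrative, not taken from the benchmark code):

```python
def build_direct_llm_request(system_prompt: str, user_message: str) -> list[dict]:
    """Strategy 0: no history. Every call carries only the system prompt and
    the current user turn, so nothing said earlier can ever be recalled."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = build_direct_llm_request(
    "You are a helpful assistant.",
    "What are Cazton's three headline offerings?",
)
```

Because the payload never grows, prompt usage stays near the ~90 tokens per turn noted above.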

Strategy 1 - Sliding Window

Keeps the last 30 messages verbatim plus a rolling summary of older messages. Extracts identity and numeric anchors via regex.

  • Window: 30 messages + rolling summary
  • TTL: 1 hour
  • Summarizer: gpt-5.4
  • Weakness: Summary drift loses specific details (URLs, exact names)
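A minimal sketch of the two mechanics above, with an in-memory message list standing in for the Cosmos DB container; the regexes and function names are illustrative assumptions, not the benchmark's actual code:

```python
import re

WINDOW = 30  # verbatim messages kept, per the strategy description

URL_RE = re.compile(r"https?://\S+|/[a-z][a-z0-9-]+")
NUM_RE = re.compile(r"\b\d[\d,.]*\b")

def extract_anchors(text: str) -> list[str]:
    """Regex pass that pulls URLs/paths and numbers out of a message so the
    rolling summary can retain them even when prose is compressed."""
    return URL_RE.findall(text) + NUM_RE.findall(text)

def assemble_context(messages: list[str], rolling_summary: str) -> list[str]:
    """Context = rolling summary of older turns + last WINDOW messages verbatim."""
    context = []
    if rolling_summary:
        context.append(f"Summary of earlier conversation: {rolling_summary}")
    context.extend(messages[-WINDOW:])
    return context
```

The anchors help, but anything the summarizer drops that the regexes did not capture is gone, which is exactly the failure mode shown later in this article.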

Strategy 2 - Hierarchical (3-Tier)

Three memory tiers: 10 recent messages (hot), up to 4 summary blocks (warm), and a persistent facts document (cold).

  • Tier 1: Last 10 verbatim messages
  • Tier 2: Summary blocks (every 10 messages, max 4)
  • Tier 3: Extracted bullet-point facts (persistent)
  • Weakness: Still summarizes, which can lose some precision
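The three tiers can be sketched as one context-assembly function, again with plain Python lists standing in for Cosmos DB documents (constants and names are illustrative):

```python
HOT_WINDOW = 10     # Tier 1: verbatim recent messages
MAX_BLOCKS = 4      # Tier 2: cap on retained summary blocks

def assemble_hierarchical_context(
    messages: list[str],
    summaries: list[str],   # oldest-first Tier 2 summary blocks
    facts: list[str],       # Tier 3 persistent bullet-point facts
) -> str:
    parts = []
    if facts:
        # Cold tier: extracted facts persist for the life of the session.
        parts.append("Known facts:\n" + "\n".join(f"- {f}" for f in facts))
    for s in summaries[-MAX_BLOCKS:]:     # warm tier: at most 4 summaries
        parts.append(f"Earlier: {s}")
    parts.extend(messages[-HOT_WINDOW:])  # hot tier: verbatim recency
    return "\n\n".join(parts)
```

The persistent facts tier is why Hierarchical outperforms Sliding Window in the fallback rounds below, while the remaining reliance on summaries explains its residual failures.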

Strategy 3 - Entity Graph

Stores structured entity documents with facts, relationships, and vector embeddings. Retrieves relevant entities per query using Cosmos DB vector search.

  • Recent: Last 3 turn pairs (6 messages)
  • Entities: Normalized docs with facts + embeddings
  • Retrieval: Vector + lexical fallback (full-text search and Hybrid RRF were not enabled in this run)
  • Strength: High precision for specific facts, URLs, and relationships in this run
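A sketch of the per-query retrieval step, with client-side cosine similarity standing in for Cosmos DB's VectorDistance and a substring check standing in for the lexical CONTAINS fallback (the structure is assumed from the description above, not copied from the benchmark code):

```python
import math

TOP_K = 6  # entities retrieved per query in this run

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_entities(query_vec: list[float], query_text: str, entities: list[dict]) -> list[dict]:
    """Vector top-K over entity embeddings, with a lexical fallback when
    vector search returns nothing."""
    scored = sorted(
        (e for e in entities if e.get("embedding")),
        key=lambda e: cosine(query_vec, e["embedding"]),
        reverse=True,
    )[:TOP_K]
    if scored:
        return scored
    q = query_text.lower()
    return [e for e in entities if q in e.get("searchText", "").lower()][:TOP_K]
```

The retrieved entities' fact bullets are then prepended to the last six messages, which is why specific identifiers survive regardless of how long the conversation runs.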

How They Work - Architecture Comparison

The following table summarizes how each strategy stores data in Azure Cosmos DB for NoSQL and what it sends to the LLM as context:

Strategy Context Sent to LLM Storage in Cosmos Avg Prompt Tokens Recall (This Run)
Direct LLM System prompt + 1 message Session metadata only ~90 0%
Sliding Window Summary + last 30 messages Messages + rolling summary ~1,100 60%
Hierarchical 3 tiers + anchors Tier 1 messages + Tier 2 summaries + Tier 3 facts ~1,370 80%
Entity Graph User identity + entity bullets + last 6 messages Entity docs + embeddings ~1,660 100%

Each strategy stores progressively richer data in Cosmos DB, trading token cost for recall precision.

Diagram: Memory Architecture Flow comparing Direct LLM, Sliding Window, Hierarchical, and Entity Graph storage and recall

 

Same Result, Different Cost

In the first five core recall questions of this run, all three memory strategies (Sliding Window, Hierarchical, Entity Graph) returned the expected answer. Cost still varied materially.

Example 1 - What Are the Three Headline Offerings?

After ingesting 60 Cazton fact messages, we ask a basic recall question. All three memory strategies returned the expected answer, with different token and latency profiles:

Strategy Status Prompt Tokens Total Tokens Latency Answer
Direct LLM BASELINE 93 149 1,743 ms "I don't have access..."
Sliding Window PASS 1,335 1,360 826 ms Consulting, Training, Recruiting
Hierarchical PASS 1,307 1,351 1,012 ms Consulting, Training, Recruiting
Entity Graph PASS 1,773 1,817 1,058 ms Consulting, Training, Recruiting

Key Insight: For this question, Sliding Window used 25% fewer prompt tokens than Entity Graph and returned 22% faster. When the answer is in recent context, simpler context assembly can be cheaper.

All Five Core Rounds - Cazton Dataset

Here is the complete picture across all five core recall questions:

Question Direct LLM Sliding Window Hierarchical Entity Graph
Three headline offerings? BASELINE PASS PASS PASS
Leadership profile URL? BASELINE PASS PASS PASS
Cosmos video & ebook slugs? BASELINE PASS PASS PASS
Contact & workshops paths? BASELINE PASS PASS PASS
Canonical URL & OG image? BASELINE PASS PASS PASS

Average Prompt Tokens per Question (Core Rounds)

Strategy Avg Prompt Tokens (Core Rounds)
Direct LLM 90
Sliding Window 1,140
Hierarchical 1,362
Entity Graph 1,786

Bar chart: Average Prompt Tokens per Question by strategy

Average Latency per Question (Core Rounds)

Strategy Avg Latency (ms)
Direct LLM 1,718
Sliding Window 1,068
Hierarchical 1,243
Entity Graph 1,116

Bar chart: Average Latency per Question by strategy

Surprising Finding: Direct LLM has the highest latency despite sending the fewest tokens. Why? Because it sends no context, the model spends more time generating a longer "I don't know" response with suggestions. Memory-backed strategies produce concise, confident answers faster.

 

When Sliding Window Fails

The sliding window strategy compresses older messages into a rolling summary. When specific details such as URLs, exact names, and niche facts fall outside the 30-message window, they get summarized away.

Example - List Related URLs for the Leadership Profile

We ask for URLs related to https://cazton.com/about/chander-dhall. The consulting URL (/consulting) was mentioned early in the conversation and has since been summarized out of the sliding window.

Strategy Result Response Notes
Sliding Window FAIL "Related Cazton URLs we captured (in the same 'About/Contact' area) are: about-us, contact-us" Missing: /consulting URL lost during summary compression
Hierarchical PASS "About Us, Homepage, Contact, consulting"  
Entity Graph PASS "homepage, about-us, consulting, trainings, workshops, products, ebooks, presentations, videos, contact-us" 10 URLs recalled - most comprehensive answer

Example - Profile-Linked URLs Plus Videos and Presentations

Strategy Status Prompt Tokens Latency Missing Detail
Sliding Window FAIL 1,115 1,471 ms Missing /presentations URL
Hierarchical PASS 1,448 1,478 ms  
Entity Graph PASS 1,455 1,306 ms  

Why Sliding Window Fails: The rolling summary is designed to be concise. When the summarizer compresses 30+ messages into a paragraph, it prioritizes the gist over specific URLs like /presentations. The information existed but was lost in compression.

 

When Both Sliding Window and Hierarchical Fail

This is where Entity Graph truly shines. Some questions require precise recall of specific entity relationships, exactly the kind of detail that tends to be lost in summarization.

Example - Return Profile-Linked URLs Including Products and Non-Page Asset

This question requires three specific pieces: (1) the leadership profile URL, (2) the products page URL, and (3) a non-page asset URL like the OG image. Both Sliding Window and Hierarchical fail because they cannot reliably surface the /products URL from their compressed memories.

Strategy Status Prompt Tokens Latency What Failed
Direct LLM BASELINE 96-101 2,996-3,481 ms No memory at all
Sliding Window FAIL 1,072-1,159 1,232-1,445 ms Missing /products + graph context
Hierarchical FAIL 1,394-1,513 1,205-1,683 ms Missing /products + graph context
Entity Graph PASS 1,409-1,505 1,111-1,229 ms  

Entity Graph Response (Correct)

// From our stored graph context:
1) Leadership profile URL: https://cazton.com/about/chander-dhall
2) Products URL: https://cazton.com/products
3) Non-page asset URL: https://cazton.com/images/common/cazton-cover.webp

The Default Dataset Tells the Same Story

Using the FakeCompany/Placeholder fictional dataset (60-turn conversation), the divergence is even more dramatic:

Question Sliding Window Hierarchical Entity Graph
Project repository URL? FAIL FAIL PASS
Project deadline? FAIL PASS PASS
Team lead + preferred language? FAIL PASS PASS
Deadline + sponsor name? FAIL PASS PASS
Four main frontend views? FAIL FAIL PASS

Entity Graph - Highest Recall in This Run: In the 10 Cazton rounds shown here, Entity Graph passed all recall checks. Because it stores structured entities instead of relying only on compressed summaries, it reduced summary-loss errors in this run.

 

Comprehensive Metrics Dashboard

Here is every metric across all 10 Cazton rounds.

Complete Scorecard - All 10 Rounds

Pass/fail results across all 10 Cazton rounds (R1-R5 = core recall questions; FB1-FB5 = fallback/harder rounds):

  • Direct LLM: 0/10 - no memory, all rounds scored as baseline
  • Sliding Window: 6/10 - passed R1, R2, R3, R4, R5, FB3; failed FB1, FB2, FB4, FB5
  • Hierarchical: 8/10 - passed R1, R2, R3, R4, R5, FB1, FB3, FB4; failed FB2, FB5
  • Entity Graph: 10/10 - passed all rounds

Pass/fail scorecard across 10 Cazton rounds for all four strategies

Strategy Comparison Summary

Metric Direct LLM Sliding Window Hierarchical Entity Graph
Recall Rate (Cazton) 0% 60% 80% 100%
Avg Prompt Tokens 92 1,100 1,397 1,656
Avg Latency (ms) 2,505 1,108 1,343 1,258
Context Window 1 message 30 messages 10 + summaries 6 + entities
Message TTL None 1 hour 6 hours 6 hours
LLM Calls per Turn 1 1-2 1-2 2-3
Cosmos Containers 1 1 1 2
Vector Search No No No Yes
 

Cosmos DB Architecture

All four strategies share a common foundation: Azure Cosmos DB for NoSQL with partition key /session_id. The difference is in what they store and how they query.

Container Strategy Document Types Special Indexes
direct_llm_sessions Direct LLM session None
sliding_window_sessions Sliding Window session, msg None
hierarchical_sessions Hierarchical session, msg, tier2_summary, tier3_facts None
entity_graph_sessions Entity Graph session, msg None
entity_graph_entities Entity Graph entity (with embedding) DiskANN vector index on /embedding
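The DiskANN index on entity_graph_entities implies that the container was created with a vector embedding policy and a matching vector index entry. The shapes below follow Azure Cosmos DB for NoSQL's vector search policies; treat the exact values as a sketch rather than the benchmark's actual configuration:

```python
# Vector embedding policy: tells Cosmos DB the shape of the /embedding path.
vector_embedding_policy = {
    "vectorEmbeddings": [{
        "path": "/embedding",
        "dataType": "float32",
        "dimensions": 1536,          # text-embedding-3-small output size
        "distanceFunction": "cosine",
    }]
}

# Indexing policy: DiskANN vector index on /embedding, with the large vector
# array excluded from the default range index to keep writes cheap.
indexing_policy = {
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    "excludedPaths": [{"path": "/embedding/*"}],
}
```

Both policies are passed at container creation time; excluding /embedding/* avoids range-indexing 1,536 floats on every entity write.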

Entity Document Structure

{
  "id": "session123::cazton.com",
  "session_id": "session123",
  "name": "cazton.com",
  "type": "organization",
  "facts": [
    "Consulting, Training, Recruiting",
    "One stop shop for AI and custom software",
    "Homepage: https://cazton.com/"
  ],
  "related_to": ["Chander Dhall", "Austin", "Dallas"],
  "searchText": "cazton.com organization consulting training...",
  "embedding": [0.0123, -0.0456, ...] // 1536 dimensions
}

Retrieval Modes

  • Vector Only: ORDER BY VectorDistance(c.embedding, @queryVector)
  • Full-Text Only: FULLTEXTCONTAINS(c.searchText, @phrase)
  • Hybrid (RRF): ORDER BY RANK RRF(FULLTEXTSCORE(...), VectorDistance(...))
  • Lexical Fallback: CONTAINS(c.searchText, @keyword, true) - always runs as a safety net
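Putting the vector-only mode and the lexical fallback together, the two queries used in this run's configuration would look roughly like the following; the parameter names are assumptions:

```python
# Vector-only retrieval (the mode used in this run): top-6 entities for a
# session, ordered by distance to the query embedding.
VECTOR_QUERY = """
SELECT TOP 6 c.name, c.facts, VectorDistance(c.embedding, @queryVector) AS score
FROM c
WHERE c.session_id = @sessionId
ORDER BY VectorDistance(c.embedding, @queryVector)
"""

# Lexical safety net, issued only when the vector query returns no rows.
FALLBACK_QUERY = """
SELECT TOP 6 c.name, c.facts
FROM c
WHERE c.session_id = @sessionId AND CONTAINS(c.searchText, @keyword, true)
"""
```

Filtering on session_id keeps both queries scoped to a single logical partition, which matters for cost once entity counts grow.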
 

Methodology

Full transparency on how these benchmarks were run:

  • LLM: OpenAI gpt-5.4 (chat, summarization & entity extraction)
  • Embeddings: text-embedding-3-small, 1536 dimensions
  • Temperature: 0 (deterministic outputs)
  • Database: Azure Cosmos DB for NoSQL (account: cazton2026)
  • Retrieval mode: Vector-only (DiskANN) + lexical CONTAINS fallback. Full-text search (FULLTEXTCONTAINS) and Hybrid RRF were not enabled for these benchmarks.
  • Top-K retrieval: Entity Graph retrieves top-6 entities by cosine similarity (VectorDistance), with lexical CONTAINS fallback when vector results are empty.
  • Prompt templates: Identical system prompt across all strategies; only the context-assembly method differs.
  • Seed data: 60 Cazton fact messages, loaded identically into all 4 strategies
  • Measurement: Single run per question (not averaged); latency = client-side wall-clock time for OpenAI API call (warm, not cold start)
  • Token counting: Reported by OpenAI usage object (prompt_tokens, completion_tokens)
  • Pass/fail criteria: Automated anchor keyword matching (e.g., must contain "/consulting" or "cazton.com/consulting")
  • Captured: March 4, 2026 at 21:05:51 UTC
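The pass/fail criterion above can be sketched as a one-function grader. Whether a round requires any anchor or all anchors is an assumption here; the example criterion quoted above reads as any-of:

```python
def grade(answer: str, anchors: list[str]) -> bool:
    """Pass if the answer contains at least one anchor keyword,
    matched case-insensitively."""
    text = answer.lower()
    return any(anchor.lower() in text for anchor in anchors)
```

Keyword matching keeps grading deterministic and cheap, at the cost of missing semantically correct answers that phrase an identifier differently.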
 

Complete Evidence - All 10 Cazton Rounds

Every question, every strategy, every metric. The full run log (40 rows).

Round Question Strategy Status Prompt Tok. Total Tok. Latency Fail Reason
R1 Three headline offerings? Direct LLM BASELINE 93 149 1,743ms  
Sliding Window PASS 1,335 1,360 826ms  
Hierarchical PASS 1,307 1,351 1,012ms  
Entity Graph PASS 1,773 1,817 1,058ms  
R2 Leadership profile URL? Direct LLM BASELINE 92 139 1,639ms  
Sliding Window PASS 1,334 1,358 812ms  
Hierarchical PASS 1,343 1,373 920ms  
Entity Graph PASS 1,885 1,915 926ms  
R3 Cosmos video & ebook slugs? Direct LLM BASELINE 94 148 1,314ms  
Sliding Window PASS 981 1,049 2,003ms  
Hierarchical PASS 1,368 1,403 1,219ms  
Entity Graph PASS 1,479 1,517 977ms  
R4 Contact & workshops paths? Direct LLM BASELINE 89 153 1,770ms  
Sliding Window PASS 1,026 1,050 928ms  
Hierarchical PASS 1,388 1,412 2,202ms  
Entity Graph PASS 1,885 1,908 1,821ms  
R5 Canonical URL & OG image? Direct LLM BASELINE 90 186 2,122ms  
Sliding Window PASS 1,026 1,054 770ms  
Hierarchical PASS 1,404 1,432 864ms  
Entity Graph PASS 1,908 1,936 719ms  
FB1 Related URLs for profile? Direct LLM BASELINE 93 376 4,925ms  
Sliding Window FAIL 1,042 1,087 1,505ms Missing /consulting
Hierarchical PASS 1,345 1,406 1,464ms  
Entity Graph PASS 1,290 1,419 2,189ms  
FB2 Profile + products + asset? Direct LLM BASELINE 96 262 2,996ms  
Sliding Window FAIL 1,072 1,134 1,445ms Missing /products
Hierarchical FAIL 1,394 1,479 1,683ms Missing /products
Entity Graph PASS 1,505 1,565 1,229ms  
FB3 Training URL? Direct LLM BASELINE 83 158 1,590ms  
Sliding Window PASS 1,109 1,131 1,056ms  
Hierarchical PASS 1,456 1,469 1,586ms  
Entity Graph PASS 1,966 1,988 1,107ms  
FB4 Profile + videos + presentations? Direct LLM BASELINE 90 257 3,931ms  
Sliding Window FAIL 1,115 1,164 1,471ms Missing /presentations
Hierarchical PASS 1,448 1,512 1,478ms  
Entity Graph PASS 1,455 1,498 1,306ms  
FB5 Profile + products + asset (graph)? Direct LLM BASELINE 101 256 3,481ms  
Sliding Window FAIL 1,159 1,214 1,232ms Missing graph context
Hierarchical FAIL 1,513 1,585 1,205ms Missing graph context
Entity Graph PASS 1,409 1,469 1,111ms  
 

Cost vs. Recall - Executive View

The following table shows the four strategies plotted by average prompt token cost against recall rate in this run. Bubble size in the original diagram corresponds to average latency.

Strategy Avg Prompt Tokens Recall Rate Avg Latency (ms)
Direct LLM 92 0% 2,505
Sliding Window 1,100 60% 1,108
Hierarchical 1,397 80% 1,343
Entity Graph 1,656 100% 1,258

In this run, Entity Graph reached 100% recall at ~1,656 avg prompt tokens. Sliding Window used ~1,100 tokens with 60% recall. Higher-recall strategies are clustered in the upper-right region of the cost/recall space.

Scatter plot: Cost vs. Recall for all four memory strategies

 

Conclusion

There is no single best strategy. The right choice depends on your requirements for recall precision, token budget, and implementation complexity. Match your memory architecture to your application's needs.

Direct LLM - Stateless Queries

One-shot Q&A, code generation, and translation are tasks where prior context is irrelevant. Lowest cost, zero infrastructure.

Sliding Window - Short Conversations

Best suited for customer support chats and quick troubleshooting sessions where conversations stay under 30 turns and recent context matters most. Lowest prompt-token use among memory-enabled options in this run.

Hierarchical - Medium-Length Workflows

Best suited for project planning and multi-session consultations where key facts need to persist across 30 to 100 turns. Good balance of recall and cost.

Entity Graph - Long-Running Agents

Best suited for enterprise assistants, knowledge workers, and CRM-integrated bots where high identifier-level recall matters and conversations span many turns. Highest recall in this run, with more implementation overhead.

The Bottom Line

Dimension Winner Why
Lowest Token Cost Sliding Window ~1,100 prompt tokens avg - 33% less than Entity Graph
Lowest Latency Sliding Window ~1,068 ms avg on core rounds - fastest for simple recall
Best Recall Entity Graph 100% recall in this run (10/10); do not assume universal performance without local validation
Best Balance Hierarchical 80% recall with moderate token cost
Simplest Implementation Direct LLM ~60 lines of code, no memory management

Final Thought: Memory is not just a feature; it is a spectrum. Start with the simplest strategy that meets your recall requirements, and upgrade when your users' conversations demand it. In this reference implementation, all four strategies share the same partition key (/session_id), SDK, and operational model. Moving from Sliding Window to Entity Graph can be handled as an application-layer change instead of an infrastructure migration.
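One way to make that upgrade path concrete: if every strategy implements the same small context-assembly interface over the same /session_id-partitioned data, swapping strategies is a code change rather than a data migration. The interface and class below are a hypothetical sketch, not the reference implementation:

```python
from typing import Protocol

class MemoryStrategy(Protocol):
    """Hypothetical app-layer interface. All strategies read and write
    documents partitioned by /session_id, so swapping implementations
    requires no infrastructure change."""
    def build_context(self, session_id: str, user_message: str) -> list[dict]: ...
    def record_turn(self, session_id: str, user_message: str, reply: str) -> None: ...

class SlidingWindowMemory:
    def __init__(self):
        self.store: dict[str, list[dict]] = {}  # stand-in for a Cosmos container

    def build_context(self, session_id: str, user_message: str) -> list[dict]:
        history = self.store.get(session_id, [])[-30:]
        return history + [{"role": "user", "content": user_message}]

    def record_turn(self, session_id: str, user_message: str, reply: str) -> None:
        self.store.setdefault(session_id, []).extend([
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": reply},
        ])

def answer(strategy: MemoryStrategy, session_id: str, msg: str) -> list[dict]:
    # The caller never changes when the strategy implementation is swapped.
    return strategy.build_context(session_id, msg)
```

An EntityGraphMemory class satisfying the same protocol could then replace SlidingWindowMemory with no change to callers or to the stored session documents.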

Cazton is composed of technical professionals with expertise gained all over the world and across every field of the tech industry, and we put this expertise to work for you. We serve all industries, including banking, finance, legal services, life sciences & healthcare, technology, media, and the public sector.

Cazton has expanded into a global company, serving clients not only across the United States but also in Oslo, Norway; Stockholm, Sweden; London, England; Berlin and Frankfurt, Germany; Paris, France; Amsterdam, Netherlands; Brussels, Belgium; Rome, Italy; Sydney and Melbourne, Australia; and, across Canada, in Quebec City, Toronto, Vancouver, Montreal, Ottawa, Calgary, Edmonton, Victoria, and Winnipeg. In the United States, we provide consulting and training services in cities including Austin, Dallas, Houston, New York, New Jersey, Irvine, Los Angeles, Denver, Boulder, Charlotte, Atlanta, Orlando, Miami, San Antonio, San Diego, San Francisco, San Jose, and Stamford. Contact us today to learn more about what our experts can do for you.