
Compounding Memory — Enterprise Agent Context Without the Token Tax

Executive summary

Every enterprise running AI agents at scale eventually meets the same wall. The agents that delight a pilot user start to slow down, cost more per turn, and forget the things that made them useful in the first place. The technical name for the wall is context. The operational name is a linearly scaling token bill and a steadily climbing P95 latency. The strategic name is the gap between agentic systems that compound with every interaction and agentic systems that decay into more expensive autocomplete.

The default remedies do not close this gap. Stuffing the full conversation history into every prompt is what produces the token bill. Stateless agents that forget between turns push the cognitive load back onto the user. Generic vector databases improve retrieval, but they say nothing about what to remember, what to forget, what to summarise, and what to never retain. They say nothing at all about regulated-industry obligations like the right to erasure or tenant-level isolation.

The FAOS platform addresses this gap with a capability called Compounding Memory. The architecture rests on three design choices. First, a three-tier memory hierarchy — raw resources, distilled items, and aggregated categories — that lets the system retrieve from the cheapest layer that holds enough signal. Second, a hybrid retrieval and adaptive stopping pipeline that combines semantic and keyword search, fuses ranks via Reciprocal Rank Fusion, and terminates as soon as sufficient context has been gathered. Third, a regulated-industry lifecycle — tenant-scoped row-level security, configurable decay, audited consolidation, and a per-record erasure path governed by an explicit architecture decision record — so the memory system can be deployed in an SBV-regulated bank or a HIPAA-bounded healthcare workflow without becoming a compliance liability.

The headline implication for a CIO running an enterprise agent programme: memory is not a vector-database integration problem. It is a domain-bounded lifecycle problem. The FAOS Compounding Memory architecture is the implementation of that conviction, designed so context accumulates over time at sub-linear cost, and so an audit committee can answer how that context is governed.

The problem — why enterprise agents lose context, money, or both

The narrative around enterprise agents has converged on a comforting middle. Add a vector database, plug in retrieval, give the agent a long context window, and the memory problem is solved. The narrative is not wrong; it is incomplete. Three failure modes recur across the pilots that the FAOS team has reviewed in regulated industries.

Failure mode one — the token tax. Long-context models scale linearly in cost with input length, so an agent that resends its conversation history pays for that history again on every turn. A finance analyst agent whose retained history is three times the size of its base prompt sends four times the input tokens on each turn: a 4× per-turn cost penalty against a stateless agent. At single-pilot scale this is negligible. Across a deployment of many agents serving thousands of users, the per-turn penalty becomes the dominant line item in the inference budget. The economics that worked for the proof of concept stop working for the rollout.
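To make the arithmetic concrete, the sketch below computes the per-turn input size for an agent that resends every prior turn, and the resulting multiple against a stateless agent. The token counts are illustrative assumptions chosen to reproduce a 4× multiple, not FAOS measurements.

```python
def per_turn_input_tokens(turn: int, base_prompt: int, tokens_per_turn: int) -> int:
    """Input tokens on a given turn (1-indexed) when all prior turns are resent."""
    history = (turn - 1) * tokens_per_turn
    return base_prompt + history


def cost_multiple(turn: int, base_prompt: int, tokens_per_turn: int) -> float:
    """Per-turn input cost relative to a stateless agent that sends only the base prompt."""
    return per_turn_input_tokens(turn, base_prompt, tokens_per_turn) / base_prompt


# Illustrative numbers only: a 1,000-token base prompt, 500 tokens added per turn.
BASE, PER_TURN = 1_000, 500
for turn in (1, 7, 21):
    print(turn, per_turn_input_tokens(turn, BASE, PER_TURN), cost_multiple(turn, BASE, PER_TURN))
```

Under these assumed numbers the 4× multiple is reached at turn seven and keeps climbing with every further turn, which is why the penalty is invisible in a short pilot and dominant in a long-running deployment.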

Failure mode two — retrieval without lifecycle. A vector database tells you how to find the nearest neighbours of a query. It does not tell you which memories should still be findable. Without a lifecycle, the index accumulates contradictions ("the user prefers REST" alongside "the user now prefers GraphQL"), stale facts ("the user works at Company A" after they have moved to Company B), and irrelevant detritus from one-off interactions. Retrieval accuracy degrades not because the algorithm is wrong, but because the corpus is unmanaged. The agent retrieves a confident answer to the wrong query, and the user — or worse, the regulator — notices.

Failure mode three — the compliance gap. GDPR Article 17, HIPAA, and the SBV banking circulars all require that a system holding personal data be able to surface, correct, and erase that data on request, with an audit trail demonstrating the erasure. Most vector-database integrations were not designed for this. Embeddings are derived artifacts. Summaries are derived artifacts. Categorical aggregations are derived artifacts. A naive "delete the row" approach leaves the derived state intact, which the regulator will read as a failed erasure. The compliance gap is not a feature request. It is a structural property of how memory subsystems are typically built, and it is what blocks enterprise agents from production in regulated workflows.

What enterprise agents need is not better retrieval. They need a memory subsystem that compounds: one that gets cheaper per turn as the conversation lengthens, that maintains a sane lifecycle of what to keep and what to release, and that treats tenant isolation and per-record erasure as architectural primitives rather than late-stage retrofits. The FAOS Compounding Memory architecture is built against those three requirements.

The FAOS approach — three design choices for a compounding memory subsystem

Compounding Memory is a coherent set of architectural choices, not a single component. Three primitives anchor the design. Each is named so the principal engineer and the head of AI can refer to the same thing.

Primitive one — the three-tier memory hierarchy

The memory subsystem stores agent context across three progressive layers of abstraction. Resources are raw conversational turns, document chunks, and event records: the unfiltered substrate. Items are discrete, distilled units derived from resources: a single preference, a single fact, a single observed behaviour. Categories are LLM-generated aggregations of related items: a coherent summary of a topic, a profile section, a rolled-up view of a workflow.

The hierarchy is not just an organisational scheme. It is the engine that drives cost reduction. Retrieval starts at the category layer, the cheapest tier in tokens and latency. If the categories hold enough signal to answer the query, the system never reads an item or a resource. If not, retrieval descends to items. If items are still insufficient, retrieval descends to resources. The cheaper layers carry the typical query; the expensive layers exist for the queries that need them.

This is also the layer where the compounding in Compounding Memory becomes operational. As conversations and interactions accumulate, the consolidation pipeline rolls items into categories; once a body of categorical context exists, subsequent turns retrieve from it rather than from the raw history. The token cost per turn flattens as the conversation grows, instead of scaling linearly with it. The hierarchy is what turns "more interactions" from a tax into a tailwind.
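One way to picture the three tiers and the provenance links between them is the minimal sketch below. The class and field names are hypothetical and do not describe the FAOS schema; the whitespace token estimate is a deliberately crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Raw substrate: a conversational turn, document chunk, or event record."""
    resource_id: str
    tenant_id: str
    text: str


@dataclass
class Item:
    """A single distilled fact, preference, or behaviour, traceable to its sources."""
    item_id: str
    tenant_id: str
    text: str
    source_resource_ids: list[str] = field(default_factory=list)


@dataclass
class Category:
    """An aggregated summary of related items."""
    category_id: str
    tenant_id: str
    summary: str
    item_ids: list[str] = field(default_factory=list)


def approx_tokens(text: str) -> int:
    """Crude whitespace token estimate, good enough to compare tiers."""
    return len(text.split())


resources = [
    Resource("r1", "t1", "User: we standardised on GraphQL for all new services last sprint."),
    Resource("r2", "t1", "User: please keep REST only for the legacy billing endpoints."),
]
items = [
    Item("i1", "t1", "Prefers GraphQL for new services", ["r1"]),
    Item("i2", "t1", "Keeps REST for legacy billing", ["r2"]),
]
category = Category("c1", "t1", "API style: GraphQL by default, REST for legacy billing.", ["i1", "i2"])

raw_cost = sum(approx_tokens(r.text) for r in resources)
category_cost = approx_tokens(category.summary)
print(raw_cost, category_cost)
```

Even in this toy example the category summary is smaller than the raw turns it stands in for, and the gap widens as more turns roll up into the same category.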

Primitive two — hybrid retrieval with adaptive stopping

Each tier in the hierarchy is searched by a hybrid pipeline that combines two complementary retrieval strategies. Dense retrieval uses embedding-based vector similarity to capture semantic meaning: synonyms, paraphrases, conceptual proximity. Sparse retrieval uses keyword matching over a full-text index to anchor exact terms, proper nouns, and technical vocabulary that embeddings sometimes blur. The two rank lists are fused via Reciprocal Rank Fusion (RRF), a rank-based combination that requires no score normalisation and stays stable when the underlying score distributions differ. A cross-encoder reranker refines the top of the fused list when precision matters more than recall.
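Reciprocal Rank Fusion itself is compact enough to show in full. The sketch below is a generic implementation, not FAOS code: each ranked list contributes 1 / (k + rank) to a document's fused score. The constant k = 60 is a commonly used default and an assumption here, and the memory ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of ids; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["m3", "m1", "m7"]   # hypothetical ids from vector search
sparse = ["m1", "m9", "m3"]  # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because only ranks enter the formula, the cosine similarities from the dense side and the full-text scores from the sparse side never need to be put on a common scale.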

Adaptive stopping sits on top of the hybrid pipeline. After retrieval from a tier, an LLM-graded sufficiency check evaluates whether the returned context is enough to ground the agent's next step. If yes, retrieval halts and the agent proceeds. If not, retrieval descends to the next tier. A short-circuit path also exists for queries that need no memory at all (greetings, acknowledgements, generic knowledge requests), which a lightweight pre-retrieval classifier filters out before any embedding is computed.

The design intent is that the typical query terminates at the category tier with a single cheap retrieval and an LLM sufficiency call that runs in tens of milliseconds; the worst-case query descends to resources; and the trivial query that needs no memory skips retrieval entirely.
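The control flow described above can be sketched as a short loop over tiers. Everything here is a stand-in: the retrievers are stubs, the sufficiency check is a keyword overlap rather than an LLM grader, and the pre-retrieval classifier is a fixed set of greetings. All function names are hypothetical.

```python
from typing import Callable, Optional

Retriever = Callable[[str], list[str]]
TIER_ORDER = ("category", "item", "resource")  # cheapest first


def retrieve_with_adaptive_stopping(
    query: str,
    retrievers: dict[str, Retriever],
    is_sufficient: Callable[[str, list[str]], bool],
    needs_memory: Callable[[str], bool],
) -> tuple[Optional[str], list[str]]:
    """Return (deepest tier consulted, context); the tier is None if retrieval was skipped."""
    if not needs_memory(query):
        return None, []  # short-circuit: no retrieval at all
    tier, context = TIER_ORDER[0], []
    for tier in TIER_ORDER:
        context = retrievers[tier](query)
        if is_sufficient(query, context):
            break  # stop at the cheapest tier that grounds the answer
    return tier, context


def needs_memory(query: str) -> bool:
    """Stub classifier: greetings and acknowledgements need no memory."""
    return query.strip().lower() not in {"hi", "hello", "thanks"}


def keyword_sufficiency(query: str, context: list[str]) -> bool:
    """Stub grader: sufficient if any longer query term appears in the context."""
    terms = {w for w in query.lower().split() if len(w) > 4}
    text = " ".join(context).lower()
    return any(t in text for t in terms)


retrievers = {
    "category": lambda q: ["Prefers GraphQL for new services."],
    "item": lambda q: ["Migration notes mention gRPC for internal calls."],
    "resource": lambda q: ["<raw turn transcript>"],
}

for q in ("thanks", "Does the user want GraphQL here?", "What did the migration notes cover?"):
    print(q, "->", retrieve_with_adaptive_stopping(q, retrievers, keyword_sufficiency, needs_memory)[0])
```

In this sketch the first query skips retrieval, the second stops at the category tier, and the third descends one level to items. A query that no tier satisfies ends at the resource tier with whatever context that tier returned.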

Primitive three — the regulated-industry lifecycle

The third primitive is what distinguishes Compounding Memory from a generic memory layer wrapped around a vector store. Four lifecycle mechanisms operate continuously on the corpus.

Importance scoring uses the Generative Agents formula — a weighted sum of recency, importance, and relevance — to assign every item a current score, which then determines retrieval ranking and lifecycle eligibility.

Decay reduces the current score exponentially over time. A half-life, configurable per tenant with a default of seven days, governs how quickly an unaccessed memory fades. Access resets recency, so frequently used memories persist; rarely used ones fade. Below a configurable threshold, an item is soft-deleted and eventually purged.
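A minimal sketch of the scoring and decay arithmetic follows. The equal weights, the function names, and the choice to apply decay as a multiplier on the combined score are assumptions made for illustration; the source does not specify how the two compose.

```python
def weighted_score(
    recency: float,
    importance: float,
    relevance: float,
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three components, in the style of the Generative Agents score."""
    w_rec, w_imp, w_rel = weights
    return w_rec * recency + w_imp * importance + w_rel * relevance


def decay_factor(days_since_access: float, half_life_days: float = 7.0) -> float:
    """Exponential decay: 1.0 at last access, 0.5 after one half-life."""
    return 0.5 ** (max(0.0, days_since_access) / half_life_days)


def is_soft_delete_candidate(
    score: float,
    days_since_access: float,
    threshold: float,
    half_life_days: float = 7.0,
) -> bool:
    """True when the decayed score has dropped below the tenant's threshold."""
    return score * decay_factor(days_since_access, half_life_days) < threshold


for days in (0, 7, 14, 21):
    print(days, decay_factor(days))
```

With the default half-life, a memory untouched for three weeks keeps one eighth of its score. An access sets days_since_access back to zero, which is how frequently used memories stay above the threshold.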

Consolidation runs as a scheduled job. Items within a category are clustered by semantic similarity and rolled into a category summary, shrinking the active corpus and lifting subsequent retrievals into the cheap tier.
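The clustering step can be illustrated with a simple greedy pass over item embeddings. This is one possible approach, not necessarily the one FAOS uses; the 0.8 similarity threshold and the two-dimensional toy vectors are assumptions, and the LLM call that would summarise each cluster is omitted.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_clusters(embeddings: dict[str, list[float]], threshold: float = 0.8) -> list[list[str]]:
    """Put each item in the first cluster whose seed is similar enough, else start a new one."""
    clusters: list[list[str]] = []
    for item_id, vec in embeddings.items():
        for cluster in clusters:
            if cosine(vec, embeddings[cluster[0]]) >= threshold:
                cluster.append(item_id)
                break
        else:
            clusters.append([item_id])
    return clusters


toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(greedy_clusters(toy))
```

Each resulting cluster would then be handed to the summarisation step to produce or refresh a category summary.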

Erasure is the lifecycle mechanism that matters most for regulated deployment. A per-record erasure API propagates a delete request through all derived artifacts: the original resource, the items extracted from it, the categories that summarised it, the embeddings, the cache entries. The erasure contract is formalised in an architecture decision record under CISO and CLO sign-off review, with a 24-hour SLA from request to verified erasure across the full derived chain and an audit log persisted to the immutable audit subsystem. The right to erasure is not bolted onto the memory subsystem; it is being built in as one of the subsystem's defining mechanisms before any regulated-tenant feature flag activates.
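The cascade is easier to reason about when provenance is stored explicitly. The sketch below uses an in-memory stand-in for the persistence layer; the class, field names, and audit event labels are hypothetical. It takes the conservative position that an item is erased if any of its source resources is erased, and that a category touching an erased item is dropped rather than edited.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """In-memory stand-in for the persistence layer, with explicit provenance links."""
    resources: dict[str, str] = field(default_factory=dict)
    item_sources: dict[str, set[str]] = field(default_factory=dict)      # item -> resource ids
    category_members: dict[str, set[str]] = field(default_factory=dict)  # category -> item ids
    embeddings: dict[str, list[float]] = field(default_factory=dict)     # keyed by record id
    cache: dict[str, str] = field(default_factory=dict)                  # keyed by record id
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def erase_resource(self, resource_id: str) -> None:
        """Erase a resource and every artifact derived from it, auditing each step."""
        items = {i for i, src in self.item_sources.items() if resource_id in src}
        categories = {c for c, members in self.category_members.items() if members & items}
        for record_id in sorted({resource_id} | items | categories):
            if self.embeddings.pop(record_id, None) is not None:
                self.audit_log.append(("embedding_erased", record_id))
            if self.cache.pop(record_id, None) is not None:
                self.audit_log.append(("cache_entry_erased", record_id))
        for item_id in sorted(items):
            del self.item_sources[item_id]
            self.audit_log.append(("item_erased", item_id))
        for category_id in sorted(categories):
            # The summary may paraphrase erased content, so it is dropped rather than
            # edited; a real system would regenerate it from the surviving items.
            del self.category_members[category_id]
            self.audit_log.append(("category_erased", category_id))
        if self.resources.pop(resource_id, None) is not None:
            self.audit_log.append(("resource_erased", resource_id))


store = MemoryStore(
    resources={"r1": "raw turn about API preference", "r2": "raw turn about deployment region"},
    item_sources={"i1": {"r1"}, "i2": {"r2"}},
    category_members={"c1": {"i1"}, "c2": {"i2"}},
    embeddings={"r1": [0.1], "i1": [0.2], "i2": [0.3]},
    cache={"c1": "cached summary"},
)
store.erase_resource("r1")
print(store.audit_log)
```

Because the derived set is computed from stored links rather than by inspecting content, the cascade is deterministic, and the audit log doubles as the evidence that each derived artifact was reached.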

Tenant isolation underwrites all four. PostgreSQL row-level security enforces tenant-scoped access at the database layer, not at the application layer. Cache keys are scoped under an explicit FAOS cache-key contract to a closed enumeration of platform | hot | warm | cold with tenant_id in the second slot, so a single typo cannot leak one tenant's memory to another. Background jobs operate under a service role with explicit policy bypass and audit logging; no convenient back doors exist for an application-layer shortcut.
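The cache-key contract can be illustrated with a small key builder. The source specifies only the closed enumeration and the position of tenant_id; the colon separator, the function name, and the validation rules below are assumptions.

```python
from enum import Enum


class CacheScope(str, Enum):
    """The closed enumeration of allowed first segments."""
    PLATFORM = "platform"
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def build_cache_key(scope: CacheScope, tenant_id: str, *parts: str) -> str:
    """Build '<scope>:<tenant_id>:<parts...>'; the tenant always occupies the second slot."""
    scope = CacheScope(scope)  # raises ValueError for anything outside the enumeration
    segments = [tenant_id, *parts]
    if any(not s or ":" in s for s in segments):
        raise ValueError("tenant_id and key parts must be non-empty and contain no ':'")
    return ":".join([scope.value, *segments])


print(build_cache_key(CacheScope.HOT, "tenant-a", "agent-7", "profile"))
```

Routing every cache access through one builder of this kind means a misspelt scope or an empty tenant_id fails loudly at key construction instead of silently reading from a shared namespace.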

The architectural commitment is that an audit committee, a regulator, or a tenant security review can trace any record from creation to consolidation to erasure, and that the trace is consumable as evidence rather than as a description.

What the architecture delivers

Compounding Memory is currently a pre-launch architecture. The FAOS production cluster exists and has been load-tested against synthetic workloads, but no regulated-industry tenant has yet been opted into the feature flag. Performance claims in this section are therefore reported at the level the evidence supports: design-time benchmarks against synthetic workloads, architectural properties enforced by the implementation, and a scheduled pilot opt-in.

Design-time performance benchmarks

Internal benchmarks against synthetic conversational workloads — representative of the volume and turn distribution expected from the anchor banking pilot — produced the following results.

  • Latency reduction from adaptive stopping. Comparing the adaptive-stopping pipeline against a baseline that always descends to the resource tier, average per-query latency drops by approximately an order of magnitude on conversational queries dominated by category-tier hits. The headline benchmark number from the 2026-01-14 internal whitepaper is a 91% latency reduction at the average; the same benchmark shows a P95 read latency in the sub-100ms range under the synthetic load profile. These numbers are reported as design-time measurements against a synthetic workload, not as production statistics, and the internal whitepaper labelled them as engineering targets rather than measured operational outcomes.

  • Token efficiency from consolidation and tiering. When category-tier retrieval is sufficient — which is designed to be the common case for repeat-user conversations — the token footprint per query drops by roughly an order of magnitude versus a baseline that injects the full conversation history. The 89–90% token-savings figure from the internal benchmarks is a function of the consolidation lift; it depends materially on the consolidation cadence and the category coverage, both of which are tenant-configurable.

  • Cache hit rate. The Redis cache fronting the persistence layer is designed for hit rates above 70% on read-heavy workloads typical of conversational agents. Internal load tests at 1,000 simulated concurrent users have produced hit rates in the 80%+ range. This is a property of the workload as well as the cache; production performance will depend on the actual access pattern.

These numbers should be read as targets the architecture has demonstrated under controlled conditions, not as commitments for arbitrary production workloads. The anchor banking pilot, scheduled to opt in during the coming weeks, is the first measurement point against real tenant traffic; numbers from that measurement will be released in a follow-up technical note once the pilot has stabilised.

Architectural properties

Independent of any specific performance number, four properties are enforced by the implementation.

Sub-linear per-turn cost. As a conversation accumulates and consolidation rolls items into categories, the token footprint of the next turn does not scale linearly with conversation length. The decoupling between conversation length and per-turn cost is what the word "compounding" in Compounding Memory means in operational terms.

Multi-tenant isolation. Every read and write is scoped by tenant_id at the row-level security layer and at the cache-key layer. The architecture forbids any code path that could observe data from a different tenant, regardless of an application-layer bug. This is enforced at the database, not at the application.

Per-record erasure. A delete request propagates to every derived artifact (embeddings, items, categories, cache entries) within the documented 24-hour SLA, with an audit record persisted on completion. This is the architectural property that allows the system to be deployed in GDPR-bound, HIPAA-bound, and SBV-bound workflows without a structural compliance gap.

Auditability. Every memory mutation, every consolidation, every decay event, and every erasure produces an audit-log entry through the immutable audit subsystem. An audit committee that asks "what happened to this user's data" has a structured answer, not a narrative one.

What the architecture does not yet claim

Four boundaries are worth naming explicitly.

The performance figures above are design-time, not production. The anchor banking pilot, scheduled in the coming weeks, is the first opportunity to measure against real tenant traffic, and the architecture team has committed to publishing the pilot results — including any deviation from the design-time numbers — in a follow-up technical note.

The erasure SLA is architectural intent, governed by an FAOS architecture decision record currently under CISO and CLO sign-off review, and slated for validation against synthetic test cases ahead of the regulated-industry pilot. The first production-grade validation will be the pilot's compliance gate review, with the compliance gate package scheduled ahead of the pilot opt-in.

The consolidation and decay cadences have sensible defaults, but the optimal settings depend on the workload. The architecture exposes them as tenant-level configuration precisely because the right cadence for a high-volume customer-support agent is not the right cadence for a low-volume compliance analyst.

And finally — a point worth surfacing rather than burying — the architecture is one expression of a set of design choices, not the only one. PostgreSQL with pgvector is the implementation of the persistence layer; a different team running similar requirements might choose a specialist vector database with a relational sidecar, or a graph database with embedding columns. The choices that genuinely distinguish Compounding Memory are upstream of the storage decision: the three-tier hierarchy, the adaptive-stopping pipeline, the lifecycle primitives, and the erasure-as-first-class-citizen posture. Those are the contributions worth borrowing or critiquing, independent of any specific database choice.

What this means for your enterprise

Two practical implications fall out of the architecture.

Architectural implication — memory is a lifecycle problem, not a retrieval problem

If you are evaluating an agent platform on the question "how does it handle memory," the question to ask is not "what vector database does it use." The relevant questions are different: How is the memory corpus consolidated over time? What is the decay policy? Can a per-record erasure request be honoured end-to-end within a measurable SLA? Is tenant isolation enforced at the database layer or the application layer? Is the cache key scheme structurally safe against a tenant-id typo?

Generic vector-database integrations rarely have crisp answers to these questions, because they were designed to be retrieval primitives rather than memory subsystems. The Compounding Memory architecture is one example of an answer; the more important point is that the questions have to be asked. An enterprise agent programme that adopts a memory layer without asking them inherits an unbounded compliance and cost surface.

Operational implication — the pilot starts on the regulated tenant

The Compounding Memory rollout sequence is deliberate. The anchor pilot tenant is an SBV-regulated commercial banking deployment, with opt-in scheduled in the coming weeks. The pattern then rolls forward to additional regulated tenants once the banking pilot has stabilised, with a target of broader availability after the pilot retrospective.

The choice to start with the regulated tenant inverts the more common pattern of "ship to easy customers first, harden later." The reasoning is straightforward: a memory subsystem that does not satisfy the SBV erasure obligations on day one will require structural rework after the first regulator inquiry, which is a far more expensive position than over-engineering for compliance upfront. The Compounding Memory design discipline — including the requirement that CISO and CLO sign-off on the erasure governance contract is a hard gate before any feature flag activates in production — is the operational expression of that choice.

For a CIO evaluating Compounding Memory, the implication is that the architecture has been hardened for the highest-bar deployment context. Standard B2B SaaS workflows are a subset of the design surface, not an extension of it.

When the approach is not the right tool

Three honest counter-cases.

Stateless single-turn agents. If your agent does not need to remember anything across turns — an internal tool that runs a single transformation, a one-shot enrichment service — Compounding Memory's lifecycle machinery is over-investment. A stateless prompt with retrieval is a simpler and cheaper fit.

Workflows with no regulatory surface. A marketing-copy drafter or an internal coding assistant operating on non-regulated workflows can use a generic memory layer and accept the compliance gap. The cost of authoring the policy primitives and operating the lifecycle is not free, and the value of those primitives is realised only when a regulator is in the conversation.

Vendor SaaS agents you do not operate. Compounding Memory is the memory subsystem of an agent platform you run. If you are consuming a third-party SaaS agent whose memory subsystem you do not control, the right move is to require the vendor's attestation against the equivalent properties — not to retrofit your own memory architecture on top of an opaque vendor system. The framework here gives you the language to ask the vendor for the right artifact.

Implementation guide

A representative adoption sequence has three phases. The phasing assumes a working FAOS deployment — agents in development or production, retrieval already plumbed, basic observability in place.

Phase one — instrument one workflow (first 90 days)

The right starting workflow has three properties. It is conversational (memory matters most where conversation length grows). It has a clearly bounded domain (so the ontology grounding for the consolidation pass is finite). And it has a domain owner — head of compliance, head of customer experience, head of analytics — who can authorise the policy choices on decay half-life, consolidation cadence, and retention defaults.

A representative phase-one scope: one agent, one tenant, the default lifecycle policies, a baseline measurement of P95 latency and per-turn token cost without Compounding Memory, and the same measurement with Compounding Memory enabled. Success at 90 days looks like: the per-turn cost has dropped, the P95 latency has improved or stayed flat, no cross-tenant leakage has been observed in the audit log, and a synthetic per-record erasure request has been honoured end-to-end within the documented 24-hour SLA.

Phase two — extend coverage and validate compliance (next 90 days)

Two parallel tracks.

Track A — coverage extension. Add the next two or three agents in the same tenant. Each should reuse the same memory backend, the same lifecycle policies, and the same audit infrastructure. The marginal cost of onboarding an additional agent should be near-zero in terms of infrastructure; the only per-agent investment is in the consolidation schedules and the importance-scoring weights, which often inherit cleanly from the first agent's defaults.

Track B — compliance validation. Run a structured per-record erasure drill with the CISO or equivalent: pick five user identifiers at random, request erasure, verify that the erasure has propagated through all derived artifacts within the SLA, and persist the audit trail. The drill is the bridge between architectural intent and a regulator-defensible posture.

Phase three — operationalise (next 6 months)

Two pieces of work that distinguish a research-grade adoption from a production-grade one.

Memory policy registry. Decide who owns the per-tenant policies (decay half-life, consolidation cadence, retention floors, erasure escalation paths). For most enterprises ownership is joint: the head of AI for the technical defaults, the data-protection officer for the compliance defaults. The registry is the artifact the audit committee inspects.

Drift and quality monitoring. Track the retrieval quality, the consolidation lift, and the erasure SLA in dashboards alongside conventional latency and error metrics. Compounding Memory is a system that accumulates state, and like all stateful systems, the failure modes shift over time. Monitoring is what catches the slow drift before it becomes a tenant-visible incident.

Twelve months in, a successful adoption is operating Compounding Memory across multiple regulated agents in multiple tenants, with a policy registry consumed by the audit function, with documented erasure drills, and with a measured per-turn cost curve that flattens as conversation length grows. The architecture is no longer a research artifact; it is operating rhythm.

Frequently asked questions

Doesn't a long-context model just solve this? A long-context model expands what the agent can attend to. It does not change the per-turn cost, the lifecycle, or the compliance surface. Stuffing the conversation history into a long context still scales linearly in cost per turn. Compounding Memory is what changes the slope of that line, and it is what provides the lifecycle and audit primitives that long context alone does not.

Why PostgreSQL and pgvector instead of a dedicated vector database? Three reasons. Unified storage of relational, vector, and audit data in one system eliminates a class of synchronisation bugs. Native ACID transactions support consistent memory updates. PostgreSQL row-level security is a mature primitive for multi-tenant isolation that dedicated vector databases either approximate or omit. The architectural trade-off is a small loss in pure vector-search performance against a specialist database; the cache layer mitigates that loss for read-heavy workloads, and the operational simplicity is worth the trade for regulated deployments.

How does the erasure mechanism handle embeddings? Embeddings are derived artifacts, and the erasure governance design treats them as such. When a record is erased, the embeddings derived from it are erased in the same transaction. The same applies to extracted items, category summaries that referenced the record, and cache entries. The audit log records every step. The 24-hour SLA is the worst-case bound for the propagation; the design specifies provenance tracking so the cascade is decidable without LLM re-inspection, and the contract is being demonstrated against synthetic erasure drills ahead of the regulated-tenant pilot.

Can we configure the decay and consolidation policies per workflow? Yes. Both are tenant-configurable parameters with safe defaults. A high-volume customer-support agent typically benefits from a shorter half-life and a more aggressive consolidation cadence; a low-volume compliance analyst typically benefits from longer retention and slower consolidation. The policy registry surfaces the active settings and the change history.

What's the integration story with LangChain or LangGraph? The FAOS memory manager exposes an async facade (get_agent_memory, add_conversation, search_memories, consolidate_memories) that is straightforward to adapt to the LangChain BaseChatMessageHistory contract via a thin wrapper, so an existing LangChain-based agent can adopt Compounding Memory without restructuring its conversation chain. The same facade is consumed by LangGraph nodes through the standard memory contract. The reference adapter pattern is documented internally; integration is configuration plus a thin adapter class, not a rewrite of agent code.

Does this replace post-deployment monitoring? No. The architecture provides the audit trail and the lifecycle primitives. Post-deployment monitoring detects drift, anomalies, and quality regressions. Both are required; neither substitutes for the other. The memory subsystem's auditability is what gives the monitoring layer something structured to monitor against.

What to do next

Two paths forward, depending on where you are.

Read the companion paper on pre-deployment verification. Compounding Memory operates inside the broader FAOS architecture for trusted enterprise agents. The pre-deployment verification framework — the operational envelope, ontology-to-scenario generation, and the trust certificate — defines how those agents are certified before they enter production. The RA-6 whitepaper covers it.

See the companion whitepapers: RA-3 — Neurosymbolic Enterprise AI (the grounding architecture that anchors both verification and memory), RA-6 — Pre-deployment Verification, and RA-12 — Entropy-Guided Ontology Design.

Discuss application to your stack. If you are evaluating Compounding Memory for a specific regulated workflow — banking, insurance, healthcare, or a non-English jurisdiction — book a working session with the FAOS team. We are happy to walk through your memory requirements, the policy primitives the lifecycle exposes, and a candidate 90-day pilot scope.
