The Consolidation Queue: How Persistent AI Earns Durable Change
A proposal for the consolidation queue: a probationary layer between new information and durable change in persistent AI systems.
In the last post, I argued that a persistent assistant should be easy to inform and hard to permanently rewrite. That immediately raises an implementation problem: what sits between "I just learned this" and "this should now shape my behavior"?
I think most systems are missing a layer that I'm calling the consolidation queue. It's a probationary place where candidate updates can wait before earning durable change. The queue is meant to solve a specific control problem. What enters? What state does it carry? What affects the next answer right away? What earns promotion into something more durable? What expires? What gets escalated?
It also has to earn its place against the strong baseline of in-context learning. ICL is effectively zero-cost promotion with zero durability, zero reversibility, and weak cross-session coherence. A consolidation queue only makes sense if you want more than that.
The Queue as a Control Object
The best high-level prior here is still Complementary Learning Systems. Intelligent systems seem to need both a fast store for new specific experiences and a slower store that integrates structure over time. In biology, that often gets framed as hippocampal episodic capture paired with slower cortical integration; in machine learning, the same stability-plasticity problem shows up as catastrophic forgetting, which is why regularization methods like Elastic Weight Consolidation exist at all. CLS original paper, CLS updated review, EWC
The queue is the transfer mechanism. A candidate update should not be a loose string in memory but a tracked object with state.
```python
from dataclasses import dataclass

@dataclass
class CandidateUpdate:
    claim: str                        # canonical formulation of the claim
    claim_type: str                   # factual | preference | procedural | identity | policy
    source: str                       # provenance: who or what asserted it
    timestamp: float
    scope: str                        # user | project | org | global
    stakes: str                       # low | medium | high
    urgency: str                      # low | medium | high
    verification_state: str = "PENDING"
    recurrence_count: int = 1
    interference_risk: float = 0.0
    decay_timer: float | None = None
```
The exact schema matters less than the principle. A candidate update should have identity, provenance, and a path through the system.
There is a hidden assumption inside that object: identity. Recurrence, corroboration, and promotion all depend on knowing when two formulations are the same underlying claim. “Netflix bought Warner Bros.” on Monday and “the Netflix–Warner deal” on Tuesday should usually increment the same candidate, not create parallel queue items that never accumulate enough evidence. In practice, that means canonicalization. It's some mix of embeddings, coreference, entailment checks, and often an LLM judge. Systems like A-MEM and production memory tools such as Zep spend real engineering effort here for a reason. Without canonicalization, the queue fragments from above and gets gamed by paraphrase from below. A-MEM, Zep docs
The canonicalization step also has to preserve, not flatten, provenance records, because the same object is later expected to support inference-time hedging, attribution, and reversibility. In practice, this almost certainly has to split into a cheap online approximation and a deeper offline pass. Otherwise identity resolution becomes a latency bottleneck before the queue ever has a chance to help.
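As a sketch of what that cheap online pass might look like, here is a deliberately crude version that uses lexical similarity as a stand-in for the real embedding, coreference, and entailment machinery. The names (`ingest`, `similar`) and the threshold are invented for illustration, not any system's API.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    """Crude lexical proxy for 'same underlying claim'."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def ingest(queue: list[dict], claim: str, source: str) -> dict:
    """Merge a paraphrase into an existing candidate, or create a new one.

    Provenance is appended, never flattened, so later hedging,
    attribution, and reversibility still have the full record."""
    for cand in queue:
        if similar(cand["claim"], claim):
            cand["recurrence_count"] += 1
            cand["provenance"].append({"claim": claim, "source": source})
            return cand
    cand = {"claim": claim, "recurrence_count": 1,
            "provenance": [{"claim": claim, "source": source}]}
    queue.append(cand)
    return cand
```

The deeper offline pass would re-cluster with stronger tools; the online pass only has to be good enough to keep obvious paraphrases from fragmenting the queue.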
The queue becomes useful when it enforces state transitions. At minimum, I think you want something like:
PENDING -> {CORROBORATED, CONTESTED, MIXED} -> {PROMOTED, EXPIRED}
This way promotion and expiration are explicit. Demotion should probably exist too, even if a first version hides it. The queue is where the system decides what kind of thing an experience is becoming.
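Made concrete, that transition discipline is just a small table plus an enforcement function. This is a minimal sketch: I assume a direct PENDING → EXPIRED path for claims that simply decay out, and reserve a DEMOTED state even though a first version would hide it.

```python
# Legal state transitions for queue items. Enforcing them explicitly is
# the point: promotion and expiration become auditable events instead of
# silent side effects.
TRANSITIONS = {
    "PENDING":      {"CORROBORATED", "CONTESTED", "MIXED", "EXPIRED"},
    "CORROBORATED": {"PROMOTED", "EXPIRED", "MIXED"},
    "CONTESTED":    {"EXPIRED", "MIXED"},
    "MIXED":        {"PROMOTED", "EXPIRED", "CORROBORATED", "CONTESTED"},
    "PROMOTED":     {"DEMOTED"},   # demotion: hidden in v1, but reserved
    "EXPIRED":      set(),
    "DEMOTED":      set(),
}

def transition(candidate: dict, new_state: str, reason: str) -> None:
    old = candidate["state"]
    if new_state not in TRANSITIONS[old]:
        raise ValueError(f"illegal transition {old} -> {new_state}")
    candidate["state"] = new_state
    # Every transition is logged with a reason; this is the audit trail
    # that governance later depends on.
    candidate.setdefault("history", []).append((old, new_state, reason))
```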
That direction is already visible in current systems. MemGPT explicitly separates memory tiers and moves information between them. A-MEM moves beyond raw transcript storage toward structured, linked notes with contextual attributes. Memory-R1 goes further and learns memory operations directly, training a memory manager to choose among add, update, delete, and no-op actions instead of relying only on static heuristics. MemGPT, A-MEM, Memory-R1
The Three Questions
The structure gets much clearer once you pull apart what most systems blur together. I think there are three questions here:
- Should this affect the next answer?
- Should this be stored as an episode?
- Should this become durable?
These questions sound more similar than they are. Q1 is the pending-retrieval policy, Q2 is whether the experience is also stored as a retrievable episode in parallel, and Q3 is the promote-or-expire gate.
One place memory systems get especially fuzzy is a kind of Schrödinger's memory: a claim is in probation, but nobody says what the model should do with it right now.
Items in PENDING should usually be available to inference, but only as tentative state with provenance attached. Otherwise the queue becomes either too weak to matter or too risky to trust. “Available to inference” does not mean “always injected into the prompt.” Pending items need their own retrieval policy — relevance, scope, urgency, and recency should decide what makes it onto the hot path at all — otherwise the queue becomes an inference tax on every answer.
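A minimal version of that pending-retrieval policy might look like the following, assuming a relevance scorer lives elsewhere (embedding similarity, say) and with invented weights:

```python
def retrievable_pending(queue, relevance, now,
                        max_items=3, min_relevance=0.5):
    """Select which PENDING items reach the hot path for this query.

    `relevance` is an assumed external scorer returning 0..1; the
    weighting of recency and urgency here is illustrative."""
    urgency_w = {"low": 0.0, "medium": 0.5, "high": 1.0}
    scored = []
    for c in queue:
        if c["state"] != "PENDING":
            continue
        rel = relevance(c)
        if rel < min_relevance:
            continue
        recency = 1.0 / (1.0 + (now - c["timestamp"]) / 86400)
        scored.append((rel + recency + urgency_w[c["urgency"]], c))
    scored.sort(key=lambda t: t[0], reverse=True)
    # Cap the hot path: the queue must not become an inference tax.
    return [c for _, c in scored[:max_items]]
```

Everything this returns is meant to be surfaced tentatively, with provenance attached, never as settled fact.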
Take a factual claim like “Netflix acquired Warner Bros. last month.” At inference time, a queued claim like this should be surfaced tentatively, with provenance attached, not as settled fact.
“You recently told me Netflix acquired Warner Bros., but I haven’t verified that yet.”
This visible form of hesitation is mostly for factual or externally verifiable claims. Preference-like instructions are different. If a user says “Use Python” or “Be more terse,” the right default is usually immediate adoption at a narrow scope, not a display of epistemic suspense. The queue still matters, but more as scoped routing and later confirmation than as visible skepticism.
That hesitation does not have to live only in the model’s prose. It can also be surfaced in the product. A pending claim might appear with a small status marker, a provenance chip, or an expandable note showing where it came from and whether it has been verified. The important thing is that the user experiences the state as tentative but remembered, not as either silent forgetting or false confidence. In other words, the backend queue should usually have a frontend counterpart.
A queue only matters if it can be drained. Otherwise it is just a backlog.
There are only two real exits:
- promotion
- expiration
Promotion means the update earned a move into a more durable layer. Expiration means it did not recur, did not verify, did not get used, or simply did not justify taking up permanent oxygen. Promotion is the hard part: the real problem is deciding when noisy, partial, potentially dependent evidence has crossed the line into justified change.
Scoring Promotion
An explicit, if crude, first version:

score(c) = w_R · R(c) + w_V · V(c) + w_U · U(c) - w_I · I(c) - w_D · D(c)

where R captures recurrence, V verification, U calibrated uncertainty or self-consistency, I interference risk, and D staleness.

Then:

promote(c → d) if score(c) > θ_d
The important part here is that the threshold depends on the destination. Promoting a candidate into a retrievable text store should be cheaper than promoting it into an adapter, and vastly cheaper than writing it into a parametric layer.
The weighted sum is probably wrong even as a scaffold. High recurrence with low verification can be a warning sign, not reassurance. Low staleness with high interference is a different risk profile than the reverse. In practice the real question is whether the scorer is a tree of gated conditionals, a small learned policy, or a hybrid. The functional form is itself part of the research problem.
I would actually treat the linear equation less as a literal decision rule than as a first-pass objective. If you are building a learned memory policy, this kind of score is more plausibly a reward proxy or utility signal than a hardcoded promotion heuristic. In that frame, the signs and terms still matter, but the decision boundary is something the policy learns rather than something the designer pretends to know upfront.
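Even as a scaffold, the crude linear version is easy to write down as a baseline. Everything below is illustrative: the weights are invented, and the destination thresholds only encode the ordering argued for above, with the retrieval store cheapest and parametric change dearest.

```python
# Invented weights for the linear first-pass score.
WEIGHTS = {"R": 0.3, "V": 0.3, "U": 0.2, "I": 0.4, "D": 0.2}

# Destination-conditional thresholds: cheaper destinations, lower bar.
THRESHOLDS = {
    "retrieval_store": 0.3,
    "scoped_durable":  0.5,
    "adapter":         0.8,
    "parametric":      0.95,
}

def score(c: dict) -> float:
    """Linear scaffold: evidence terms add, risk terms subtract."""
    return (WEIGHTS["R"] * c["R"] + WEIGHTS["V"] * c["V"]
            + WEIGHTS["U"] * c["U"]
            - WEIGHTS["I"] * c["I"] - WEIGHTS["D"] * c["D"])

def promotion_decision(c: dict) -> list[str]:
    """Return every destination this candidate currently clears."""
    s = score(c)
    return [dest for dest, theta in THRESHOLDS.items() if s > theta]
```

If the scorer becomes a learned policy, this function is better read as the initial reward proxy than as the decision rule itself.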
Signals the Scorer Needs
I would also be more specific about the uncertainty term than most memory-system sketches are. For U(c), the interesting signal is not generic confidence or source reputation; provenance already lives in the object. The better signal is calibrated uncertainty at the semantic level: whether the model's answers remain stable across resampling once you collapse away superficial wording differences. That is the appeal of semantic uncertainty and semantic entropy, and it is exactly the kind of signal that seems underused in memory policies today. Anthropic's Language models (mostly) know what they know points in the same direction: self-evaluation is imperfect, but not useless.
Interference matters differently depending on destination. Retrieval-store interference and parametric interference are not the same phenomenon. One is about ranking pollution and context competition; the other is about overlap with existing learned behavior. The threshold is already destination-conditional, and interference should be too. In practice, parametric interference is only partly predictable ex ante; much of it is estimated through proxies, with the real answer only emerging after targeted evaluation.
claim_type is also doing more work than the sketch admits. A useful spine here is something like: factual world-state, user preference, procedural knowledge, identity or profile attribute, and policy or constraint. Verification strategies do not generalize across those types. You cannot externally verify a preference the way you verify a world fact, and you cannot “corroborate” an identity attribute the same way you corroborate a transaction or acquisition. The threshold function only starts to feel real once the type system is made explicit.
Memory-R1 is one of the clearest recent examples of why this should be treated as a policy problem rather than a pile of heuristics. It learns memory operations with RL and shows that memory-management policy itself is worth optimizing directly. That does not solve promotion in general, but it is a strong sign that the decision layer belongs in the design, not outside it.
The reason that matters is feedback. "Usage" is only a useful signal if you can tell whether retrieving a queued item actually helped. In practice that is messy: non-corrections are confounded, task success is noisy, explicit preference signals are expensive, and the clearest supervision often arrives long after the promotion decision itself. Memory-R1 leans on RL because credit assignment here is genuinely hard. The analogy to Prioritized Experience Replay is structural rather than literal: in both cases the system learns or assigns which items deserve scarce update attention, and in both cases badly designed priority signals can be gamed.
A learned promotion policy also has a cold-start problem. One plausible bootstrap is synthetic offline supervision — replayed transcripts where an oracle LLM or human labeler marks promotion-worthy moments — before the system graduates to noisier outcome-based learning.
Verification itself is not exogenous and free. It is a budget decision and a potential attack surface. A system that auto-verifies everything in PENDING can be forced into wasteful retrieval work or steered toward poisoned sources. That means the queue needs thresholds not just for promotion, but for when a claim is worth spending verification budget on at all.
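A minimal sketch of that verification gate, with invented weights and a hypothetical `should_verify` helper:

```python
def should_verify(c: dict, budget_remaining: float,
                  cost: float = 1.0, min_priority: float = 0.3) -> bool:
    """Decide whether a PENDING claim is worth spending verification on."""
    if budget_remaining < cost or c["verification_state"] != "PENDING":
        return False
    stakes = {"low": 0.2, "medium": 0.5, "high": 1.0}[c["stakes"]]
    # Recurring, high-stakes, unverified claims get checked first;
    # one-off low-stakes claims wait, which also blunts attempts to
    # flood the verifier into wasteful or poisoned retrieval work.
    priority = stakes * min(c["recurrence_count"], 5) / 5
    return priority >= min_priority
```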
A Full Example
Take the Netflix example again and run the full lifecycle. It is a useful way to map the three questions back onto the mechanism: Q1 → the hedge, Q2 → the episode, Q3 → the threshold.
The user says:
“Netflix acquired Warner Bros. last month.”
The claim enters the queue tagged with:
- source = user
- scope = likely global-fact candidate
- verification = pending
- recurrence_count = 1
- urgency = low
- interference_risk = medium
- decay_timer = set
If the user asks again the next day, the system should not assert the acquisition as settled fact. It should surface it with provenance and uncertainty attached.
Meanwhile, retrieval checks news sources. Two outlets confirm. One contradicts. Verification shifts from PENDING to MIXED. Of course, this only helps if those sources are genuinely independent and reasonably trustworthy, which is exactly why verification is useful but not a truth oracle. LoCoMo is a good reminder that even with long context and retrieval, models still struggle with long-range temporal and causal reasoning in dialogue.
If corroboration converges over time, the claim may cross a promotion threshold and move into a more durable layer. If nothing reinforces it, it decays out. The key bit here is the system does not end up permanently remembering being told something false, but it also does not throw away potentially useful information too early.
This is also why many memory bugs are really promotion bugs. Over-trusting a user’s mistaken correction, carrying one project’s convention into another, keeping a stale preference around too long, or letting a local adaptation damage broader behavior all look like different failures in product. Underneath, they are often the same failure. Either something got promoted too quickly, too broadly, or for too long.
Expiration also needs to be a real policy, not just a default timeout. Different claim types should decay differently, and use, reinforcement, and queue pressure should all matter. Otherwise the queue turns into a slowly growing archive of indecision.
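One way to make decay type-aware is per-type half-lives with refresh-on-use. The half-lives below are invented; the shape is the point: different claim types should not share one timeout.

```python
# Invented half-lives per claim type, in days.
HALF_LIFE_DAYS = {
    "factual":    14,    # world facts either corroborate quickly or rot
    "preference": 90,    # preferences are sticky but not permanent
    "procedural": 60,
    "identity":   180,
    "policy":     365,
}

def retention(c: dict, now: float) -> float:
    """Exponential decay since last use; any use resets the clock."""
    days = (now - c["last_used"]) / 86400
    base = 0.5 ** (days / HALF_LIFE_DAYS[c["claim_type"]])
    # Recurrence slows decay: claims that keep coming back earn patience.
    return min(1.0, base * (1 + 0.1 * (c["recurrence_count"] - 1)))

def should_expire(c: dict, now: float,
                  floor: float = 0.1, queue_pressure: float = 0.0) -> bool:
    # Queue pressure raises the floor: a crowded queue expires faster.
    return retention(c, now) < floor + 0.2 * queue_pressure
```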
Where Promotions Land
Promotion is also not a single destination. This is where memory writing gets sloppy.
A useful implementation should distinguish at least four destinations:
- episodic memory for retrievable events with provenance
- scoped durable memory for user-, project-, or org-specific patterns
- temporary override layers for urgent narrow corrections
- rare parametric or policy-level updates for broad, repeated, well-validated changes
That last one should be rare.
These destination classes are semantic, not necessarily physical. Episodic or scoped durable memory may live in a text or structured store; a temporary override may be implemented as a narrow policy layer; only the rarest promoted items should reach adapters or parametric updates. If “adapter promotion” is on the table, it is worth naming the concrete family here: parameter-efficient fine-tuning methods like LoRA and IA³ are what make that destination practical in the first place.
If a promoted update is headed toward a parametric destination, the queue still needs a data-compilation step. Candidate updates do not become gradients directly. They have to be turned into supervised examples, contrastive pairs, or preference data before a LoRA-style update can happen. In practice, that means the offline consolidation loop is also doing orchestration: formatting promoted candidates into SFT-style records, DPO-style comparisons, or other trainable artifacts.
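A sketch of that compilation step, assuming illustrative record shapes rather than any library's actual schema:

```python
def compile_for_training(promoted: list[dict]) -> dict:
    """Format promoted candidates into trainable artifacts.

    Preferences compile naturally into DPO-style comparisons; facts and
    procedures compile into SFT-style supervised pairs. Field names here
    are invented for illustration."""
    sft, dpo = [], []
    for c in promoted:
        if c["claim_type"] == "preference":
            dpo.append({
                "prompt": c["context"],
                "chosen": c["compliant_response"],
                "rejected": c["noncompliant_response"],
                "provenance": c["provenance"],   # kept for auditability
            })
        else:
            sft.append({
                "prompt": c["context"],
                "completion": c["claim"],
                "provenance": c["provenance"],
            })
    return {"sft": sft, "dpo": dpo}
```

Keeping provenance on every record is deliberate: if a parametric update later has to be unwound, the training artifacts are the first place to look.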
This is where model editing becomes relevant but also dangerous. ROME and MEMIT show that targeted parametric edits are possible, but they also show why direct weight updates should not be the default landing zone for routine memory. More recent lifelong editing work like MEMOIR makes the same point from another angle. Sequential edits need explicit overwrite control, or they start stepping on each other.
So the real policy is not just promote or not.
It is promote to where, under what evidence, with what reversibility?
Online and Offline Consolidation
Another implementation instinct I feel pretty strongly about: most promotion should happen off the hot path.
Online, the system should use queued information tentatively. Offline, it should decide what deserves consolidation.
That split is already visible in current work. Sleep-Time Compute argues that some useful reasoning can be amortized between user queries rather than forced onto the inference path. LightMem makes the same instinct explicit in memory form, using a sleep-time update procedure that decouples long-term consolidation from online inference.
The rhythm that makes sense to me:
online: tentative with provenance
offline: decisive with audit
That is both computationally cleaner and easier to govern.
Hesitation is the default, but it is not dogma. Some updates are dangerous to delay: a safety-critical correction, a live vulnerability, a newly contraindicated interaction. In those cases, urgency changes both the threshold and the destination.
Governance
Governance is part of the design too. If the queue controls durable change, it becomes a target. That means every promotion should be attributable, every durable update should be reversible, every substrate should be inspectable, and contradictory promoted updates should be reconcilable. The queue is also part of the security boundary. Persistent agents are vulnerable not just to bad information, but to memory poisoning — attempts to smuggle malicious instructions into durable state through user prompts, retrieved documents, or injected web content. Scoped durable memory and temporary overrides are especially attractive targets, which is another reason durable promotion needs provenance, reversibility, and strong thresholds. SSGM
Demotion matters here too. Easy to inform, hard to permanently rewrite has a Wario: easy to promote, hard to demote. Applying a change cleanly is often easier than unwinding it cleanly, especially once it has propagated across substrates or been written into parametric behavior. Reversibility is the design goal and demotion is the hard operation that forces you to take that goal seriously.
Logging is only the lowest baseline for governance. The system also has to perform adjudication across substrates. If a preference lands in a structured policy store, a factual correction lands in retrievable memory, and a behavioral shift lands in an adapter, the system still needs a way to reconcile conflicts rather than pretending the layers cannot disagree. Current systems mostly handle that with prompt ordering ("put this rule higher in the system message than that retrieved note") and similar brittle control-plane tricks.
What’s Still Open
The inference-time control plane is not a solved interface. When episodic context, structured memory, temporary overrides, and parametric tendencies disagree, something still has to rank them before generation begins.
The sketch here is also implicitly single-user. In real systems, contradictory preferences across users and cross-user evidence flow make the control loop much messier. Should User B asserting the same global-scope claim strengthen User A’s pending item? Are verification signals pooled? Those are real design questions, but they are out of scope for this version. A scalable enterprise version of the queue would almost certainly need tenant-aware provenance. So it's not just who said something, but what org, team, role, and workspace they were speaking from.
The hardest part is still the scorer.
A learned promotion policy can drift, be gamed, or learn the wrong shortcut. Retrieval-based verification can be poisoned. Multiple corroborating sources may not be independent. Different durable substrates can disagree with each other. Long-term evaluation is still immature enough that we do not yet have one benchmark that really captures all of this. LoCoMo is useful, but it is not the end state.
What is still missing empirically is a clean benchmark story comparing eager-write systems to staged-promotion systems under realistic long-horizon workloads. One obvious candidate would be reversibility under forced contradiction: promote a plausible claim, then inject stronger contradictory evidence and measure whether the system unwinds the earlier update quickly, cleanly, and without collateral damage. The architectural case is getting clearer faster than the comparative evidence.
For now, the best working heuristic seems to be:
- treat candidate updates as tracked objects
- separate inference-time use from durable promotion
- score promotion against evidence, uncertainty, interference, staleness, and destination
- do most consolidation offline
- make all of it reversible enough to survive being wrong
Episodic memory remembers what happened. The consolidation queue holds what might matter. Promotion decides whether it does.
The most important point of the consolidation queue is that it gives the problem a control surface, not that it necessarily solves persistent memory. It turns “memory” from a vague product complaint into an object engineers can build, instrument, and argue about.