Abstract
We present a consensus-based filtering mechanism for LLM agent decision replay systems that reduces false positive replay rates from 17.1% to 2.5% across 1,000 multi-turn conversations. The system requires unanimous agreement among the top-K vector similarity candidates before committing to a cached replay, leveraging the implicit probability distribution in the vector index as a confidence signal. We evaluate four configurations across 4,000 total benchmark conversations and conclude that strict unanimous consensus (K=3, minScore=0.80) provides the optimal false-positive-to-recall tradeoff for safety-critical agent deployments, with recall expected to improve naturally as index density grows.
1. Introduction
1.1 Problem Statement
LLM agent systems that cache and replay previous decisions face a fundamental safety challenge: false positive replays — situations where the system incorrectly replays a cached action for a novel or semantically different request. In domains like financial trading, healthcare, or infrastructure management, a single false positive (e.g., replaying a "buy AAPL" decision when the user said "sell AAPL") can have catastrophic consequences.
1.2 System Architecture
Decyra is a proxy layer for LLM API calls that intercepts agent requests, generates semantic embeddings of the current "situation" (user intent, chain position, tool context), and queries a Cloudflare Vectorize index for similar past situations. When a high-confidence match is found, the system can either replay the cached response directly or use "guided replay" — a compressed prompt sent to the same LLM with the cached action pattern as a template.
1.3 Prior Work
Before this study, the Decyra proxy employed five filtering rules:
- Chain position filtering (±1 step tolerance)
- Outcome-action sequence validation (prevent mismatched action types)
- prevResultType trajectory matching (soft signal for chain history)
- Intent contradiction detection (verb-based hard rejection: buy↔sell, create↔delete, etc.)
- Jaccard situation text verification (token overlap gate at 0.60 minimum)
These rules reduced obvious false positives but still allowed a 17.1% FP rate in our 1,000-conversation benchmark, primarily from "near-miss" scenarios where vector embeddings scored high but the underlying intent differed.
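The Jaccard situation text gate (rule 5) can be sketched in a few lines; the function names below are illustrative, not Decyra's actual implementation.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two situation texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def passes_jaccard_gate(query: str, cached: str, min_overlap: float = 0.60) -> bool:
    # Rule 5: reject a candidate whose token overlap falls below the gate
    return jaccard(query, cached) >= min_overlap
```

Note why this gate alone is insufficient: a structural near-miss such as "what is aapl trading at" vs "what is tsla trading at" shares 4 of 6 tokens (Jaccard ≈ 0.67) and clears the 0.60 gate.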
2. Hypothesis
H1: Requiring unanimous agreement among the top-K highest-scoring vector candidates on the predicted action (tool name or completion type) will significantly reduce false positive replays, because near-miss scenarios are unlikely to produce K concordant high-scoring candidates for the wrong action.
H2: Stricter consensus requirements will reduce recall (valid replays missed), but this tradeoff is acceptable when biasing toward zero false positives. Recall will improve naturally as the vector index accumulates more situations, providing more candidates for consensus.
H3: Tiered consensus — relaxing the K requirement for very high-confidence matches — can recover recall without proportionally increasing false positives.
3. Methodology
3.1 Benchmark Design
Each benchmark run consists of 1,000 multi-turn conversations organized into 5 rounds of 200 base conversations. Each round contains a deterministic mix of five categories:
| Category | Count/Round | Purpose | Expected Decision |
|---|---|---|---|
| A: Replay Targets | 60 | Conversations that SHOULD trigger replay — price lookups, buy recommendations, risk assessments, portfolio rebalancing | Replay |
| B: Paraphrased Variants | 30 | Same intent as A but different wording — casual, formal, terse, verbose, question styles | Replay |
| C: Near-Miss Distractors | 30 | Similar to A but require DIFFERENT tool chains — sell vs buy, different quantities, ambiguous requests | Forward |
| D: Novel/Complex | 50 | 250 unique prompts — multi-condition, hypothetical, multi-entity, strategy queries | Forward |
| E: Noise/Off-Topic | 30 | Wrong domain, edge cases, partial/ambiguous inputs | Forward |
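The deterministic per-round mix above can be expressed as a small generator; the category keys and record shape are illustrative, not the benchmark harness's actual schema.

```python
# Per-round category mix (counts sum to 200)
ROUND_MIX = {
    "replay_target": 60,  # A: should replay
    "paraphrase": 30,     # B: should replay
    "near_miss": 30,      # C: should forward
    "novel": 50,          # D: should forward
    "noise": 30,          # E: should forward
}
EXPECTED = {"replay_target": "replay", "paraphrase": "replay",
            "near_miss": "forward", "novel": "forward", "noise": "forward"}

def build_round(round_idx: int) -> list:
    """Build one round of 200 labeled conversations."""
    return [
        {"id": f"r{round_idx}-{cat}-{i}", "category": cat, "expected": EXPECTED[cat]}
        for cat, count in ROUND_MIX.items()
        for i in range(count)
    ]
```

Five such rounds yield the 1,000 conversations per benchmark run, with 450 should-replay and 550 should-forward ground-truth labels.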
3.2 Test Infrastructure
- Model: gpt-4o-mini (temperature=0, max_tokens=1024)
- Embedding: OpenAI text-embedding-3-large (1536 dimensions via Matryoshka reduction)
- Tools: 8 deterministic mock tools (get_stock_price, get_stock_history, analyze_sentiment, get_portfolio, execute_trade, get_market_summary, calculate_risk, get_earnings)
- Tickers: 10 (AAPL, MSFT, TSLA, NVDA, GOOGL, AMZN, META, JPM, V, UNH)
- Infrastructure: Cloudflare Workers + Vectorize + KV
- Vector Candidates: topK=10 returned from Vectorize per query
3.3 Benchmark Process
- Agent A (baseline): All 1,000 conversations run directly against OpenAI API
- Agent B (Decyra proxy): Run round-by-round against the Decyra proxy
- Round 1: Cold start — all cache misses, building the Vectorize index
- Rounds 2–5: Repeated patterns should hit cache with increasing frequency
- 60-second wait between rounds for Vectorize eventual consistency
- Analysis: Compare token usage, cache hit rates, and classify each replay decision against ground truth
3.4 Confusion Matrix Definition
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | True Positive (TP) | False Negative (FN) |
| Should Forward (C/D/E) | False Positive (FP) | True Negative (TN) |
- False Positive Rate = FP / (FP + TN) — the critical safety metric
- Recall = TP / (TP + FN) — measures cache utilization
- Precision = TP / (TP + FP) — measures replay correctness
- F1 = harmonic mean of Precision and Recall
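These definitions are straightforward to sanity-check in code; applied to the baseline confusion matrix reported in Section 4.1, they reproduce the summary metrics in Section 4.2 exactly.

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the four summary metrics from confusion matrix counts."""
    fpr = fp / (fp + tn)        # false positive rate: the critical safety metric
    recall = tp / (tp + fn)     # cache utilization
    precision = tp / (tp + fp)  # replay correctness
    f1 = 2 * precision * recall / (precision + recall)
    return {"fpr": fpr, "recall": recall, "precision": precision, "f1": f1}
```

For the baseline matrix (TP=245, FN=205, FP=94, TN=456) this yields FPR 17.1%, recall 54.4%, precision 72.3%, and F1 62.1%.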
3.5 Configurations Tested
We evaluated four consensus filter configurations across four separate 1,000-conversation benchmark runs (4,000 total conversations):
| Config | Description | topK | minScore | Agreement | Layers A/B/C |
|---|---|---|---|---|---|
| Baseline | No consensus filter (Rules 1–5 only) | — | — | — | — |
| Consensus v1 | Strict unanimous consensus | 3 | 0.80 | 1.0 (unanimous) | Disabled |
| Consensus v2 | Tiered consensus with gap/overlap analysis | 3 | 0.82 | 1.0 | All enabled (threshold=0.93, gap=0.08, overlap=0.50) |
| Consensus v2-tight | Tightened v2 parameters | 3 | 0.85 | 1.0 | All enabled (threshold=0.96, gap=0.06, overlap=0.50) |
Consensus v2 Layers:
- Layer A (Confidence-tiered): High-confidence matches (best score > threshold) only require highConfidenceTopK=2 agreeing candidates instead of the full topK=3
- Layer B (Score gap analysis): If the gap between 1st and 2nd best score exceeds maxScoreGap, the best match is an isolated outlier — forward
- Layer C (Cross-candidate situation text similarity): Top-K candidates' stored situationText values must be pairwise similar (Jaccard > minSituationOverlap)
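A sketch of how the three layers compose, assuming candidate records carry score, action, and situationText fields (the proxy's real field names and evaluation order may differ):

```python
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def tiered_consensus(candidates: list, cfg: dict) -> bool:
    """Return True when the v2 layered filter would allow a replay."""
    eligible = sorted((c for c in candidates if c["score"] >= cfg["minScore"]),
                      key=lambda c: c["score"], reverse=True)
    if not eligible:
        return False
    best = eligible[0]["score"]
    # Layer B: a lone high scorer far above 2nd place is an isolated outlier
    if len(eligible) >= 2 and best - eligible[1]["score"] > cfg["maxScoreGap"]:
        return False
    # Layer A: very high confidence relaxes the quorum from topK to highConfidenceTopK
    k = cfg["highConfidenceTopK"] if best > cfg["highConfidenceThreshold"] else cfg["topK"]
    if len(eligible) < k:
        return False
    top = eligible[:k]
    # Layer C: stored situation texts must be pairwise similar
    for i in range(len(top)):
        for j in range(i + 1, len(top)):
            if jaccard(top[i]["situationText"], top[j]["situationText"]) < cfg["minSituationOverlap"]:
                return False
    # Base rule: unanimous agreement on the predicted action
    return len({c["action"] for c in top}) == 1
```

The Layer A branch is the one Section 5.2 identifies as the weak point: a near-miss scoring above the high-confidence threshold only needs one concordant neighbor to pass.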
4. Results
4.1 Confusion Matrices
Baseline (No Consensus Filter)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 245 | FN: 205 |
| Should Forward (C/D/E) | FP: 94 | TN: 456 |
Consensus v1 (Strict Unanimous, K=3)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 103 | FN: 347 |
| Should Forward (C/D/E) | FP: 14 | TN: 536 |
Consensus v2 (Tiered, threshold=0.93, gap=0.08)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 211 | FN: 239 |
| Should Forward (C/D/E) | FP: 53 | TN: 497 |
Consensus v2-tight (Tiered, threshold=0.96, gap=0.06)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 201 | FN: 249 |
| Should Forward (C/D/E) | FP: 52 | TN: 498 |
4.2 Summary Metrics
| Metric | Baseline | Consensus v1 | Consensus v2 | v2-tight |
|---|---|---|---|---|
| False Positive Rate | 17.1% | 2.5% | 9.6% | 9.5% |
| False Positives | 94 | 14 | 53 | 52 |
| Recall (TPR) | 54.4% | 22.9% | 46.9% | 44.7% |
| Precision | 72.3% | 88.0% | 79.9% | 79.4% |
| F1 Score | 62.1% | 36.3% | 59.1% | 57.2% |
| Cache Hit Rate | 38.5% | 7.8% | 25.5% | 24.9% |
| Token Savings | -0.2% | 10.0% | — | -2.5% |
4.3 Per-Category False Positive Analysis
The table below reports the replay rate per category. For the should-forward categories (near_miss, novel, noise) any replay is a false positive; for replay_target and paraphrase a higher rate means better recall. The near_miss category is the primary source of false positives across all configurations:
| Category | Baseline Hit% | v1 Hit% | v2 Hit% | v2-tight Hit% |
|---|---|---|---|---|
| near_miss | 62.3% | 6.3% | 30.5% | 31.1% |
| paraphrase | 1.7% | 0.8% | 1.3% | 1.3% |
| novel | 0.0% | 0.0% | 0.0% | 0.0% |
| noise | 0.0% | 0.0% | 0.0% | 0.0% |
| replay_target | 79.4% | 20.4% | 59.0% | 57.9% |
Key findings:
- Novel and noise categories achieve 0% false positive rate across ALL configurations — the existing embedding + Jaccard + intent contradiction filters are sufficient for clearly different requests
- Near-miss is the hardest category because these conversations have high semantic similarity to replay targets (same domain, similar tools, close embeddings)
- Consensus v1 reduces near-miss hit rate from 62.3% to 6.3% — an 89.9% reduction
- v2/v2-tight relax consensus for high-confidence matches, which lets near-misses through again at ~31%
4.4 Cache Warming Progression (Consensus v1)
| Round | Cache Hit % | Total Save % | Extra Turns | Guided Tokens |
|---|---|---|---|---|
| R1 | 0.0% | 1.9% | -7 | 0 |
| R2 | 12.4% | 2.8% | +9 | 5,747 |
| R3 | 13.8% | 29.9% | -42 | 3,230 |
| R4 | 6.5% | -24.4% | +21 | 1,325 |
| R5 | 5.2% | 23.3% | -17 | 0 |
Note: The hit rate fluctuation in v1 is expected at low index density. With only 200 situations indexed per round and a requirement of 3 unanimous candidates, the index is often too sparse to form consensus. In production, with thousands of indexed situations, the density would support much higher consensus rates.
5. Discussion
5.1 Why Consensus v1 Wins
The strict unanimous consensus filter (v1) achieves the best false-positive-to-recall tradeoff for three reasons:
1. Near-misses rarely produce 3 concordant high-scoring candidates. A "sell AAPL" query might produce one strong match against a cached "buy AAPL" situation (high embedding similarity), but the 2nd and 3rd candidates are more likely to be from different actions — breaking consensus.
2. The penalty for low recall is acceptable. In safety-critical domains, missing a valid replay (false negative) simply means the request goes to the LLM normally — no harm done, just no cost savings. A false positive replay, however, can cause the agent to take the wrong action.
3. Recall improves with data density. As more situations are indexed, the probability of having 3+ concordant high-scoring candidates for legitimate replay targets increases. The 22.9% recall at 200 situations per round will naturally trend toward 50%+ as the index grows.
5.2 Why Tiered Consensus (v2) Underperforms on FP Rate
We hypothesized (H3) that relaxing the consensus requirement for very high-confidence matches could recover recall without proportionally increasing false positives. This hypothesis was not supported.
The tiered approach (v2, v2-tight) reduced false positives from 94 to 52–53 (vs v1's 14) while recovering recall to 45–47%. The problem: near-miss scenarios often produce embeddings with >0.93 cosine similarity (high enough to trigger the relaxed topK=2 tier), and with only 2 candidates needed, it's much easier for a near-miss to slip through.
Tightening the high-confidence threshold from 0.93 to 0.96 and raising minScore from 0.82 to 0.85 had negligible impact (53 to 52 FPs), confirming that the near-miss embeddings regularly exceed even strict thresholds.
5.3 The Embedding Discrimination Bottleneck
The remaining 14 false positives in v1 represent a fundamental limitation: when 3 or more indexed situations with the same tool name but different intent parameters all score above 0.80 against a near-miss query, the consensus filter cannot distinguish them. This is an embedding discrimination problem — the text-embedding-3-large model maps "What is AAPL trading at?" and "What is TSLA stock price?" to very similar vectors because the semantic structure is nearly identical.
Potential future approaches to address this:
- Fine-tuned embedding models trained on domain-specific near-miss pairs
- Argument-level consensus (check tool argument similarity, not just tool name)
- Retrieval augmented discrimination using a secondary lightweight classifier
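As an illustration of the second direction, a hypothetical argument-level check is sketched below; this is a future-work sketch, not part of the evaluated system, and the 0.80 overlap threshold is an invented placeholder.

```python
def args_agree(arg_dicts: list, min_overlap: float = 0.80) -> bool:
    """Hypothetical argument-level consensus: candidates must agree on tool
    arguments, not just the tool name."""
    def items(d):
        # Flatten each argument dict into comparable key=value tokens
        return {f"{k}={v}" for k, v in d.items()}
    for i in range(len(arg_dicts)):
        for j in range(i + 1, len(arg_dicts)):
            a, b = items(arg_dicts[i]), items(arg_dicts[j])
            if not (a | b):
                continue
            if len(a & b) / len(a | b) < min_overlap:
                return False
    return True
```

With a check like this in place, a cached "buy AAPL" entry could not be replayed for a "sell AAPL" query even when 3 candidates agree on the execute_trade tool name.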
5.4 Token Savings Paradox
Interestingly, v1 achieved 10.0% total token savings despite only a 7.8% cache hit rate, while the baseline with a 38.5% hit rate achieved only -0.2% savings. This is because false positive replays generate extra recovery turns (the agent detects the wrong tool was called and course-corrects), consuming additional tokens that negate the savings. Fewer false positives lead to cleaner conversations and better net savings.
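The arithmetic behind this paradox can be illustrated with hypothetical per-conversation token costs. The hit and FP counts below come from the summary table (38.5% and 7.8% hit rates, 94 and 14 FPs); the 600/2,500/1,000 token figures are invented for illustration only.

```python
REPLAY_SAVING = 600     # tokens saved by a correct cached replay (hypothetical)
RECOVERY_COST = 2500    # extra tokens spent course-correcting a wrong replay (hypothetical)
BASELINE_TOKENS = 1000  # average tokens per forwarded conversation (hypothetical)

def net_savings(hits: int, false_positives: int, conversations: int) -> float:
    """Net token savings as a fraction of total baseline spend."""
    saved = hits * REPLAY_SAVING - false_positives * RECOVERY_COST
    return saved / (conversations * BASELINE_TOKENS)
```

With these invented costs the baseline nets -0.4% while v1 nets +1.2%, reproducing the sign pattern (though not the exact magnitudes) of the measured -0.2% and +10.0%.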
6. Conclusion
We evaluated four consensus filter configurations for LLM decision replay across 4,000 multi-turn benchmark conversations. Strict unanimous consensus (K=3, minScore=0.80) achieves the optimal safety profile:
- 2.5% false positive rate (85% reduction from baseline's 17.1%)
- 88.0% precision (when Decyra replays, it's correct 88% of the time)
- 10.0% total token savings (net positive despite conservative replaying)
- 0% false positives on the novel and noise categories
The primary remaining false positive source is the near-miss category (6.3% hit rate), which represents an embedding discrimination bottleneck rather than a consensus filter limitation.
We recommend deploying consensus v1 as the production configuration, with the expectation that recall (currently 22.9%) will improve as index density grows. The tiered consensus layers (A: confidence-tiered, B: score gap analysis, C: cross-candidate situation text similarity) are retained as configurable knobs for per-agent tuning without redeployment.
Appendix A: Implementation Details
Consensus Filter Algorithm
    function checkConsensus(candidates, config):
        sorted = candidates.sortByScore(descending)
        eligible = sorted.filter(c => c.score >= config.minScore)
        if eligible.length < config.topK:
            return FAIL("not_enough_candidates")
        topK = eligible[0:config.topK]
        actions = topK.map(extractActionKey)  // tool name or "__completion__"
        votes = countVotes(actions)
        bestAction, bestCount = maxVote(votes)
        if bestCount / topK.length < config.requiredAgreement:
            return FAIL("action_disagreement")
        return PASS(agreedAction=bestAction)
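A runnable Python transliteration of the algorithm follows; the candidate dict field names are assumptions, not the proxy's actual schema.

```python
def check_consensus(candidates: list, config: dict) -> dict:
    """Unanimous-consensus gate over the top-K vector candidates."""
    # Keep candidates at or above the score floor, highest score first
    eligible = sorted((c for c in candidates if c["score"] >= config["minScore"]),
                      key=lambda c: c["score"], reverse=True)
    if len(eligible) < config["topK"]:
        return {"pass": False, "reason": "not_enough_candidates"}
    top_k = eligible[: config["topK"]]
    # Action key: the tool name for tool calls, or the completion type
    actions = [c.get("action", "unknown") for c in top_k]
    best_action = max(set(actions), key=actions.count)
    if actions.count(best_action) / len(top_k) < config["requiredAgreement"]:
        return {"pass": False, "reason": "action_disagreement"}
    return {"pass": True, "agreedAction": best_action}
```

With requiredAgreement at 1.0, a single dissenting candidate among the top K is enough to force a forward.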
Action Key Extraction
- tool_call / mixed: tool function name (e.g., "get_stock_price")
- completion: "completion"
- embedding: "embedding"
- Missing/malformed: "unknown"
Default Production Configuration
    {
      "enabled": true,
      "topK": 3,
      "minScore": 0.80,
      "requiredAgreement": 1.0
    }
Appendix B: Full Benchmark Configurations
| Parameter | Baseline | v1 | v2 | v2-tight |
|---|---|---|---|---|
| Consensus enabled | No | Yes | Yes | Yes |
| topK | — | 3 | 3 | 3 |
| minScore | — | 0.80 | 0.82 | 0.85 |
| requiredAgreement | — | 1.0 | 1.0 | 1.0 |
| highConfidenceThreshold | — | — | 0.93 | 0.96 |
| highConfidenceTopK | — | — | 2 | 2 |
| maxScoreGap | — | — | 0.08 | 0.06 |
| minSituationOverlap | — | — | 0.50 | 0.50 |
| minConfidence (static) | 0.85 | 0.85 | 0.85 | 0.85 |
| Jaccard gate | 0.60 | 0.60 | 0.60 | 0.60 |
| Conversations | 1,000 | 1,000 | 1,000 | 1,000 |
| Concurrency | 1 | 4 | 4 | 4 |
| Total time | 208 min | 47 min | 60 min | 126 min |