Abstract
We present a consensus-based filtering mechanism for LLM agent decision replay systems that reduces false positive replay rates from 17.1% to 2.5% across 1,000 multi-turn conversations. The system requires unanimous agreement among the top-K vector similarity candidates before committing to a cached replay, leveraging the implicit probability distribution in the vector index as a confidence signal. We evaluate four configurations across 4,000 total benchmark conversations and conclude that strict unanimous consensus (K=3, minScore=0.80) provides the optimal false-positive-to-recall tradeoff for safety-critical agent deployments, with recall expected to improve naturally as index density grows.
1. Introduction
1.1 Problem Statement
LLM agent systems that cache and replay previous decisions face a fundamental safety challenge: false positive replays — situations where the system incorrectly replays a cached action for a novel or semantically different request. In domains like financial trading, healthcare, or infrastructure management, a single false positive (e.g., replaying a "buy AAPL" decision when the user said "sell AAPL") can have catastrophic consequences.
1.2 System Architecture
Decyra is a proxy layer for LLM API calls that intercepts agent requests, generates semantic embeddings of the current "situation" (user intent, chain position, tool context), and queries a Cloudflare Vectorize index for similar past situations. When a high-confidence match is found, the system can either replay the cached response directly or use "guided replay" — a compressed prompt sent to the same LLM with the cached action pattern as a template.
1.3 Prior Work
Before this study, the Decyra proxy employed five filtering rules:
- Chain position filtering (±1 step tolerance)
- Outcome-action sequence validation (prevent mismatched action types)
- prevResultType trajectory matching (soft signal for chain history)
- Intent contradiction detection (verb-based hard rejection: buy↔sell, create↔delete, etc.)
- Jaccard situation text verification (token overlap gate at 0.60 minimum)
These rules reduced obvious false positives but still allowed a 17.1% FP rate in our 1,000-conversation benchmark, primarily from "near-miss" scenarios where vector embeddings scored high but the underlying intent differed.
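The Jaccard situation text gate (rule 5) can be sketched in a few lines; the function names below are illustrative, not Decyra's actual implementation.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two situation texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def passes_jaccard_gate(query: str, cached: str, min_overlap: float = 0.60) -> bool:
    # Rule 5: reject a candidate whose token overlap falls below the gate
    return jaccard(query, cached) >= min_overlap
```

Note why this gate alone is insufficient: a structural near-miss such as "what is aapl trading at" vs "what is tsla trading at" shares 4 of 6 tokens (Jaccard ≈ 0.67) and clears the 0.60 gate.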
2. Hypothesis
H1: Requiring unanimous agreement among the top-K highest-scoring vector candidates on the predicted action (tool name or completion type) will significantly reduce false positive replays, because near-miss scenarios are unlikely to produce K concordant high-scoring candidates for the wrong action.
H2: Stricter consensus requirements will reduce recall (valid replays missed), but this tradeoff is acceptable when biasing toward zero false positives. Recall will improve naturally as the vector index accumulates more situations, providing more candidates for consensus.
H3: Tiered consensus — relaxing the K requirement for very high-confidence matches — can recover recall without proportionally increasing false positives.
3. Methodology
3.1 Benchmark Design
Each benchmark run consists of 1,000 multi-turn conversations organized into 5 rounds of 200 base conversations. Each round contains a deterministic mix of five categories:
| Category | Count/Round | Purpose | Expected Decision |
|---|---|---|---|
| A: Replay Targets | 60 | Conversations that SHOULD trigger replay — price lookups, buy recommendations, risk assessments, portfolio rebalancing | Replay |
| B: Paraphrased Variants | 30 | Same intent as A but different wording — casual, formal, terse, verbose, question styles | Replay |
| C: Near-Miss Distractors | 30 | Similar to A but require DIFFERENT tool chains — sell vs buy, different quantities, ambiguous requests | Forward |
| D: Novel/Complex | 50 | 250 unique prompts — multi-condition, hypothetical, multi-entity, strategy queries | Forward |
| E: Noise/Off-Topic | 30 | Wrong domain, edge cases, partial/ambiguous inputs | Forward |
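The deterministic per-round mix above can be expressed as a small generator; the category keys and record shape are illustrative, not the benchmark harness's actual schema.

```python
# Per-round category mix (counts sum to 200)
ROUND_MIX = {
    "replay_target": 60,  # A: should replay
    "paraphrase": 30,     # B: should replay
    "near_miss": 30,      # C: should forward
    "novel": 50,          # D: should forward
    "noise": 30,          # E: should forward
}
EXPECTED = {"replay_target": "replay", "paraphrase": "replay",
            "near_miss": "forward", "novel": "forward", "noise": "forward"}

def build_round(round_idx: int) -> list:
    """Build one round of 200 labeled conversations."""
    return [
        {"id": f"r{round_idx}-{cat}-{i}", "category": cat, "expected": EXPECTED[cat]}
        for cat, count in ROUND_MIX.items()
        for i in range(count)
    ]
```

Five such rounds yield the 1,000 conversations per benchmark run, with 450 should-replay and 550 should-forward ground-truth labels.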
3.2 Test Infrastructure
- Model: gpt-4o-mini (temperature=0, max_tokens=1024)
- Embedding: OpenAI text-embedding-3-large (1536 dimensions via Matryoshka reduction)
- Tools: 8 deterministic mock tools (get_stock_price, get_stock_history, analyze_sentiment, get_portfolio, execute_trade, get_market_summary, calculate_risk, get_earnings)
- Tickers: 10 (AAPL, MSFT, TSLA, NVDA, GOOGL, AMZN, META, JPM, V, UNH)
- Infrastructure: Cloudflare Workers + Vectorize + KV
- Vector Candidates: topK=10 returned from Vectorize per query
3.3 Benchmark Process
- Agent A (baseline): All 1,000 conversations run directly against OpenAI API
- Agent B (Decyra proxy): Run round-by-round against the Decyra proxy
- Round 1: Cold start — all cache misses, building the Vectorize index
- Rounds 2–5: Repeated patterns should hit cache with increasing frequency
- 60-second wait between rounds for Vectorize eventual consistency
- Analysis: Compare token usage, cache hit rates, and classify each replay decision against ground truth
3.4 Confusion Matrix Definition
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | True Positive (TP) | False Negative (FN) |
| Should Forward (C/D/E) | False Positive (FP) | True Negative (TN) |
- False Positive Rate = FP / (FP + TN) — the critical safety metric
- Recall = TP / (TP + FN) — measures cache utilization
- Precision = TP / (TP + FP) — measures replay correctness
- F1 = harmonic mean of Precision and Recall
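These definitions are straightforward to sanity-check in code; applied to the baseline confusion matrix reported in Section 4.1, they reproduce the summary metrics in Section 4.2 exactly.

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the four summary metrics from confusion matrix counts."""
    fpr = fp / (fp + tn)        # false positive rate: the critical safety metric
    recall = tp / (tp + fn)     # cache utilization
    precision = tp / (tp + fp)  # replay correctness
    f1 = 2 * precision * recall / (precision + recall)
    return {"fpr": fpr, "recall": recall, "precision": precision, "f1": f1}
```

For the baseline matrix (TP=245, FN=205, FP=94, TN=456) this yields FPR 17.1%, recall 54.4%, precision 72.3%, and F1 62.1%.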
3.5 Configurations Tested
We evaluated four consensus filter configurations across four separate 1,000-conversation benchmark runs (4,000 total conversations):
| Config | Description | topK | minScore | Agreement | Layers A/B/C |
|---|---|---|---|---|---|
| Baseline | No consensus filter (Rules 1–5 only) | — | — | — | — |
| Consensus v1 | Strict unanimous consensus | 3 | 0.80 | 1.0 (unanimous) | Disabled |
| Consensus v2 | Tiered consensus with gap/overlap analysis | 3 | 0.82 | 1.0 | All enabled (threshold=0.93, gap=0.08, overlap=0.50) |
| Consensus v2-tight | Tightened v2 parameters | 3 | 0.85 | 1.0 | All enabled (threshold=0.96, gap=0.06, overlap=0.50) |
Consensus v2 Layers:
- Layer A (Confidence-tiered): High-confidence matches (best score > threshold) only require highConfidenceTopK=2 agreeing candidates instead of the full topK=3
- Layer B (Score gap analysis): If the gap between 1st and 2nd best score exceeds maxScoreGap, the best match is an isolated outlier — forward
- Layer C (Cross-candidate situation text similarity): Top-K candidates' stored situationText values must be pairwise similar (Jaccard > minSituationOverlap)
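A sketch of how the three layers compose, assuming candidate records carry score, action, and situationText fields (the proxy's real field names and evaluation order may differ):

```python
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def tiered_consensus(candidates: list, cfg: dict) -> bool:
    """Return True when the v2 layered filter would allow a replay."""
    eligible = sorted((c for c in candidates if c["score"] >= cfg["minScore"]),
                      key=lambda c: c["score"], reverse=True)
    if not eligible:
        return False
    best = eligible[0]["score"]
    # Layer B: a lone high scorer far above 2nd place is an isolated outlier
    if len(eligible) >= 2 and best - eligible[1]["score"] > cfg["maxScoreGap"]:
        return False
    # Layer A: very high confidence relaxes the quorum from topK to highConfidenceTopK
    k = cfg["highConfidenceTopK"] if best > cfg["highConfidenceThreshold"] else cfg["topK"]
    if len(eligible) < k:
        return False
    top = eligible[:k]
    # Layer C: stored situation texts must be pairwise similar
    for i in range(len(top)):
        for j in range(i + 1, len(top)):
            if jaccard(top[i]["situationText"], top[j]["situationText"]) < cfg["minSituationOverlap"]:
                return False
    # Base rule: unanimous agreement on the predicted action
    return len({c["action"] for c in top}) == 1
```

The Layer A branch is the one Section 5.2 identifies as the weak point: a near-miss scoring above the high-confidence threshold only needs one concordant neighbor to pass.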
4. Results
4.1 Confusion Matrices
Baseline (No Consensus Filter)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 245 | FN: 205 |
| Should Forward (C/D/E) | FP: 94 | TN: 456 |
Consensus v1 (Strict Unanimous, K=3)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 103 | FN: 347 |
| Should Forward (C/D/E) | FP: 14 | TN: 536 |
Consensus v2 (Tiered, threshold=0.93, gap=0.08)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 211 | FN: 239 |
| Should Forward (C/D/E) | FP: 53 | TN: 497 |
Consensus v2-tight (Tiered, threshold=0.96, gap=0.06)
| | Predicted: Replay | Predicted: Forward |
|---|---|---|
| Should Replay (A/B) | TP: 201 | FN: 249 |
| Should Forward (C/D/E) | FP: 52 | TN: 498 |
4.2 Summary Metrics
| Metric | Baseline | Consensus v1 | Consensus v2 | v2-tight |
|---|---|---|---|---|
| False Positive Rate | 17.1% | 2.5% | 9.6% | 9.5% |
| False Positives | 94 | 14 | 53 | 52 |
| Recall (TPR) | 54.4% | 22.9% | 46.9% | 44.7% |
| Precision | 72.3% | 88.0% | 79.9% | 79.4% |
| F1 Score | 62.1% | 36.3% | 59.1% | 57.2% |
| Cache Hit Rate | 38.5% | 7.8% | 25.5% | 24.9% |
| Token Savings | -0.2% | 10.0% | — | -2.5% |
4.3 Per-Category False Positive Analysis
The table below reports the replay rate per category. For the should-forward categories (near_miss, novel, noise) any replay is a false positive; for replay_target and paraphrase a higher rate means better recall. The near_miss category is the primary source of false positives across all configurations:
| Category | Baseline Hit% | v1 Hit% | v2 Hit% | v2-tight Hit% |
|---|---|---|---|---|
| near_miss | 62.3% | 6.3% | 30.5% | 31.1% |
| paraphrase | 1.7% | 0.8% | 1.3% | 1.3% |
| novel | 0.0% | 0.0% | 0.0% | 0.0% |
| noise | 0.0% | 0.0% | 0.0% | 0.0% |
| replay_target | 79.4% | 20.4% | 59.0% | 57.9% |
Key findings:
- Novel and noise categories achieve 0% false positive rate across ALL configurations — the existing embedding + Jaccard + intent contradiction filters are sufficient for clearly different requests
- Near-miss is the hardest category because these conversations have high semantic similarity to replay targets (same domain, similar tools, close embeddings)
- Consensus v1 reduces near-miss hit rate from 62.3% to 6.3% — an 89.9% reduction
- v2/v2-tight relax consensus for high-confidence matches, which lets near-misses through again at ~31%
4.4 Cache Warming Progression (Consensus v1)
| Round | Cache Hit % | Total Save % | Extra Turns | Guided Tokens |
|---|---|---|---|---|
| R1 | 0.0% | 1.9% | -7 | 0 |
| R2 | 12.4% | 2.8% | +9 | 5,747 |
| R3 | 13.8% | 29.9% | -42 | 3,230 |
| R4 | 6.5% | -24.4% | +21 | 1,325 |
| R5 | 5.2% | 23.3% | -17 | 0 |
Note: The hit rate fluctuation in v1 is expected at low index density. With only 200 situations indexed per round and a requirement of 3 unanimous candidates, the index is often too sparse to form consensus. In production, with thousands of indexed situations, the density would support much higher consensus rates.
5. Discussion
5.1 Why Consensus v1 Wins
The strict unanimous consensus filter (v1) achieves the best false-positive-to-recall tradeoff for three reasons:
1. Near-misses rarely produce 3 concordant high-scoring candidates. A "sell AAPL" query might produce one strong match against a cached "buy AAPL" situation (high embedding similarity), but the 2nd and 3rd candidates are more likely to be from different actions — breaking consensus.
2. The penalty for low recall is acceptable. In safety-critical domains, missing a valid replay (false negative) simply means the request goes to the LLM normally — no harm done, just no cost savings. A false positive replay, however, can cause the agent to take the wrong action.
3. Recall improves with data density. As more situations are indexed, the probability of having 3+ concordant high-scoring candidates for legitimate replay targets increases. The 22.9% recall at 200 situations per round will naturally trend toward 50%+ as the index grows.
5.2 Why Tiered Consensus (v2) Underperforms on FP Rate
We hypothesized (H3) that relaxing the consensus requirement for very high-confidence matches could recover recall without proportionally increasing false positives. This hypothesis was not supported.
The tiered approach (v2, v2-tight) reduced false positives from 94 to 52–53 (vs v1's 14) while recovering recall to 45–47%. The problem: near-miss scenarios often produce embeddings with >0.93 cosine similarity (high enough to trigger the relaxed topK=2 tier), and with only 2 candidates needed, it's much easier for a near-miss to slip through.
Tightening the high-confidence threshold from 0.93 to 0.96 and raising minScore from 0.82 to 0.85 had negligible impact (53 to 52 FPs), confirming that the near-miss embeddings regularly exceed even strict thresholds.
5.3 The Embedding Discrimination Bottleneck
The remaining 14 false positives in v1 represent a fundamental limitation: when 3 or more indexed situations with the same tool name but different intent parameters all score above 0.80 against a near-miss query, the consensus filter cannot distinguish them. This is an embedding discrimination problem — the text-embedding-3-large model maps "What is AAPL trading at?" and "What is TSLA stock price?" to very similar vectors because the semantic structure is nearly identical.
Potential future approaches to address this:
- Fine-tuned embedding models trained on domain-specific near-miss pairs
- Argument-level consensus (check tool argument similarity, not just tool name)
- Retrieval augmented discrimination using a secondary lightweight classifier
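As an illustration of the second direction, a hypothetical argument-level check is sketched below; this is a future-work sketch, not part of the evaluated system, and the 0.80 overlap threshold is an invented placeholder.

```python
def args_agree(arg_dicts: list, min_overlap: float = 0.80) -> bool:
    """Hypothetical argument-level consensus: candidates must agree on tool
    arguments, not just the tool name."""
    def items(d):
        # Flatten each argument dict into comparable key=value tokens
        return {f"{k}={v}" for k, v in d.items()}
    for i in range(len(arg_dicts)):
        for j in range(i + 1, len(arg_dicts)):
            a, b = items(arg_dicts[i]), items(arg_dicts[j])
            if not (a | b):
                continue
            if len(a & b) / len(a | b) < min_overlap:
                return False
    return True
```

With a check like this in place, a cached "buy AAPL" entry could not be replayed for a "sell AAPL" query even when 3 candidates agree on the execute_trade tool name.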
5.4 Token Savings Paradox
Interestingly, v1 achieved 10.0% total token savings despite only a 7.8% cache hit rate, while the baseline with a 38.5% hit rate achieved only -0.2% savings. This is because false positive replays generate extra recovery turns (the agent detects the wrong tool was called and course-corrects), consuming additional tokens that negate the savings. Fewer false positives lead to cleaner conversations and better net savings.
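The arithmetic behind this paradox can be illustrated with hypothetical per-conversation token costs. The hit and FP counts below come from the summary table (38.5% and 7.8% hit rates, 94 and 14 FPs); the 600/2,500/1,000 token figures are invented for illustration only.

```python
REPLAY_SAVING = 600     # tokens saved by a correct cached replay (hypothetical)
RECOVERY_COST = 2500    # extra tokens spent course-correcting a wrong replay (hypothetical)
BASELINE_TOKENS = 1000  # average tokens per forwarded conversation (hypothetical)

def net_savings(hits: int, false_positives: int, conversations: int) -> float:
    """Net token savings as a fraction of total baseline spend."""
    saved = hits * REPLAY_SAVING - false_positives * RECOVERY_COST
    return saved / (conversations * BASELINE_TOKENS)
```

With these invented costs the baseline nets -0.4% while v1 nets +1.2%, reproducing the sign pattern (though not the exact magnitudes) of the measured -0.2% and +10.0%.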
6. Conclusion
We evaluated four consensus filter configurations for LLM decision replay across 4,000 multi-turn benchmark conversations. Strict unanimous consensus (K=3, minScore=0.80) achieves the optimal safety profile:
- 2.5% false positive rate (85% reduction from baseline's 17.1%)
- 88.0% precision (when Decyra replays, it's correct 88% of the time)
- 10.0% total token savings (net positive despite conservative replaying)
- 0% false positives on the novel and noise categories
The primary remaining false positive source is the near-miss category (6.3% hit rate), which represents an embedding discrimination bottleneck rather than a consensus filter limitation.
We recommend deploying consensus v1 as the production configuration, with the expectation that recall (currently 22.9%) will improve as index density grows. The tiered consensus layers (A: confidence-tiered, B: score gap analysis, C: cross-candidate situation text similarity) are retained as configurable knobs for per-agent tuning without redeployment.
Appendix A: Implementation Details
Consensus Filter Algorithm
    function checkConsensus(candidates, config):
        sorted = candidates.sortByScore(descending)
        eligible = sorted.filter(c => c.score >= config.minScore)
        if eligible.length < config.topK:
            return FAIL("not_enough_candidates")
        topK = eligible[0:config.topK]
        actions = topK.map(extractActionKey)  // tool name or "__completion__"
        votes = countVotes(actions)
        bestAction, bestCount = maxVote(votes)
        if bestCount / topK.length < config.requiredAgreement:
            return FAIL("action_disagreement")
        return PASS(agreedAction=bestAction)
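A runnable Python transliteration of the algorithm follows; the candidate dict field names are assumptions, not the proxy's actual schema.

```python
def check_consensus(candidates: list, config: dict) -> dict:
    """Unanimous-consensus gate over the top-K vector candidates."""
    # Keep candidates at or above the score floor, highest score first
    eligible = sorted((c for c in candidates if c["score"] >= config["minScore"]),
                      key=lambda c: c["score"], reverse=True)
    if len(eligible) < config["topK"]:
        return {"pass": False, "reason": "not_enough_candidates"}
    top_k = eligible[: config["topK"]]
    # Action key: the tool name for tool calls, or the completion type
    actions = [c.get("action", "unknown") for c in top_k]
    best_action = max(set(actions), key=actions.count)
    if actions.count(best_action) / len(top_k) < config["requiredAgreement"]:
        return {"pass": False, "reason": "action_disagreement"}
    return {"pass": True, "agreedAction": best_action}
```

With requiredAgreement at 1.0, a single dissenting candidate among the top K is enough to force a forward.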
Action Key Extraction
- tool_call / mixed: tool function name (e.g., "get_stock_price")
- completion: "completion"
- embedding: "embedding"
- Missing/malformed: "unknown"
Default Production Configuration
    {
      "enabled": true,
      "topK": 3,
      "minScore": 0.80,
      "requiredAgreement": 1.0
    }
Appendix B: Full Benchmark Configurations
| Parameter | Baseline | v1 | v2 | v2-tight |
|---|---|---|---|---|
| Consensus enabled | No | Yes | Yes | Yes |
| topK | — | 3 | 3 | 3 |
| minScore | — | 0.80 | 0.82 | 0.85 |
| requiredAgreement | — | 1.0 | 1.0 | 1.0 |
| highConfidenceThreshold | — | — | 0.93 | 0.96 |
| highConfidenceTopK | — | — | 2 | 2 |
| maxScoreGap | — | — | 0.08 | 0.06 |
| minSituationOverlap | — | — | 0.50 | 0.50 |
| minConfidence (static) | 0.85 | 0.85 | 0.85 | 0.85 |
| Jaccard gate | 0.60 | 0.60 | 0.60 | 0.60 |
| Conversations | 1,000 | 1,000 | 1,000 | 1,000 |
| Concurrency | 1 | 4 | 4 | 4 |
| Total time | 208 min | 47 min | 60 min | 126 min |