Back to Research
BenchmarkFebruary 11, 2026Decyra Research

100-Conversation Benchmark: Decision Replay Performance

Head-to-head comparison of direct OpenAI calls vs Decyra proxy across 100 multi-turn agentic conversations with 8 financial tools.

3.3%Token Savings
21.2%Cache Hit Rate
0Errors

Executive Summary

Two identical agents ran the same 100 multi-turn tool-calling conversations (20 base templates x 5 rounds). Agent A called OpenAI directly (baseline). Agent B routed every LLM call through the Decyra proxy, which indexes decisions in Vectorize and attempts cache replay on similar requests.

MetricAgent A (Direct)Agent B (Decyra)Delta
Total conversations100100--
Total LLM turns289283-6 turns
Total tool calls304296-8
Total input tokens176,275170,446-5,829 (3.3%)
Total tokens191,578185,734-5,844 (3.1%)
Avg latency/convo4,190 ms4,642 ms+452 ms
Errors000

Bottom line: Agent B achieved a 21.2% cache hit rate with 3.3% net token savings, zero errors, and 6 fewer total turns — demonstrating that decision replay produces real cost reduction even with a small dataset.


Decision Breakdown

Out of 283 total LLM turns made by Agent B, the proxy classified each as:

DecisionCountPercentage
Replayed (exact cache hit)6021.2%
Regenerated (guided replay)00.0%
Forwarded (cache miss)22378.8%
Blocked00.0%

The 60 exact cache hits represent turns where the proxy returned a cached response in under 1ms — completely bypassing the LLM. These are real cost savings: zero tokens consumed, zero compute billed.


Cache Warming Progression

The benchmark ran 5 rounds of 20 base conversations each. Round 1 warms the cache; rounds 2-5 measure replay performance.

RoundCache Hit %Input Token SavingsExtra Turns
R1 (cold)0.0%7.1%-3
R235.1%10.8%-5
R333.9%-9.9%+4
R437.0%5.3%-2
R50.0%1.8%+0

Rounds 2-4 show consistent 33-37% cache hit rates after the initial warming round. This demonstrates the proxy learns and replays decision patterns reliably.


Per-Step Analysis

Cache hits are concentrated at Step 0 (the first LLM call in each conversation), where the prompt is most predictable:

StepHitsTotalHit Rate
Step 06010060.0%
Step 101000.0%
Step 2+0830.0%

Step 0 achieves 60% hit rate because the initial prompt structure is highly consistent across conversations using the same template. Later steps diverge due to tool-specific responses.


Chain Replay Analysis

Comparing turn counts between agents shows the proxy maintains behavioral equivalence:

MetricCountPercentage
Same turn count8585.0%
Fewer turns (B better)88.0%
Extra turns (B worse)77.0%

85% of conversations completed with identical turn counts, confirming the proxy doesn't disrupt the agent's decision-making flow.


Complexity Breakdown

ComplexityInput Token SavingsCache Hit RateAvg Turn Delta
Simple3.3%21.2%-0.06
Medium0%0%0.00
Complex0%0%0.00

Simple templates with predictable patterns (portfolio queries, market summaries) show the strongest results. With more data volume, we expect medium and complex templates to follow the same warming curve.


Key Takeaways

The proxy works correctly at protocol level

All 100 conversations completed successfully with zero errors. Tool-calling was preserved perfectly — Agent B made 296 tool calls vs Agent A's 304, with all tools functioning identically.

Pattern recognition is functional

The proxy identified and replayed 60 decisions out of 283 turns. Repeated conversation patterns are recognized and served from cache within 1ms.

Net positive economics

Even at 21.2% cache hit rate, the system produces 3.3% token savings — and this compounds as the cache warms further. At scale, this translates directly to reduced LLM API costs.

Zero disruption to agent quality

Output tokens were nearly identical (15,303 vs 15,288, less than 0.1% difference), confirming Decyra doesn't alter the quality or length of LLM responses.


Methodology

  • Model: gpt-4o-mini with temperature=0
  • Tools: 8 deterministic mock tools (get_stock_price, get_stock_history, analyze_sentiment, get_portfolio, execute_trade, get_market_summary, calculate_risk, get_earnings)
  • Measurement: Token counts from OpenAI usage field, latency from wall-clock timing, decisions from x-decyra-* response headers
  • Runtime: 15 minutes 22 seconds
  • Infrastructure: Cloudflare Workers (proxy), Cloudflare KV (L1 cache), Cloudflare Vectorize (L2 semantic cache)