Introduction
Large Language Model (LLM) agents — autonomous systems that chain multiple API calls to accomplish tasks — present a unique cost and latency challenge. A single agent conversation may involve 3-8 sequential LLM calls, each consuming 2,000-6,000 input tokens for context. The observation that agents operating on similar tasks repeatedly make structurally identical decisions motivates a decision replay architecture: intercept, learn, and replay known-good action patterns without re-invoking the LLM from scratch.
This document describes the current Decyra proxy architecture, evaluates a Routing Random Forest (RoRF) "muscle memory" layer inspired by Not Diamond's model routing work, and proposes a simpler transition probability approach — while arguing that the current semantic matching system already captures most of the available value.
System Architecture
Overview
Decyra operates as a transparent proxy between an AI agent's SDK and the upstream LLM provider (OpenAI, Anthropic, etc.). Every LLM request passes through the proxy, which decides whether to replay a cached response, regenerate via a compressed guided prompt, or forward to the upstream provider.
Agent SDK ──► Decyra Proxy Worker ──► Upstream LLM (OpenAI, etc.)
    ▲                │
    │         ┌──────┴──────┐
    │         │ L1: KV      │  Exact hash match (~1ms)
    │         │ L2: Vec     │  Semantic situation match (~50ms)
    │         │ Guided      │  Compressed prompt to same LLM (~300ms)
    │         └──────┬──────┘
    └────────────────┘
Layer 1: Cloudflare KV (Exact Match)
The first cache layer uses deterministic SHA-256 hashing of the full message array to produce a cache key. If the exact same messages, model, temperature, and tool configuration appear again, the cached response is returned in ~1ms with zero compute cost.
Key format: resp:{kvPrefix}:{sha256(canonicalized_signature)}
This handles the simplest case: truly identical requests across repeated conversations.
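A minimal sketch of the key derivation, using the Web Crypto API available in Workers. The signature fields and the flat top-level canonicalization are illustrative; a production canonicalizer would sort keys recursively so that semantically identical requests hash identically.

```ts
// Sketch: L1 cache key via SHA-256 over a canonicalized request signature.
// Field names are assumptions, not the deployed schema.
interface RequestSignature {
  model: string;
  temperature: number;
  messages: unknown[];
  tools?: unknown[];
}

async function exactCacheKey(kvPrefix: string, sig: RequestSignature): Promise<string> {
  // A fixed property order keeps JSON.stringify deterministic at the top level.
  const canonical = JSON.stringify({
    model: sig.model,
    temperature: sig.temperature,
    messages: sig.messages,
    tools: sig.tools ?? [],
  });
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(canonical)
  );
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return `resp:${kvPrefix}:${hex}`;
}
```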
Layer 2: Cloudflare Vectorize (Semantic Situation Matching)
For semantically similar but not identical requests, the proxy generates a situation embedding — a natural language description of the current request context — and queries Cloudflare Vectorize for similar past situations.
Situation text construction (priority order):
- Last user message (the actual intent — survives truncation)
- Tool results summary (data already in context)
- System prompt summary (agent persona, truncated)
- Available tool names (capabilities, for soft matching)
The embedding model is Cloudflare Workers AI bge-base-en-v1.5 (768-dimensional, MTEB-benchmarked for text retrieval). This runs at zero network latency from a Cloudflare Worker and costs nothing on the free tier.
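A sketch of the situation-text construction and the embedding call, assuming the standard Workers AI binding (env.AI.run) with the bge-base-en-v1.5 model id. The section labels, the 200-character truncation, and the helper shape are illustrative assumptions.

```ts
// Sketch: build the situation text in priority order, then embed it.
// Labels and truncation limits are assumptions, not the deployed values.
interface Env {
  AI: Ai; // Workers AI binding (type from @cloudflare/workers-types)
}

function buildSituationText(req: {
  lastUserMessage: string;
  toolResultsSummary?: string;
  systemPromptSummary?: string;
  toolNames?: string[];
}): string {
  const parts: string[] = [`USER: ${req.lastUserMessage}`];
  if (req.toolResultsSummary) parts.push(`RESULTS: ${req.toolResultsSummary}`);
  if (req.systemPromptSummary) parts.push(`SYSTEM: ${req.systemPromptSummary.slice(0, 200)}`);
  if (req.toolNames?.length) parts.push(`TOOLS: ${req.toolNames.join(", ")}`);
  return parts.join("\n");
}

async function embedSituation(env: Env, situation: string): Promise<number[]> {
  const out = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [situation] });
  return out.data[0]; // one 768-dimensional vector per input string
}
```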
Metadata filters narrow the candidate pool deterministically; a query sketch follows the list:
- agentId — tenant isolation
- stepIndex — chain position (±1 tolerance)
- resultType — outcome type of the matched candidate
- prevResultType — chain trajectory compatibility
- modelFamily — model scope enforcement
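A sketch of the filtered Vectorize query. The binding name, topK, the "success" outcome value, and the use of $in to express the ±1 step tolerance are assumptions; note that each filtered field needs a metadata index created ahead of time (wrangler vectorize create-metadata-index).

```ts
// Sketch: Vectorize query with deterministic metadata filters.
// SITUATIONS is an assumed binding name.
interface Env {
  SITUATIONS: Vectorize;
}

interface ChainContext {
  agentId: string;
  stepIndex: number;
  prevResultType: string;
  modelFamily: string;
}

async function findCandidates(env: Env, vector: number[], ctx: ChainContext) {
  return env.SITUATIONS.query(vector, {
    topK: 5,
    returnMetadata: "all",
    filter: {
      agentId: ctx.agentId,
      // ±1 tolerance on chain position, expressed as set membership.
      stepIndex: { $in: [ctx.stepIndex - 1, ctx.stepIndex, ctx.stepIndex + 1] },
      resultType: "success", // restrict to candidates that succeeded (value assumed)
      prevResultType: ctx.prevResultType,
      modelFamily: ctx.modelFamily,
    },
  });
}
```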
Guided Replay
When a high-confidence semantic match is found, the proxy doesn't simply return the cached response body. Instead, it extracts the action pattern (tool name, argument schema, completion shape) and sends a compressed prompt to the same LLM the agent intended to use:
System: You are replaying a known decision pattern. A previous, highly similar
situation produced the action described below. Generate the equivalent
action adapted to the current context.
Action type: TOOL CALL
Tool to call: get_stock_price
Expected argument shape: {"ticker":"string"}
User: What's the current price of AAPL?
This prompt is ~500-800 tokens vs. the original ~2,000-6,000, producing real token savings while maintaining context-appropriate argument generation.
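A sketch of the prompt assembly, assuming each matched candidate stores the pattern fields shown in the example above; the ActionPattern field names are illustrative.

```ts
// Sketch: build the compressed guided prompt from a matched action pattern.
interface ActionPattern {
  actionType: "TOOL CALL" | "COMPLETION";
  toolName?: string;
  argumentShape?: string; // e.g. '{"ticker":"string"}'
}

function buildGuidedMessages(pattern: ActionPattern, lastUserMessage: string) {
  const lines = [
    "You are replaying a known decision pattern. A previous, highly similar",
    "situation produced the action described below. Generate the equivalent",
    "action adapted to the current context.",
    `Action type: ${pattern.actionType}`,
  ];
  if (pattern.toolName) lines.push(`Tool to call: ${pattern.toolName}`);
  if (pattern.argumentShape) lines.push(`Expected argument shape: ${pattern.argumentShape}`);
  return [
    { role: "system", content: lines.join("\n") },
    { role: "user", content: lastUserMessage }, // current context drives the arguments
  ];
}
```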
Chain-Aware State Tracking
The SDK maintains stateful chain tracking across sequential LLM calls within a conversation:
- stepIndex — incremented after each response
- prevActionId — the step ID of the previous response
- prevResultType — the outcome type of the previous step
These enable the proxy to match candidates not just by situation similarity but by chain trajectory — ensuring temporal coherence in multi-step agent workflows.
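A sketch of the SDK-side bookkeeping; the response field names (stepId, resultType) are assumptions.

```ts
// Sketch: chain state carried by the SDK, advanced after every response.
interface ChainState {
  stepIndex: number;
  prevActionId?: string;
  prevResultType?: string;
}

function advanceChain(
  state: ChainState,
  response: { stepId: string; resultType: string }
): ChainState {
  return {
    stepIndex: state.stepIndex + 1,
    prevActionId: response.stepId,
    prevResultType: response.resultType,
  };
}

// The SDK attaches the current state to each outgoing request (e.g. as
// headers) so the proxy can filter candidates by chain trajectory.
```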
Evaluating ML-Based Action Routing
Not Diamond's RoRF Approach
Not Diamond introduced a Routing Random Forest (RoRF) that dynamically selects the optimal LLM for each query from a pool of K models. The approach is effective because: (a) K is small (5-20 models), (b) models have stable performance profiles, and (c) clean training labels are available from benchmark evaluations.
Why RoRF Does Not Transfer to Action Routing
The classification space is fundamentally different. Not Diamond routes between ~5-20 well-defined LLM models. The action space for an agentic system is open-ended — tools are added/removed between deployments, argument schemas evolve, and the (tool × argument_shape) combinations per agent can number in the hundreds.
High-dimensional embeddings defeat Random Forests. At each split, an RF considers only a random subset of roughly √p of the p input features. With 768-dimensional embeddings, each split sees ~28 features, meaning ~96% of the semantic dimensions are invisible to any single split. RFs work best at moderate dimensionality (tens to low hundreds of features).
Cold start is severe. An RF needs 5,000-10,000 observations for decent accuracy. The current Vectorize approach learns from the first successful execution and can replay from round 2.
The RF predicts WHAT tool to call but not HOW to call it. Knowing the next action is "get_stock_price" is insufficient — the RF cannot predict the correct ticker symbol. The LLM is still needed for argument generation.
Data Efficiency Comparison
| Approach | Min. Examples | Useful From |
|---|---|---|
| KV Exact Cache | 1 | Round 2 |
| Vectorize Semantic | 1 | Round 2 |
| Transition Graph | 1 per transition | Round 2 |
| Random Forest | ~1,000-10,000 | Round N >> 2 |
Latency and Cost Comparison
| Metric | Current System | RF Layer | Transition Graph |
|---|---|---|---|
| L1 hit latency | ~1ms | ~1ms | ~1ms |
| L2 match latency | ~300ms | ~305ms | ~300ms |
| Token savings/replay | 1,200-5,200 | 1,400-5,400 | 1,200-5,200 |
| Memory overhead | ~0 | 2-4MB | ~0 |
| New tools handled | Immediately | Retrain | Immediately |
| Implementation | Deployed | Complex | ~30 LOC |
Scalable Edge Inference on Cloudflare Workers
Resource Constraints
Cloudflare Workers impose strict constraints that favor simple, stateless architectures:
- CPU time: 10ms (free) to 30ms (paid) per invocation
- Memory: 128MB maximum
- Code bundle: 1MB after compression
- No persistent state: All state in KV, D1, R2, or Vectorize
Co-located Service Architecture
The current system leverages Cloudflare's co-located service mesh:
- Workers AI (bge-base-en-v1.5): Zero-latency embedding generation from within the Worker process
- Vectorize: Managed vector database co-located with Workers (~50ms queries)
- KV: Sub-millisecond reads, globally replicated across 300+ data centers
The entire proxy — including signature extraction, KV lookup, embedding generation, Vectorize query, replay decision, and guided prompt construction — executes in a single Worker invocation with a 41KB gzipped bundle and zero external dependencies.
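A minimal sketch of how those stages compose in a single invocation, reusing the helpers sketched earlier (only their signatures are declared here). The binding names, the 0.92 replay threshold, and the upstreamUrl helper are illustrative assumptions; error handling and response persistence are omitted.

```ts
// Sketch: the full pipeline in one Worker invocation.
interface Env {
  CACHE: KVNamespace;
  AI: Ai;
  SITUATIONS: Vectorize;
}

// Signatures of helpers sketched earlier in this document (bodies omitted).
declare function extractSignature(body: unknown): unknown;
declare function exactCacheKey(prefix: string, sig: unknown): Promise<string>;
declare function buildSituationText(fields: unknown): string;
declare function embedSituation(env: Env, situation: string): Promise<number[]>;
declare function extractChainContext(request: Request): unknown;
declare function findCandidates(env: Env, vector: number[], ctx: unknown): Promise<VectorizeMatches>;
declare function guidedReplay(env: Env, match: VectorizeMatch, body: unknown): Promise<Response>;
declare function upstreamUrl(request: Request): string;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const body = await request.clone().json();

    // L1: deterministic KV lookup (~1ms).
    const key = await exactCacheKey("v1", extractSignature(body));
    const cached = await env.CACHE.get(key);
    if (cached) {
      return new Response(cached, { headers: { "content-type": "application/json" } });
    }

    // L2: semantic situation match (~50ms), then guided replay (~300ms).
    const vector = await embedSituation(env, buildSituationText(body));
    const { matches } = await findCandidates(env, vector, extractChainContext(request));
    if (matches[0] && matches[0].score >= 0.92) {
      return guidedReplay(env, matches[0], body);
    }

    // No replay possible: forward to the upstream provider.
    return fetch(upstreamUrl(request), request);
  },
};
```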
The Zero-Dependency Principle
Every additional dependency in a Worker increases cold start latency, bundle size, and failure surface. The current proxy is 41KB gzipped with zero external ML dependencies. The embedding, similarity search, and metadata filtering all use Cloudflare-native services. This is a significant architectural advantage.
Transition Probability Graph
As a simpler alternative to the RF, we consider a transition probability matrix stored in KV. For an agent at chain position (prevAction, prevResultType, stepIndex), the transition distribution gives the empirical probability of each next action.
Advantages
- Zero inference cost — single KV get (~1ms)
- No cold start — one successful chain provides the full path
- Trivial implementation — ~30 lines of code (sketched below)
- Fully interpretable — readable probability distributions
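A minimal sketch under those constraints, storing one KV entry per chain position. The key layout and binding name are assumptions, and because KV read-modify-write is not atomic, counts are approximate under concurrent writers.

```ts
// Sketch: transition probability graph in KV (~30 LOC, per the claim above).
interface Env {
  TRANSITIONS: KVNamespace;
}

type Counts = Record<string, number>; // nextAction -> observed count

function transitionKey(agentId: string, prevAction: string, prevResultType: string, stepIndex: number): string {
  return `trans:${agentId}:${prevAction}:${prevResultType}:${stepIndex}`;
}

// Record an observed transition after a step completes successfully.
async function recordTransition(env: Env, key: string, nextAction: string): Promise<void> {
  const counts = (await env.TRANSITIONS.get<Counts>(key, "json")) ?? {};
  counts[nextAction] = (counts[nextAction] ?? 0) + 1;
  await env.TRANSITIONS.put(key, JSON.stringify(counts));
}

// Return the most likely next action with its empirical probability.
async function predictNext(env: Env, key: string): Promise<{ action: string; p: number } | null> {
  const counts = await env.TRANSITIONS.get<Counts>(key, "json");
  if (!counts) return null;
  const total = Object.values(counts).reduce((a, b) => a + b, 0);
  const [action, n] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
  return { action, p: n / total };
}
```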
When Warranted
The transition graph becomes valuable only if data reveals a specific failure mode: high-confidence semantic matches that select candidates with the wrong action type. Without evidence of this, the transition graph is a solution looking for a problem.
Conclusion
The decision replay architecture benefits most from high-quality semantic situation matching with deterministic chain-aware filtering — exactly what the current Vectorize + guided replay system provides. ML-based action routing via Random Forests introduces disproportionate complexity relative to marginal gains.
The most productive path forward is empirically driven: accumulate benchmark data, identify specific failure modes, and address them with targeted interventions. The best architecture is the simplest one that captures the signal — and cosine similarity in a well-chosen embedding space, combined with deterministic metadata filters, already captures most of the learnable structure.
References
- Not Diamond Team. "Routing Random Forests for LLM Model Selection." Technical Report, 2024.
- C. Xiao et al. "C-Pack: Packaged Resources to Advance General Chinese Embedding." arXiv:2309.07597, 2024.
- Cloudflare, Inc. "Vectorize: Vector Database for Cloudflare Workers." Documentation, 2025.
- OpenAI. "Function Calling and Tool Use." API Documentation, 2025.
- L. Breiman. "Random Forests." Machine Learning, 45(1):5-32, 2001.