Introduction
Large Language Model (LLM) agents — autonomous systems that chain multiple API calls to accomplish tasks — present a unique cost and latency challenge. A single agent conversation may involve 3-8 sequential LLM calls, each consuming 2,000-6,000 input tokens for context. The observation that agents operating on similar tasks repeatedly make structurally identical decisions motivates a decision replay architecture: intercept, learn, and replay known-good action patterns without re-invoking the LLM from scratch.
This document describes the current Decyra proxy architecture, evaluates a Routing Random Forest (RoRF) "muscle memory" layer inspired by Not Diamond's model routing work, and proposes a simpler transition probability approach — while arguing that the current semantic matching system already captures most of the available value.
System Architecture
Overview
Decyra operates as a transparent proxy between an AI agent's SDK and the upstream LLM provider (OpenAI, Anthropic, etc.). Every LLM request passes through the proxy, which decides whether to replay a cached response, regenerate via a compressed guided prompt, or forward to the upstream provider.
Agent SDK ──► Decyra Proxy Worker ──► Upstream LLM (OpenAI, etc.)
    ▲                │
    │         ┌──────┴──────┐
    │         │ L1: KV      │  Exact hash match (~1ms)
    │         │ L2: Vec     │  Semantic situation match (~50ms)
    │         │ Guided      │  Compressed prompt to same LLM (~300ms)
    │         └──────┬──────┘
    └────────────────┘
Layer 1: Cloudflare KV (Exact Match)
The first cache layer uses deterministic SHA-256 hashing of the full message array to produce a cache key. If the exact same messages, model, temperature, and tool configuration appear again, the cached response is returned in ~1ms with zero compute cost.
Key format: resp:{kvPrefix}:{sha256(canonicalized_signature)}
This handles the simplest case: truly identical requests across repeated conversations.
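A minimal sketch of the key derivation, using the Web Crypto API available in Workers. The signature fields and the flat top-level canonicalization are illustrative; a production canonicalizer would sort keys recursively so that semantically identical requests hash identically.

```ts
// Sketch: L1 cache key via SHA-256 over a canonicalized request signature.
// Field names are assumptions, not the deployed schema.
interface RequestSignature {
  model: string;
  temperature: number;
  messages: unknown[];
  tools?: unknown[];
}

async function exactCacheKey(kvPrefix: string, sig: RequestSignature): Promise<string> {
  // A fixed property order keeps JSON.stringify deterministic at the top level.
  const canonical = JSON.stringify({
    model: sig.model,
    temperature: sig.temperature,
    messages: sig.messages,
    tools: sig.tools ?? [],
  });
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(canonical)
  );
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return `resp:${kvPrefix}:${hex}`;
}
```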
Layer 2: Cloudflare Vectorize (Semantic Situation Matching)
For semantically similar but not identical requests, the proxy generates a situation embedding — a natural language description of the current request context — and queries Cloudflare Vectorize for similar past situations.
Situation text construction (priority order):
- Last user message (the actual intent — survives truncation)
- Tool results summary (data already in context)
- System prompt summary (agent persona, truncated)
- Available tool names (capabilities, for soft matching)
The embedding model is Cloudflare Workers AI bge-base-en-v1.5 (768-dimensional, MTEB-benchmarked for text retrieval). This runs at zero network latency from a Cloudflare Worker and costs nothing on the free tier.
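A sketch of the situation-text construction and the embedding call, assuming the standard Workers AI binding (env.AI.run) with the bge-base-en-v1.5 model id. The section labels, the 200-character truncation, and the helper shape are illustrative assumptions.

```ts
// Sketch: build the situation text in priority order, then embed it.
// Labels and truncation limits are assumptions, not the deployed values.
interface Env {
  AI: Ai; // Workers AI binding (type from @cloudflare/workers-types)
}

function buildSituationText(req: {
  lastUserMessage: string;
  toolResultsSummary?: string;
  systemPromptSummary?: string;
  toolNames?: string[];
}): string {
  const parts: string[] = [`USER: ${req.lastUserMessage}`];
  if (req.toolResultsSummary) parts.push(`RESULTS: ${req.toolResultsSummary}`);
  if (req.systemPromptSummary) parts.push(`SYSTEM: ${req.systemPromptSummary.slice(0, 200)}`);
  if (req.toolNames?.length) parts.push(`TOOLS: ${req.toolNames.join(", ")}`);
  return parts.join("\n");
}

async function embedSituation(env: Env, situation: string): Promise<number[]> {
  const out = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [situation] });
  return out.data[0]; // one 768-dimensional vector per input string
}
```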
Metadata filters narrow the candidate pool deterministically; a query sketch follows the list:
- agentId — tenant isolation
- stepIndex — chain position (±1 tolerance)
- resultType — outcome type of the matched candidate
- prevResultType — chain trajectory compatibility
- modelFamily — model scope enforcement
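A sketch of the filtered Vectorize query. The binding name, topK, the "success" outcome value, and the use of $in to express the ±1 step tolerance are assumptions; note that each filtered field needs a metadata index created ahead of time (wrangler vectorize create-metadata-index).

```ts
// Sketch: Vectorize query with deterministic metadata filters.
// SITUATIONS is an assumed binding name.
interface Env {
  SITUATIONS: Vectorize;
}

interface ChainContext {
  agentId: string;
  stepIndex: number;
  prevResultType: string;
  modelFamily: string;
}

async function findCandidates(env: Env, vector: number[], ctx: ChainContext) {
  return env.SITUATIONS.query(vector, {
    topK: 5,
    returnMetadata: "all",
    filter: {
      agentId: ctx.agentId,
      // ±1 tolerance on chain position, expressed as set membership.
      stepIndex: { $in: [ctx.stepIndex - 1, ctx.stepIndex, ctx.stepIndex + 1] },
      resultType: "success", // restrict to candidates that succeeded (value assumed)
      prevResultType: ctx.prevResultType,
      modelFamily: ctx.modelFamily,
    },
  });
}
```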
Guided Replay
When a high-confidence semantic match is found, the proxy doesn't simply return the cached response body. Instead, it extracts the action pattern (tool name, argument schema, completion shape) and sends a compressed prompt to the same LLM the agent intended to use:
System: You are replaying a known decision pattern. A previous, highly similar
situation produced the action described below. Generate the equivalent
action adapted to the current context.
Action type: TOOL CALL
Tool to call: get_stock_price
Expected argument shape: {"ticker":"string"}
User: What's the current price of AAPL?
This prompt is ~500-800 tokens vs. the original ~2,000-6,000, producing real token savings while maintaining context-appropriate argument generation.
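A sketch of the prompt assembly, assuming each matched candidate stores the pattern fields shown in the example above; the ActionPattern field names are illustrative.

```ts
// Sketch: build the compressed guided prompt from a matched action pattern.
interface ActionPattern {
  actionType: "TOOL CALL" | "COMPLETION";
  toolName?: string;
  argumentShape?: string; // e.g. '{"ticker":"string"}'
}

function buildGuidedMessages(pattern: ActionPattern, lastUserMessage: string) {
  const lines = [
    "You are replaying a known decision pattern. A previous, highly similar",
    "situation produced the action described below. Generate the equivalent",
    "action adapted to the current context.",
    `Action type: ${pattern.actionType}`,
  ];
  if (pattern.toolName) lines.push(`Tool to call: ${pattern.toolName}`);
  if (pattern.argumentShape) lines.push(`Expected argument shape: ${pattern.argumentShape}`);
  return [
    { role: "system", content: lines.join("\n") },
    { role: "user", content: lastUserMessage }, // current context drives the arguments
  ];
}
```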
Chain-Aware State Tracking
The SDK maintains stateful chain tracking across sequential LLM calls within a conversation:
- stepIndex — incremented after each response
- prevActionId — the step ID of the previous response
- prevResultType — the outcome type of the previous step
These enable the proxy to match candidates not just by situation similarity but by chain trajectory — ensuring temporal coherence in multi-step agent workflows.
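A sketch of the SDK-side bookkeeping; the response field names (stepId, resultType) are assumptions.

```ts
// Sketch: chain state carried by the SDK, advanced after every response.
interface ChainState {
  stepIndex: number;
  prevActionId?: string;
  prevResultType?: string;
}

function advanceChain(
  state: ChainState,
  response: { stepId: string; resultType: string }
): ChainState {
  return {
    stepIndex: state.stepIndex + 1,
    prevActionId: response.stepId,
    prevResultType: response.resultType,
  };
}

// The SDK attaches the current state to each outgoing request (e.g. as
// headers) so the proxy can filter candidates by chain trajectory.
```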
Evaluating ML-Based Action Routing
Not Diamond's RoRF Approach
Not Diamond introduced a Routing Random Forest (RoRF) that dynamically selects the optimal LLM for each query from a pool of K models. The approach is effective because: (a) K is small (5-20 models), (b) models have stable performance profiles, and (c) clean training labels are available from benchmark evaluations.
Why RoRF Does Not Transfer to Action Routing
The classification space is fundamentally different. Not Diamond routes between ~5-20 well-defined LLM models. The action space for an agentic system is open-ended — tools are added/removed between deployments, argument schemas evolve, and the (tool × argument_shape) combinations per agent can number in the hundreds.
High-dimensional embeddings defeat Random Forests. At each split, an RF considers only a random subset of roughly √p of the p input features. With 768-dimensional embeddings, each split sees ~28 features, meaning ~96% of the semantic dimensions are invisible to any single split. RFs work best at moderate dimensionality (tens to low hundreds of features).
Cold start is severe. An RF needs 5,000-10,000 observations for decent accuracy. The current Vectorize approach learns from the first successful execution and can replay from round 2.
The RF predicts WHAT tool to call but not HOW to call it. Knowing the next action is "get_stock_price" is insufficient — the RF cannot predict the correct ticker symbol. The LLM is still needed for argument generation.
Data Efficiency Comparison
| Approach | Min. Examples | Useful From |
|---|---|---|
| KV Exact Cache | 1 | Round 2 |
| Vectorize Semantic | 1 | Round 2 |
| Transition Graph | 1 per transition | Round 2 |
| Random Forest | ~1,000-10,000 | Round N >> 2 |
Latency and Cost Comparison
| Metric | Current System | RF Layer | Transition Graph |
|---|---|---|---|
| L1 hit latency | ~1ms | ~1ms | ~1ms |
| L2 match latency | ~300ms | ~305ms | ~300ms |
| Token savings/replay | 1,200-5,200 | 1,400-5,400 | 1,200-5,200 |
| Memory overhead | ~0 | 2-4MB | ~0 |
| New tools handled | Immediately | Retrain | Immediately |
| Implementation | Deployed | Complex | ~30 LOC |
Scalable Edge Inference on Cloudflare Workers
Resource Constraints
Cloudflare Workers impose strict constraints that favor simple, stateless architectures:
- CPU time: 10ms (free) to 30ms (paid) per invocation
- Memory: 128MB maximum
- Code bundle: 1MB after compression
- No persistent state: All state in KV, D1, R2, or Vectorize
Co-located Service Architecture
The current system leverages Cloudflare's co-located service mesh:
- Workers AI (bge-base-en-v1.5): Zero-latency embedding generation from within the Worker process
- Vectorize: Managed vector database co-located with Workers (~50ms queries)
- KV: Sub-millisecond reads, globally replicated across 300+ data centers
The entire proxy — including signature extraction, KV lookup, embedding generation, Vectorize query, replay decision, and guided prompt construction — executes in a single Worker invocation with a 41KB gzipped bundle and zero external dependencies.
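A minimal sketch of how those stages compose in a single invocation, reusing the helpers sketched earlier (only their signatures are declared here). The binding names, the 0.92 replay threshold, and the upstreamUrl helper are illustrative assumptions; error handling and response persistence are omitted.

```ts
// Sketch: the full pipeline in one Worker invocation.
interface Env {
  CACHE: KVNamespace;
  AI: Ai;
  SITUATIONS: Vectorize;
}

// Signatures of helpers sketched earlier in this document (bodies omitted).
declare function extractSignature(body: unknown): unknown;
declare function exactCacheKey(prefix: string, sig: unknown): Promise<string>;
declare function buildSituationText(fields: unknown): string;
declare function embedSituation(env: Env, situation: string): Promise<number[]>;
declare function extractChainContext(request: Request): unknown;
declare function findCandidates(env: Env, vector: number[], ctx: unknown): Promise<VectorizeMatches>;
declare function guidedReplay(env: Env, match: VectorizeMatch, body: unknown): Promise<Response>;
declare function upstreamUrl(request: Request): string;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const body = await request.clone().json();

    // L1: deterministic KV lookup (~1ms).
    const key = await exactCacheKey("v1", extractSignature(body));
    const cached = await env.CACHE.get(key);
    if (cached) {
      return new Response(cached, { headers: { "content-type": "application/json" } });
    }

    // L2: semantic situation match (~50ms), then guided replay (~300ms).
    const vector = await embedSituation(env, buildSituationText(body));
    const { matches } = await findCandidates(env, vector, extractChainContext(request));
    if (matches[0] && matches[0].score >= 0.92) {
      return guidedReplay(env, matches[0], body);
    }

    // No replay possible: forward to the upstream provider.
    return fetch(upstreamUrl(request), request);
  },
};
```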
The Zero-Dependency Principle
Every additional dependency in a Worker increases cold start latency, bundle size, and failure surface. The current proxy is 41KB gzipped with zero external ML dependencies. The embedding, similarity search, and metadata filtering all use Cloudflare-native services. This is a significant architectural advantage.
Transition Probability Graph
As a simpler alternative to the RF, we consider a transition probability matrix stored in KV. For an agent at chain position (prevAction, prevResultType, stepIndex), the transition distribution gives the empirical probability of each next action.
Advantages
- Zero inference cost — single KV get (~1ms)
- No cold start — one successful chain provides the full path
- Trivial implementation — ~30 lines of code (sketched below)
- Fully interpretable — readable probability distributions
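A minimal sketch under those constraints, storing one KV entry per chain position. The key layout and binding name are assumptions, and because KV read-modify-write is not atomic, counts are approximate under concurrent writers.

```ts
// Sketch: transition probability graph in KV (~30 LOC, per the claim above).
interface Env {
  TRANSITIONS: KVNamespace;
}

type Counts = Record<string, number>; // nextAction -> observed count

function transitionKey(agentId: string, prevAction: string, prevResultType: string, stepIndex: number): string {
  return `trans:${agentId}:${prevAction}:${prevResultType}:${stepIndex}`;
}

// Record an observed transition after a step completes successfully.
async function recordTransition(env: Env, key: string, nextAction: string): Promise<void> {
  const counts = (await env.TRANSITIONS.get<Counts>(key, "json")) ?? {};
  counts[nextAction] = (counts[nextAction] ?? 0) + 1;
  await env.TRANSITIONS.put(key, JSON.stringify(counts));
}

// Return the most likely next action with its empirical probability.
async function predictNext(env: Env, key: string): Promise<{ action: string; p: number } | null> {
  const counts = await env.TRANSITIONS.get<Counts>(key, "json");
  if (!counts) return null;
  const total = Object.values(counts).reduce((a, b) => a + b, 0);
  const [action, n] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
  return { action, p: n / total };
}
```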
When Warranted
The transition graph becomes valuable only if data reveals a specific failure mode: high-confidence semantic matches that select candidates with the wrong action type. Without evidence of this, the transition graph is a solution looking for a problem.
Conclusion
The decision replay architecture benefits most from high-quality semantic situation matching with deterministic chain-aware filtering — exactly what the current Vectorize + guided replay system provides. ML-based action routing via Random Forests introduces disproportionate complexity relative to marginal gains.
The most productive path forward is empirically driven: accumulate benchmark data, identify specific failure modes, and address them with targeted interventions. The best architecture is the simplest one that captures the signal — and cosine similarity in a well-chosen embedding space, combined with deterministic metadata filters, already captures most of the learnable structure.
References
- Not Diamond Team. "Routing Random Forests for LLM Model Selection." Technical Report, 2024.
- C. Xiao et al. "C-Pack: Packaged Resources to Advance General Chinese Embedding." arXiv:2309.07597, 2024.
- Cloudflare, Inc. "Vectorize: Vector Database for Cloudflare Workers." Documentation, 2025.
- OpenAI. "Function Calling and Tool Use." API Documentation, 2025.
- L. Breiman. "Random Forests." Machine Learning, 45(1):5-32, 2001.