Optimising AI / LLM Cost

After proxying your AI agent's LLM calls through MockServer, you can export a structured optimisation brief — a pre-framed Markdown document that you paste directly into any LLM to get concrete, costed advice on reducing inference spend. No extra context needed; the brief contains everything the downstream LLM requires to reason about your traffic.

MockServer analyses the captured traffic offline, computing nine deterministic optimisation signals (repeated system prompts, low cache-hit rates, oversized tool results, unused tool schemas, and more) along with token counts and estimated USD costs from provider pricing tables. MockServer never calls an LLM itself: every number in the report is computed locally and deterministically. An in-product verdict (A–F grade and a "$X recoverable" headline) summarises the findings at a glance.

How It Works

Start MockServer as a proxy and capture LLM traffic — including real traffic from a headless OpenCode run
Export the optimisation brief — via the dashboard, MCP tool, or REST endpoint
Read the verdict — instant A–F grade and "$X recoverable" headline, no LLM required
Act on the detected opportunities — nine signal types, each mapped to a concrete lever

1. Start MockServer as a Proxy and Capture Traffic

Any MockServer instance can act as an HTTPS proxy — no special mode is needed. See AI Traffic Inspection for the full setup, including how to trust the MockServer CA certificate and configure standard environment variables for Node.js, Python, and other tools.

The quick-start for local use:

docker run -d --rm -p 1080:1080 mockserver/mockserver

export HTTPS_PROXY=http://localhost:1080
export NODE_EXTRA_CA_CERTS=/path/to/mockserver-ca.pem   # Node.js tools
export SSL_CERT_FILE=/path/to/mockserver-ca.pem          # Python tools

# Now run your agent — its LLM calls are captured automatically
claude   # or: opencode, python my_agent.py, etc.

MockServer captures every LLM request/response pair (including SSE-streamed completions) in its event log, and the optimisation report is built from these captured pairs. Both proxied/forwarded traffic (to a real provider) and mocked LLM responses served by MockServer itself are analysed, so you can optimise against real captured runs or against mocked conversations (for example, the data created by npm run demo). LLM traffic is recognised by request shape (the provider's API path), so recognition works regardless of the upstream host.
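When the agent is launched from a script rather than an interactive shell, the same variables can be set on the child process. The following is a minimal Python sketch, assuming the quick-start proxy on localhost:1080 and an already exported CA file. `proxied_env` is a hypothetical helper name, not part of MockServer.

```python
import os


def proxied_env(ca_path, proxy_url="http://localhost:1080", base=None):
    """Return a copy of the environment that routes HTTPS through the proxy.

    Hypothetical helper: it only sets the three variables shown in the
    quick-start and leaves every other variable untouched.
    """
    env = dict(os.environ if base is None else base)
    env["HTTPS_PROXY"] = proxy_url
    env["NODE_EXTRA_CA_CERTS"] = ca_path  # read by Node.js tools
    env["SSL_CERT_FILE"] = ca_path        # read by Python tools
    return env
```

Pass the result as `env=` to `subprocess.run([...])` when starting the agent, so the proxy settings apply to that process only and do not leak into the rest of your shell session.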

Generate real traffic with OpenCode

The offline npm run demo dataset includes a crafted, deterministic agent run that fires every optimisation signal, which makes it a good first look. To analyse your own agent's real spend instead, proxy a headless OpenCode run through MockServer and point it at a real provider.

Heads-up: this generates real, non-deterministic LLM traffic — it needs network access and your own provider API key, and the captured run varies each time. It is not wired into npm run demo, which stays offline. Never commit or share API keys.

The repository ships a helper script, mockserver-ui/scripts/demo-opencode-proxy.sh, that starts MockServer as an HTTPS proxy with a machine-local CA, extracts that CA, and prints the exact environment to run OpenCode through it:

# Start the proxy and print the run instructions
./mockserver-ui/scripts/demo-opencode-proxy.sh

# …or start the proxy AND run a one-shot OpenCode prompt through it
./mockserver-ui/scripts/demo-opencode-proxy.sh "summarise the README and suggest one improvement"

The equivalent done by hand, so you can run OpenCode in your own shell:

# 1. Start MockServer as an HTTPS proxy with a unique, machine-local CA
#    (the default CA private key is public — never trust it for real traffic).
docker run -d --rm --name mockserver-proxy -p 1080:1080 mockserver/mockserver \
  -serverPort 1080 \
  -Dmockserver.dynamicallyCreateCertificateAuthorityCertificate=true \
  -Dmockserver.directoryToSaveDynamicSSLCertificate=/dynamic-certs

# 2. Trigger CA generation, then copy the CA out of the container
curl -sk https://localhost:1080/ >/dev/null
docker exec mockserver-proxy cat /dynamic-certs/CertificateAuthorityCertificate.pem > mockserver-ca.pem

# 3. Route OpenCode through the proxy and trust the CA
export HTTPS_PROXY=http://localhost:1080
export NODE_EXTRA_CA_CERTS=$PWD/mockserver-ca.pem   # OpenCode is a Node.js tool
export SSL_CERT_FILE=$PWD/mockserver-ca.pem          # any Python helpers

# 4. Point OpenCode at a REAL provider with YOUR OWN API key, then run it headless
export OPENAI_API_KEY=sk-...            # or ANTHROPIC_API_KEY=sk-ant-...
opencode run "summarise the README and suggest one improvement"

OpenCode's LLM calls are now captured by MockServer. Open the dashboard LLM Optimise tab (immediately after Chaos in the navigation bar), or curl the report, to see signals computed from your real run:

curl -s "http://localhost:1080/mockserver/llm/optimisationReport?format=markdown"

2. Export the Optimisation Brief

Three ways to export — choose whichever fits your workflow:

Dashboard — LLM Optimise Screen

Open the MockServer dashboard and click LLM Optimise in the navigation bar (it sits immediately after Chaos):

http://localhost:1080/mockserver/dashboard

The LLM Optimise screen shows:

A session picker — pick a single upstream host, or All captured LLM traffic (the default)
A verdict banner — grade letter (A–F, colour-coded) with the "$X recoverable (Y% of spend)" headline and a one-line rationale
Hero cards — total estimated cost, input/output tokens, call count, average latency, cache-hit rate, and one-shot rate
A Detected Opportunities panel listing each signal with severity chip, estimated saving, and structured fix guidance
A per-call table
Copy verdict — copies a compact plain-text grade + fix summaries to the clipboard (built from the loaded JSON, no extra fetch)
Copy optimisation brief — copies the full Markdown brief to the clipboard
Download bundle — saves the structured JSON report

MCP Tool

If your AI agent is connected to MockServer's MCP control plane (see MCP Setup), use the export_optimisation_report tool:

{
  "method": "tools/call",
  "params": {
    "name": "export_optimisation_report",
    "arguments": {
      "format": "markdown"
    }
  }
}

Returns the full optimisation brief as Markdown text — paste it directly into any LLM.

{
  "method": "tools/call",
  "params": {
    "name": "export_optimisation_report",
    "arguments": {
      "format": "json",
      "host": "api.openai.com"
    }
  }
}

Returns the structured LlmOptimisationReport JSON bundle. The optional host parameter filters to a single upstream host.
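The tool calls above show only the method and params. MCP is built on JSON-RPC 2.0, so a raw request also carries `jsonrpc` and `id` fields; most MCP clients add these for you. If you are assembling the message by hand, a sketch along these lines may help. `tools_call` is a hypothetical helper, and the transport used to send the payload is out of scope here.

```python
import json


def tools_call(name, arguments, request_id=1):
    """Wrap an MCP tool invocation in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


payload = tools_call(
    "export_optimisation_report",
    {"format": "json", "host": "api.openai.com"},
)
print(json.dumps(payload, indent=2))
```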

REST Endpoint

Call GET /mockserver/llm/optimisationReport directly. This is a MockServer control-plane endpoint; CORS is enabled so the dashboard UI can call it even when the dashboard and control plane are on different hosts or ports.

curl -s "http://localhost:1080/mockserver/llm/optimisationReport?format=markdown"

Returns text/markdown; charset=utf-8 — the full optimisation brief, ready to paste into any LLM.

curl -s "http://localhost:1080/mockserver/llm/optimisationReport?format=json" | python3 -m json.tool

Returns application/json — the LlmOptimisationReport bundle including session metadata, per-call breakdown, detected signals, and redaction status.

# Only OpenAI traffic
curl -s "http://localhost:1080/mockserver/llm/optimisationReport?format=markdown&host=api.openai.com"

# Only Anthropic traffic, JSON
curl -s "http://localhost:1080/mockserver/llm/optimisationReport?format=json&provider=ANTHROPIC"

# A specific named session
curl -s "http://localhost:1080/mockserver/llm/optimisationReport?format=markdown&session=host%3Aapi.openai.com"

Query Parameters

Parameter	Values	Default	Description
`format`	`json` | `markdown`	`json`	Output format. `markdown` returns the pre-framed optimisation brief; `json` returns the structured `LlmOptimisationReport` bundle.
`session`	grouping key string	all captured LLM traffic	Filter to one session. Sessions are grouped by isolation key (when LLM conversation expectations with session isolation are active) or by upstream `Host` header otherwise. Example: `host:api.openai.com`.
`host`	hostname string	all hosts	Filter to a single upstream host, e.g. `api.openai.com`.
`provider`	`OPENAI` | `ANTHROPIC` | `GEMINI` | `BEDROCK` | `AZURE_OPENAI` | `OLLAMA`	all providers	Filter to one LLM provider. Provider is auto-detected from request paths.

If no LLM traffic has been captured yet, the endpoint returns HTTP 200 with an empty report (JSON) or a brief that says "no LLM traffic captured" (Markdown).
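For scripted use, the query parameters above can be assembled with the Python standard library. This is a sketch rather than an official client: `report_url` and `fetch_report` are hypothetical names, and the base URL assumes the local quick-start instance.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

REPORT_PATH = "/mockserver/llm/optimisationReport"


def report_url(base="http://localhost:1080", fmt="json", **filters):
    """Build the report URL; filters may be session, host, or provider."""
    allowed = {"session", "host", "provider"}
    unknown = set(filters) - allowed
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    query = {"format": fmt}
    query.update({k: v for k, v in filters.items() if v is not None})
    # urlencode percent-encodes reserved characters such as ':' in the session key.
    return f"{base}{REPORT_PATH}?{urlencode(query)}"


def fetch_report(url, timeout=10):
    """Fetch the report body as text (needs a running MockServer)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(report_url(fmt="markdown", session="host:api.openai.com"))
```

Letting `urlencode` handle escaping avoids hand-writing `%3A` in session keys, and rejecting unknown filter names catches typos before the request is sent.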

What the Markdown Brief Looks Like

The brief is structured in a fixed order so the downstream LLM can reason about it efficiently:

Framing preamble — tells the LLM it is a cost-optimisation expert and what to do
Verdict — A–F grade, rationale, estimated recoverable spend and token count, cache-hit rate, one-shot rate
Run summary — provider(s), model(s), call count, token totals, estimated cost, latency, tool-call count, cache-hit rate, one-shot rate
Per-call table — one row per call: model, input/output tokens, cost, latency, tools, finish reason. For proxied/forwarded traffic the per-call latencyMs is the measured upstream round-trip time (full-stream duration for streaming responses). It is 0 only when the upstream time could not be captured for that call.
Detected opportunities — each detected signal as a section with title, detail, affected call indices, estimated saving, recommendation, and fix guidance (action, config snippet or example expectation, and a docs link where available)
Conversations and tool definitions (appendix) — redacted messages and tool schemas so the LLM has the raw material to propose concrete edits

Once you have the brief, paste it into any LLM. No additional context is needed.

3. The Verdict

Every report includes a deterministic verdict — an A–F grade and a "$X recoverable" headline — computed from the detected signals without any LLM call. The grade is your quick answer to "how much is there to fix here?"

Grade	Meaning
A	<10% of spend is recoverable — well optimised
B	10–25% recoverable — or the grade would be A but at least one HIGH-severity finding exists (the floor is B even when the dollar saving is near zero)
C	25–40% recoverable
D	40–55% recoverable
F	>55% recoverable — significant inefficiency detected
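The grade bands can be read as a simple threshold function. The sketch below is illustrative only and is not MockServer's implementation: the table does not say which band owns the exact 25% and 40% boundaries, so assigning them to the lower grade is an assumption.

```python
def grade(recoverable_fraction, high_severity_findings=0):
    """Map the recoverable share of spend (0.0 to 1.0) to a letter grade."""
    if recoverable_fraction < 0.10:
        # An A drops to B when at least one HIGH-severity finding exists.
        return "B" if high_severity_findings > 0 else "A"
    if recoverable_fraction < 0.25:   # assumption: exactly 25% counts as C
        return "B"
    if recoverable_fraction < 0.40:   # assumption: exactly 40% counts as D
        return "C"
    if recoverable_fraction <= 0.55:  # F is strictly more than 55%
        return "D"
    return "F"


print(grade(0.05), grade(0.05, high_severity_findings=1), grade(0.60))
```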

The "recoverable" figure is an estimate of how much spend could be saved by acting on the detected signals. It is calculated using per-call MAX attribution: each call's contribution is the maximum saving across all signals that affect it (not a sum), so the total can never exceed the actual session spend. Treat it as a directional estimate, not an invoice.
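To see why MAX attribution avoids double counting, consider a sketch in Python. The input shapes are hypothetical (each signal is modelled as a mapping from call index to estimated saving), and the explicit cap at the call's own cost is an assumption that makes the "never exceeds session spend" property visible in code.

```python
from collections import defaultdict


def recoverable_usd(signals, call_costs):
    """Estimate recoverable spend using per-call MAX attribution.

    signals:    list of dicts, each mapping call index -> estimated USD
                saving on that call (hypothetical input shape).
    call_costs: dict mapping call index -> actual USD cost of the call.
    """
    best = defaultdict(float)
    for savings_by_call in signals:
        for call, saving in savings_by_call.items():
            # Keep only the largest saving claimed for each call.
            best[call] = max(best[call], saving)
    # Assumption: cap each saving at the call's cost, so the total
    # cannot exceed what the session actually spent.
    return sum(min(saving, call_costs[call]) for call, saving in best.items())


signals = [{0: 0.30, 1: 0.10}, {0: 0.20, 2: 0.05}]
costs = {0: 0.50, 1: 0.25, 2: 0.25}
print(round(recoverable_usd(signals, costs), 2))
```

In this example two signals both touch call 0. Summing their savings would count 0.50 for that call, whereas MAX attribution counts only the larger 0.30.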

The rationale line explains the grade in plain English, for example: "Grade B — an estimated 18% of spend ($1.42) is recoverable across 3 findings (1 high, 2 medium)."

Session KPIs

Two headline metrics appear in the hero cards and run summary:

Cache hit rate — the fraction of input tokens served from cache (cachedInputTokens / inputTokens). A rate below 50% on a session with repeated prompts suggests caching is not enabled.
One-shot rate — the fraction of calls that were not retries. A windowed retry detector (window of 3) identifies calls that duplicate a recent request; a low one-shot rate points to retry loops or duplicate request patterns worth de-duplicating.
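Both KPIs reduce to short calculations. The sketch below uses the `inputTokens` and `cachedInputTokens` names from the formula above; the `fingerprint` key, the session-wide aggregation, and the reading of "window of 3" as "matches one of the previous three calls" are assumptions made for illustration.

```python
def cache_hit_rate(calls):
    """Fraction of input tokens served from cache across the session."""
    total_input = sum(c["inputTokens"] for c in calls)
    if total_input == 0:
        return 0.0
    return sum(c["cachedInputTokens"] for c in calls) / total_input


def one_shot_rate(calls, window=3):
    """Fraction of calls that do not duplicate one of the previous `window` calls."""
    if not calls:
        return 1.0  # assumption: an empty session has no retries
    retries = 0
    for i, call in enumerate(calls):
        recent = calls[max(0, i - window):i]
        if any(call["fingerprint"] == prev["fingerprint"] for prev in recent):
            retries += 1
    return (len(calls) - retries) / len(calls)


calls = [
    {"inputTokens": 1000, "cachedInputTokens": 0, "fingerprint": "a"},
    {"inputTokens": 1000, "cachedInputTokens": 500, "fingerprint": "b"},
    {"inputTokens": 1000, "cachedInputTokens": 250, "fingerprint": "a"},
    {"inputTokens": 1000, "cachedInputTokens": 0, "fingerprint": "c"},
]
print(cache_hit_rate(calls), one_shot_rate(calls))
```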

4. The Nine Optimisation Signals

MockServer analyses the captured calls and emits up to nine deterministic signals, sorted by urgency (a combination of severity and how many calls are affected). Each signal names the problem, quantifies it in tokens and estimated USD, and provides structured fix guidance — including a copy-paste config snippet or example expectation where relevant.

Signal	Severity	What it detects	How to fix it
`REPEATED_SYSTEM_PROMPT`	HIGH / MEDIUM	The same system prompt (identified by a fingerprint) is resent on two or more calls, re-paying for the same input tokens each turn. HIGH when the prompt is large (≥1,000 tokens) and repeated three or more times.	Enable provider prompt caching, or move the static context into a retrieval tool so it is only fetched when needed. For Anthropic, the fix includes a ready-to-paste `cache_control` snippet.
`LARGE_STATIC_CONTEXT_RESENT`	HIGH	A large context block (≥2,000 tokens) is resent across two or more calls instead of being cached or retrieved on demand.	Move the large static context into a retrieval tool or enable prompt caching so it is sent once, not every turn.
`DETERMINISTIC_TOOL_CALL`	MEDIUM	The same tool is called with the same arguments on two or more separate calls, making the LLM an unnecessary intermediary for a deterministic lookup.	Replace the LLM-mediated step with a direct HTTP or MCP endpoint call and feed the result back deterministically. An example MockServer expectation is included in the fix guidance.
`OVERSIZED_TOOL_RESULT`	MEDIUM	A tool returned ≥1,000 tokens, which are then re-sent as input on every subsequent turn, inflating cost.	Trim or summarise the tool output before returning it to the model so only the relevant fields are sent.
`OUTPUT_TOKEN_BLOAT`	LOW	One or more calls produced far more output than the median (either ≥1,500 tokens absolute, or ≥3× the median output for the session).	Constrain output with `max_tokens` or a stricter `response_format` / JSON schema so the model returns only what is needed. A ready-to-paste config snippet is included.
`DUPLICATE_CONSECUTIVE_CALL`	MEDIUM	Consecutive calls with a near-identical request shape (same path, model, message count, system prompt fingerprint, and input token count) suggest retries that re-pay for the same work.	De-duplicate or cache identical requests, and only retry on genuine transient errors with backoff.
`LOW_CACHE_HIT_RATE`	HIGH / MEDIUM	The session has a repeated cacheable prompt prefix (same system-prompt fingerprint on two or more calls) but the cache-hit rate is below 50%. Fires only when there are tokens not yet being cached. HIGH when the un-cached token count is ≥2,000 and the cache-hit rate is below 20%; MEDIUM otherwise.	Enable prompt caching for the static prefix. For Anthropic, the fix provides a ready-to-paste `cache_control:{type:ephemeral}` snippet for the system block. For OpenAI and Gemini, automatic prefix caching applies — keep the static prefix byte-identical and place it first; do not interleave volatile content before it.
`MODEL_OVERSPEND`	LOW	Two or more calls produced short outputs (<256 tokens) with no tool calls and no reasoning tokens — "trivial" work — on a model whose blended rate is more than 30% above the provider's cheapest available model.	Switch those calls to the cheaper model. The fix names the specific model and saving percentage, for example: "these 5 calls on `claude-opus-4-6` produced <256-token outputs with no tools or reasoning — a smaller model such as `claude-haiku-4` would likely suffice at ~70% lower cost."
`UNUSED_TOOL_SCHEMA`	MEDIUM / LOW	Tool definitions are sent in the `tools` array on two or more calls but never invoked anywhere in the session. The unused schema tokens are paid for as input on each call. MEDIUM when the total wasted tokens across the session is ≥1,000; LOW otherwise.	Remove the unused tool definitions from `tools`. The fix lists up to five unused tool names and the approximate token saving per call.
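As an illustration of how one of these signals could be derived, here is a sketch of the `REPEATED_SYSTEM_PROMPT` rule together with the Anthropic-style fix. It is not MockServer's code: the SHA-256 fingerprint, the `system` and `systemTokens` keys, and the reading of "repeated three or more times" as "sent on three or more calls" are all assumptions.

```python
import hashlib
from collections import Counter


def fingerprint(system_prompt):
    """Stable fingerprint of a system prompt (SHA-256 is an assumption)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]


def repeated_system_prompts(calls):
    """Return (fingerprint, count, severity) for prompts sent on 2+ calls.

    Each call is a dict with hypothetical keys "system" (prompt text)
    and "systemTokens" (token count of that prompt).
    """
    counts = Counter(fingerprint(c["system"]) for c in calls)
    tokens = {fingerprint(c["system"]): c["systemTokens"] for c in calls}
    findings = []
    for fp, count in counts.items():
        if count < 2:
            continue
        # HIGH only for large prompts (>= 1,000 tokens) seen 3+ times.
        high = tokens[fp] >= 1000 and count >= 3
        findings.append((fp, count, "HIGH" if high else "MEDIUM"))
    return findings


def cached_system_block(text):
    """Anthropic-style system block marked cacheable via cache_control."""
    return [{"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}]
```

Sending the system prompt as `cached_system_block(prompt)` in an Anthropic Messages request marks the static prefix as cacheable, which is the lever the table recommends for this signal.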

Redaction

The optimisation report always strips the following sensitive headers from captured requests and responses before they appear in either the JSON bundle or the Markdown brief:

Authorization (Bearer tokens, Basic auth)
x-api-key / api-key
Cookie / Set-Cookie
Proxy-Authorization

Body fields are redacted according to the mockserver.fixtureBodyRedactFields configuration property — the same setting used by record_llm_fixtures. The report includes a redaction object describing what was stripped, so you know exactly what the downstream LLM will not see.

Body content is not automatically redacted beyond the configured field list. Review the Markdown brief before pasting it into an external LLM if your prompts or tool results contain sensitive data not covered by fixtureBodyRedactFields.
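If you want an extra scrubbing pass of your own before sharing a JSON bundle, the same two rules are easy to approximate. The sketch below is an assumption-laden illustration, not MockServer's redaction code: the placeholder text and the recursive treatment of nested fields are choices made here.

```python
SENSITIVE_HEADERS = {
    "authorization", "x-api-key", "api-key",
    "cookie", "set-cookie", "proxy-authorization",
}
PLACEHOLDER = "[REDACTED]"  # assumption: the real marker may differ


def strip_headers(headers):
    """Drop sensitive headers, matching names case-insensitively."""
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}


def redact_body(value, fields):
    """Recursively replace the values of the named JSON fields."""
    if isinstance(value, dict):
        return {
            k: PLACEHOLDER if k in fields else redact_body(v, fields)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact_body(item, fields) for item in value]
    return value


headers = {"Authorization": "Bearer abc", "Content-Type": "application/json"}
body = {"user": {"email": "a@example.com"}, "messages": [{"content": "hi"}]}
print(strip_headers(headers))
print(redact_body(body, {"email"}))
```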

Configuration

Property	Default	Description
`mockserver.llmOptimisationMaxCalls`	`200`	Maximum number of captured LLM calls included in a single report. When the captured session exceeds this limit, only the most recent calls are analysed. Increase this for long agent runs; reduce it to keep report size manageable when pasting into an LLM with a limited context window.
`mockserver.fixtureBodyRedactFields`	(empty)	Comma-separated list of JSON body field names to redact in the brief (in addition to the always-stripped sensitive headers). Reuses the same setting as `record_llm_fixtures`. See Configuration Properties.

Current Limitations

Cost estimates — costs are computed from MockServer's built-in provider pricing tables (LlmPricing), which may lag provider price changes. The costIsEstimated field in the JSON bundle is true when the provider did not return real usage tokens and MockServer estimated them from the decoded text; it is false when the provider returned real usage data. Treat all cost figures as directional estimates.
Non-LLM traffic — only traffic where MockServer can detect a supported LLM provider (OpenAI, Anthropic, Gemini, Bedrock, Azure OpenAI, Ollama), by API path shape, is included — whether mocked or proxied. Other traffic is ignored.
Session grouping — in v1, the optimisation report groups proxied LLM traffic by upstream host, so the dashboard LLM Optimise session picker offers host-based sessions (and "All captured LLM traffic"). The session REST parameter accepts either the composite host:<host> key or the bare host.

AI Traffic Inspection — proxy setup, CA trust, and how traffic is captured
MCP Setup — connect your AI agent to MockServer's MCP control plane to use the export_optimisation_report tool
MCP Tools Reference — full parameter documentation for export_optimisation_report and related tools
LLM Response Mocking — mock LLM API responses for deterministic, offline testing
Configuration Properties — mockserver.llmOptimisationMaxCalls and mockserver.fixtureBodyRedactFields

AI Integration — See Also

MCP Setup — connect Claude Code, Cursor, Windsurf, Cline, or OpenCode to MockServer's built-in MCP endpoint
MCP Tools Reference — full documentation of all MCP tools, parameters, and resources
Debugging with AI — workflows for using AI assistants to debug HTTP traffic via MCP
AI Traffic Inspection — inspect and record LLM/MCP traffic for debugging and deterministic replay
OpenAPI Contract Verification — verify recorded traffic and run contract/resiliency tests against an OpenAPI spec
OpenAPI for AI — use MockServer's OpenAPI spec as a fallback for AI tools without MCP support
AI Protocol Mocking (MCP & A2A) — mock MCP servers and A2A agents your AI application depends on
LLM Response Mocking — mock LLM API responses from OpenAI, Anthropic, Gemini, Bedrock, Azure OpenAI, and Ollama with provider-correct formatting, streaming, conversations, and chaos
LLM Cost Optimisation — export a one-click optimisation brief (Markdown) or JSON bundle from captured LLM traffic to find ways to cut inference cost
llms.txt — machine-readable index of MockServer documentation for AI assistants and LLMs