2026-06-07
Prompt caching: structuring prompts and measuring savings
How to package stable system/user prompts where providers support caching, and how to estimate token savings.
Several vendors (OpenAI, Anthropic, Google, …) expose discounted accounting or separate counters for reused prompt prefixes—often branded “prompt caching”, “context caching”, or cheaper repeated blocks. Policies shift frequently: this article covers measurement methodology, not live dollar math. Always open the vendor pricing page plus their caching guide for the model you call.
What “cache hit” means upstream
Roughly: a deterministic prefix (long system instructions, tool manifests, static RAG bundle) that the vendor already ingested recently is billed differently from “new” prompt text in the current turn. This is not your HTTP cache—it lives inside their infra.
Common false negatives (no hit):
- You change one byte in the supposedly static system block (timestamps, random UUIDs).
- Identical chunks appear in a different order.
- You call a model or route where caching is unsupported—verify the matrix.
Stable prompt structure
- Push the largest reusable corpus into system (or a fixed prefix) from immutable templates.
- Place user-specific variables (question text, uploaded doc digest) after the stable block—many systems compare prefixes.
- Avoid dynamic strings in system (
new Date()unless required).
Measurement protocol (A/B)
- Baseline: N real requests (e.g. 200) with minimal reuse (or a tiny system prompt).
- Cached: same N requests after warming with at least one identical call if the vendor requires priming.
- Read
usagefrom responses (or billing exports): total input tokens, optional split likecached_prompt_tokens, and estimated cost using current price sheets.
Compare mean and p95 cost per request—retries in the middle can skew averages.
OpenAI: operational numbers (official cookbook + API guides)
OpenAI’s Prompt Caching 201 cookbook captures mechanics you can rely on in design reviews:
- Hits require byte-identical prefixes recently seen upstream—system text, tool definitions, even structured-output schemas participate, not just prose.
- Practical threshold called out in the cookbook: prompts around ≥ 1024 tokens, with cache granularity of 128-token chunks on the prefix (reconfirm when docs change).
- Responses expose
usage.prompt_tokens_details.cached_tokensfor Chat/Responses—never guess from string length. - Optional
prompt_cache_key(when enabled for your workload) improves colocation for traffic sharing huge static prefixes—read the live Prompt caching guide plus Prompt Caching 201. - Model-specific discounts move quickly (cookbook tables show examples from ~50% off on cached
gpt-4oinputs up to ~90% on newergpt-5.xfamilies)—revalidate against Pricing before quoting dollars.
Engineering tip straight from the cookbook: move timestamps/debug strings into request metadata so they do not poison the prompt prefix.
Anthropic: cache_control
Claude documents prompt caching with explicit breakpoints (cache_control: { type: "ephemeral" }) or automatic policies for long chats—TTL and $/token differ from OpenAI, so don’t reuse pricing math. Review ZDR statements if your contract requires it.
Order-of-magnitude expectations
Vendor-published savings targets high-repeat workloads; if most tokens are unique user questions, do not expect large wins. Measure anyway.
Privacy and invalidation
Sensitive data inside a shared cached prefix can violate compliance in multi-tenant setups—confirm per-tenant isolation and retention statements.
Summary
Freeze the prefix, push variability to the tail, measure before/after using real token usage fields from API responses. Marketing percentages never replace usage.prompt_tokens_details (or equivalent) in your logs.