2026-06-07

Prompt caching: structuring prompts and measuring savings

How to package stable system/user prompts where providers support caching, and how to estimate token savings.

Several vendors (OpenAI, Anthropic, Google, …) expose discounted accounting or separate counters for reused prompt prefixes—often branded “prompt caching”, “context caching”, or cheaper repeated blocks. Policies shift frequently: this article covers measurement methodology, not live dollar math. Always open the vendor pricing page plus their caching guide for the model you call.

What “cache hit” means upstream

Roughly: a deterministic prefix (long system instructions, tool manifests, static RAG bundle) that the vendor already ingested recently is billed differently from “new” prompt text in the current turn. This is not your HTTP cache—it lives inside their infra.

Common false negatives (no hit):

You change one byte in the supposedly static system block (timestamps, random UUIDs).
Identical chunks appear in a different order.
You call a model or route where caching is unsupported—verify the matrix.

Stable prompt structure

Push the largest reusable corpus into system (or a fixed prefix) from immutable templates.
Place user-specific variables (question text, uploaded doc digest) after the stable block—many systems compare prefixes.
Avoid dynamic strings in system (new Date() unless required).

Measurement protocol (A/B)

Baseline: N real requests (e.g. 200) with minimal reuse (or a tiny system prompt).
Cached: same N requests after warming with at least one identical call if the vendor requires priming.
Read usage from responses (or billing exports): total input tokens, optional split like cached_prompt_tokens, and estimated cost using current price sheets.

Compare mean and p95 cost per request—retries in the middle can skew averages.

OpenAI: operational numbers (official cookbook + API guides)

OpenAI’s Prompt Caching 201 cookbook captures mechanics you can rely on in design reviews:

Hits require byte-identical prefixes recently seen upstream—system text, tool definitions, even structured-output schemas participate, not just prose.
Practical threshold called out in the cookbook: prompts around ≥ 1024 tokens, with cache granularity of 128-token chunks on the prefix (reconfirm when docs change).
Responses expose usage.prompt_tokens_details.cached_tokens for Chat/Responses—never guess from string length.
Optional prompt_cache_key (when enabled for your workload) improves colocation for traffic sharing huge static prefixes—read the live Prompt caching guide plus Prompt Caching 201.
Model-specific discounts move quickly (cookbook tables show examples from ~50% off on cached gpt-4o inputs up to ~90% on newer gpt-5.x families)—revalidate against Pricing before quoting dollars.

Engineering tip straight from the cookbook: move timestamps/debug strings into request metadata so they do not poison the prompt prefix.

Anthropic: `cache_control`

Claude documents prompt caching with explicit breakpoints (cache_control: { type: "ephemeral" }) or automatic policies for long chats—TTL and $/token differ from OpenAI, so don’t reuse pricing math. Review ZDR statements if your contract requires it.

Prompt caching: structuring prompts and measuring savings

What “cache hit” means upstream

Stable prompt structure

Measurement protocol (A/B)

OpenAI: operational numbers (official cookbook + API guides)

Anthropic: `cache_control`

Order-of-magnitude expectations

Privacy and invalidation

Summary

Want to ship ideas like these into your product?