Anthropic prompt caching: cut Claude API input costs ~90%
If you send the same large context to Claude over and over — a long system prompt, a codebase, a document, a big tool schema — you're paying full input price for those tokens every single call. Prompt caching fixes that: cache the repeated prefix once, and subsequent reads cost a fraction. Here's exactly how to turn it on, when each TTL pays off, and the mistakes that silently stop the cache from hitting.
The 30-second answer
- What it does: caches a repeated prompt prefix so re-reads cost 0.1× base input price (~90% off the cached portion).
- The cost trade: the first write costs more — 1.25× base input for the 5-minute cache, 2× for the 1-hour cache. You profit once the content is reused enough to beat that one write.
- How: add "cache_control": {"type": "ephemeral"} to the end of the content block you want cached (put stable, repeated content first in the prompt).
- Default TTL is 5 minutes, refreshed on each read; add "ttl": "1h" for the 1-hour cache.
How to enable it
Mark the block you want cached with a cache_control breakpoint. Everything before and including that breakpoint becomes the cached prefix, so put your stable, reused content (system prompt, document, tool definitions) at the top:
import anthropic

client = anthropic.Anthropic()

# Placeholder: substitute your large, unchanging text (a big doc or instructions).
LONG_STABLE_CONTEXT = "..."

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # <-- cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Question about the document..."}],
)

# Compare cache_creation_input_tokens (write) with cache_read_input_tokens (hit).
print(resp.usage)
On the first call you'll see cache_creation_input_tokens populated (you paid the write premium). On later calls within the TTL, you'll see cache_read_input_tokens instead — those are the ~90%-off reads. If reads stay at zero, the cache isn't hitting (see troubleshooting below).
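To make that check routine, a small helper can classify each response from its usage counters. This is a minimal sketch: `cache_status` is a hypothetical name, not part of the SDK, and it assumes the usage object exposes the two counters named above as attributes (possibly missing or None).

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a response as a cache 'read', 'write', or 'none'.

    `usage` is any object carrying the two counters named above;
    missing or None counters are treated as zero.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "read"    # at least part of the prefix came from cache
    if written > 0:
        return "write"   # cache created on this call (write premium paid)
    return "none"        # caching did not apply to this request


# Stand-in objects for illustration; in real use, pass resp.usage.
first_call = SimpleNamespace(cache_creation_input_tokens=50_000, cache_read_input_tokens=0)
later_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=50_000)
print(cache_status(first_call), cache_status(later_call))
```

Logging this value per request makes a run of "write" or "none" results easy to spot.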
For the 1-hour cache, add a ttl field to the cache_control object:
"cache_control": {"type": "ephemeral", "ttl": "1h"}
The exact economics (so you know if it's worth it)
| Token type | Price vs. base input |
|---|---|
| 5-minute cache write | 1.25× |
| 1-hour cache write | 2× |
| Cache read (hit) | 0.1× |
The mental model: you pay a one-time premium to write the cache, then bank a ~90% discount on every read while it's warm. The break-even comes quickly, provided each reuse lands inside the TTL window. With the 5-minute cache, two calls cost 1.25 + 0.1 = 1.35× the prefix's base input price instead of 2×, so caching is ahead from the first re-read. With the 1-hour cache, three calls cost 2 + 0.1 + 0.1 = 2.2× instead of 3×, so it is ahead from the second re-read. A 50K-token system prompt reused across a chat session, an agent loop sharing the same instructions, or a doc you ask many questions about are all strong wins. A prefix used once and never again is a net loss (you paid the write premium for nothing). These multipliers also stack with other modifiers like the Batch API discount.
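The arithmetic behind this mental model can be sketched in a few lines. The multipliers come from the table above; the function names are illustrative, and the sketch assumes the first call writes the cache and every later call hits it (no expiry in between).

```python
# Multipliers relative to base input price, taken from the table above.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
READ_MULTIPLIER = 0.1


def cached_cost(calls: int, ttl: str = "5m") -> float:
    """Cost of `calls` requests sharing one prefix, in units of one uncached pass.

    Assumes the first call writes the cache and every later call hits it.
    """
    if calls < 1:
        return 0.0
    return WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * (calls - 1)


def break_even_calls(ttl: str = "5m") -> int:
    """Smallest number of calls at which caching is cheaper than not caching."""
    calls = 1
    while cached_cost(calls, ttl) >= calls:  # uncached cost is 1.0 per call
        calls += 1
    return calls


for ttl in ("5m", "1h"):
    print(ttl, break_even_calls(ttl), round(cached_cost(10, ttl), 2))
```

For ten calls against the same prefix, the cached portion costs about 2.15 (5-minute) or 2.9 (1-hour) passes instead of 10. Only the cached prefix is discounted; tokens after the breakpoint are still billed at the normal rate.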
Which TTL to pick
- 5-minute (default): continuous workloads — a live chat, an agent actively looping, back-to-back requests. Each read refreshes the 5-minute clock, so steady traffic keeps it warm for free.
- 1-hour: bursty or intermittent reuse — a user who asks a question, thinks for 15 minutes, then asks another against the same document. The 5-minute cache would expire between turns; the 1-hour cache survives. You pay 2× on the write, so only use it when gaps would otherwise blow the 5-minute window.
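To compare the two TTLs on your own traffic, you can replay request timestamps against a toy model of the cache. This is a rough sketch, not Anthropic's billing logic: `simulate` is a hypothetical helper, and it assumes that every hit restarts the TTL clock for both cache types (the text above only states this for the 5-minute cache).

```python
def simulate(request_times_min, ttl_min, write_mult, read_mult=0.1):
    """Replay request timestamps (in minutes) against one cached prefix.

    Returns (writes, reads, cost), with cost in units of one uncached pass
    over the prefix. Assumes every hit resets the TTL clock.
    """
    writes = reads = 0
    expires_at = None  # no cache entry yet
    for t in sorted(request_times_min):
        if expires_at is not None and t < expires_at:
            reads += 1   # cache still warm: discounted read
        else:
            writes += 1  # cold or expired: pay the write premium again
        expires_at = t + ttl_min  # both a write and a hit restart the clock
    cost = writes * write_mult + reads * read_mult
    return writes, reads, cost


steady = [0, 2, 4, 6, 8, 10]  # back-to-back traffic, 2-minute gaps
bursty = [0, 15, 30, 45]      # one question every 15 minutes

for name, times in (("steady", steady), ("bursty", bursty)):
    print(name, "5m:", simulate(times, 5, 1.25), "1h:", simulate(times, 60, 2.0))
```

Under these assumptions the steady workload is cheaper on the 5-minute cache (about 1.75 passes versus 2.5). The bursty workload re-writes on every request with the 5-minute cache (5.0 passes, worse than the 4.0 of no caching), while the 1-hour cache brings it to about 2.3.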
Why your cache isn't hitting (the usual suspects)
- Below the minimum length. There's a minimum cacheable size, e.g. 2048 tokens on Haiku models. A short prefix won't cache even with cache_control set.
- The prefix changed. Caching keys on an exact prefix match. If anything before your breakpoint differs by even one token (a timestamp, a reordered tool, a per-user line at the top), it's a cache miss. Keep volatile content after the cached block.
- The TTL expired. Gaps longer than your TTL (5 min or 1 hr) drop the cache; the next call is a fresh write.
- Wrong ordering. Stable, cacheable content must come first; user-specific/variable content goes last.
Confirm hits by reading usage.cache_read_input_tokens on the response — that's ground truth, not a guess.
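The prefix and ordering pitfalls can also be caught locally, before any request is sent. The sketch below is illustrative only: `build_request` and `prefix_fingerprint` are hypothetical helpers, and the fingerprint covers just the system block, so tool definitions, which also sit before the breakpoint, would need the same treatment.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_request(stable_context: str, question: str) -> dict:
    """Assemble keyword arguments for a Messages API call.

    The stable context sits in the cached system block; the per-request
    timestamp goes in the user turn, after the breakpoint.
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "system": [{
            "type": "text",
            "text": stable_context,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": f"(asked at {now}) {question}",
        }],
    }


def prefix_fingerprint(request: dict) -> str:
    """Hash of the system block; a changed hash means the prefix changed."""
    payload = json.dumps(request["system"], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = build_request("LONG STABLE CONTEXT", "First question?")
b = build_request("LONG STABLE CONTEXT", "Second question?")
print(prefix_fingerprint(a) == prefix_fingerprint(b))
```

If the fingerprint differs between two requests that were meant to share a cache, something volatile has leaked into the prefix. Had the timestamp been interpolated into the system text instead, every request would produce a new hash and a fresh cache write.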
FAQ
Does caching change the model's output? No — it only changes how input tokens are billed and processed. Same model, same response quality.
Is cached data shared across my org? No — caching is isolated. As of February 2026 it uses workspace-level isolation, so caches don't leak across workspaces.
Can I combine it with the Batch API? Yes — the cache multipliers stack with the Batch API discount.
Last updated May 27, 2026. Pricing multipliers, TTLs, and minimums verified against Anthropic's prompt-caching documentation. Anthropic may change these — confirm current values before relying on them in production.