How much does Anthropic prompt caching save?

Cache read tokens cost 0.1x the base input price — about a 90% discount on the cached portion of your prompt. The trade-off is the cache write: a 5-minute cache write costs 1.25x base input, and a 1-hour cache write costs 2x. You come out ahead once the cached content is reused enough to outweigh that one write.

What is the difference between the 5-minute and 1-hour cache?

The default cache TTL is 5 minutes, refreshed each time the cached content is read. The 1-hour TTL ("ttl": "1h") keeps content cached longer for bursty or intermittent workloads, but its write costs 2x base input vs 1.25x for the 5-minute cache.

Why is my Claude prompt cache not hitting?

Common causes: the cached block is below the minimum cacheable length (e.g. 2048 tokens on Haiku models), something before the cache_control breakpoint changed (caching keys on an exact prefix match), or the TTL expired between calls. Cached content must be identical and come first in the prompt.

Anthropic prompt caching: cut Claude API input costs ~90%

If you send the same large context to Claude over and over — a long system prompt, a codebase, a document, a big tool schema — you're paying full input price for those tokens every single call. Prompt caching fixes that: cache the repeated prefix once, and subsequent reads cost a fraction. Here's exactly how to turn it on, when each TTL pays off, and the mistakes that silently stop the cache from hitting.

The 30-second answer

What it does: caches a repeated prompt prefix so re-reads cost 0.1× base input price (~90% off the cached portion).
The cost trade: the first write costs more — 1.25× base input for the 5-minute cache, 2× for the 1-hour cache. You profit once the content is reused enough to beat that one write.
How: add "cache_control": {"type": "ephemeral"} to the end of the content block you want cached (put stable, repeated content first in the prompt).
Default TTL is 5 minutes, refreshed on each read; add "ttl": "1h" for the 1-hour cache.

How to enable it

Mark the block you want cached with a cache_control breakpoint. Everything before and including that breakpoint becomes the cached prefix, so put your stable, reused content (system prompt, document, tool definitions) at the top:

import anthropic
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_CONTEXT,           # e.g. a big doc or instructions
            "cache_control": {"type": "ephemeral"}  # <-- cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Question about the document..."}],
)
print(resp.usage)  # check cache_creation_input_tokens vs cache_read_input_tokens

On the first call you'll see cache_creation_input_tokens populated (you paid the write premium). On later calls within the TTL, you'll see cache_read_input_tokens instead — those are the ~90%-off reads. If reads stay at zero, the cache isn't hitting (see troubleshooting below).

For the 1-hour cache

"cache_control": {"type": "ephemeral", "ttl": "1h"}

The exact economics (so you know if it's worth it)

Token type	Price vs. base input
5-minute cache write	1.25×
1-hour cache write	2×
Cache read (hit)	0.1×

The mental model: you pay a one-time premium to write the cache, then bank a ~90% discount on every read while it's warm. So caching wins when the same prefix is reused several times inside the TTL window. A 50K-token system prompt reused across a chat session, an agent loop sharing the same instructions, or a doc you ask many questions about — all strong wins. A prefix used once and never again — net loss (you paid the write premium for nothing). These multipliers also stack with other modifiers like the Batch API discount.

Which TTL to pick

5-minute (default): continuous workloads — a live chat, an agent actively looping, back-to-back requests. Each read refreshes the 5-minute clock, so steady traffic keeps it warm for free.
1-hour: bursty or intermittent reuse — a user who asks a question, thinks for 15 minutes, then asks another against the same document. The 5-minute cache would expire between turns; the 1-hour cache survives. You pay 2× on the write, so only use it when gaps would otherwise blow the 5-minute window.

Why your cache isn't hitting (the usual suspects)

Below the minimum length. There's a minimum cacheable size — e.g. 2048 tokens on Haiku models. A short prefix won't cache even with cache_control set.
The prefix changed. Caching keys on an exact prefix match. If anything before your breakpoint differs by even one token (a timestamp, a reordered tool, a per-user line at the top), it's a cache miss. Keep volatile content after the cached block.
The TTL expired. Gaps longer than your TTL (5 min or 1 hr) drop the cache; the next call is a fresh write.
Wrong ordering. Stable, cacheable content must come first; user-specific/variable content goes last.

Confirm hits by reading usage.cache_read_input_tokens on the response — that's ground truth, not a guess.

FAQ

Does caching change the model's output? No — it only changes how input tokens are billed and processed. Same model, same response quality.

Is cached data shared across my org? No — caching is isolated. As of February 2026 it uses workspace-level isolation, so caches don't leak across workspaces.

Can I combine it with the Batch API? Yes — the cache multipliers stack with the Batch API discount.

Last updated May 27, 2026. Pricing multipliers, TTLs, and minimums verified against Anthropic's prompt-caching documentation. Anthropic may change these — confirm current values before relying on them in production.