Anthropic prompt caching: cut Claude API input costs ~90%

If you send the same large context to Claude over and over — a long system prompt, a codebase, a document, a big tool schema — you're paying full input price for those tokens every single call. Prompt caching fixes that: cache the repeated prefix once, and subsequent reads cost a fraction. Here's exactly how to turn it on, when each TTL pays off, and the mistakes that silently stop the cache from hitting.

The 30-second answer

How to enable it

Mark the block you want cached with a cache_control breakpoint. Everything before and including that breakpoint becomes the cached prefix, so put your stable, reused content (system prompt, document, tool definitions) at the top:

import anthropic
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_CONTEXT,           # e.g. a big doc or instructions
            "cache_control": {"type": "ephemeral"}  # <-- cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Question about the document..."}],
)
print(resp.usage)  # check cache_creation_input_tokens vs cache_read_input_tokens

On the first call you'll see cache_creation_input_tokens populated (you paid the write premium). On later calls within the TTL, you'll see cache_read_input_tokens instead — those are the ~90%-off reads. If reads stay at zero, the cache isn't hitting (see troubleshooting below).

For the 1-hour cache

"cache_control": {"type": "ephemeral", "ttl": "1h"}

The exact economics (so you know if it's worth it)

Token typePrice vs. base input
5-minute cache write1.25×
1-hour cache write
Cache read (hit)0.1×

The mental model: you pay a one-time premium to write the cache, then bank a ~90% discount on every read while it's warm. So caching wins when the same prefix is reused several times inside the TTL window. A 50K-token system prompt reused across a chat session, an agent loop sharing the same instructions, or a doc you ask many questions about — all strong wins. A prefix used once and never again — net loss (you paid the write premium for nothing). These multipliers also stack with other modifiers like the Batch API discount.

Which TTL to pick

Why your cache isn't hitting (the usual suspects)

Confirm hits by reading usage.cache_read_input_tokens on the response — that's ground truth, not a guess.

FAQ

Does caching change the model's output? No — it only changes how input tokens are billed and processed. Same model, same response quality.

Is cached data shared across my org? No — caching is isolated. As of February 2026 it uses workspace-level isolation, so caches don't leak across workspaces.

Can I combine it with the Batch API? Yes — the cache multipliers stack with the Batch API discount.


Related

Last updated May 27, 2026. Pricing multipliers, TTLs, and minimums verified against Anthropic's prompt-caching documentation. Anthropic may change these — confirm current values before relying on them in production.