Anthropic token counting API: Python guide
Anthropic provides a count_tokens() method that returns the input token count for a request before you send it. Unlike OpenAI (where you use the offline tiktoken library), Anthropic's count is a real API call: it runs server-side, includes formatting overhead, and accounts for system prompts and tool definitions. Anthropic documents the result as an estimate, so the count billed on the real request can differ by a small amount. Use it to check context limits, estimate costs, and truncate inputs before they hit the 200K context limit.
The 30-second answer
- Method: client.messages.count_tokens(model=..., messages=[...], system=...)
- Returns: a MessageTokensCount object (in the Python SDK) with an input_tokens field, counted server-side.
- Pass the same params you'll use in messages.create() for an accurate count (system, tools, etc.).
- No output tokens: count_tokens() only counts input tokens, since no response is generated.
Basic token count
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = [
    {"role": "user", "content": "Explain transformer architecture in simple terms."}
]

count_response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages,
)
print(f"Input tokens: {count_response.input_tokens}")  # e.g. 19
The count_tokens() call does not generate a response — it only counts. It's a lightweight round-trip to the API. For long documents or complex multi-turn conversations, use it before the actual messages.create() call to verify you're within limits.
Counting with system prompt and tools
Pass the same parameters you'll use in the actual request. System prompts and tool definitions both consume input tokens:
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

count_response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful weather assistant. Always provide temperature in the requested unit.",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
)
print(f"Input tokens (with system + tools): {count_response.input_tokens}")
Tool definitions can be surprisingly token-heavy: each tool adds its name, description, and schema as input tokens. If you have many tools, count the full request once to learn the fixed cost they add to every call, for example before deciding whether prompt caching is worth setting up.
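To see what a tool list costs in isolation, one option is to count the same request with and without `tools` and take the difference. The sketch below assumes a hypothetical helper, `tool_token_overhead`, that accepts any callable with the same keyword interface as `client.messages.count_tokens` and a result exposing `input_tokens`. The `fake_count` stand-in exists only so the example runs offline; its numbers are made up and are not real token counts.

```python
from types import SimpleNamespace
from typing import Any, Callable


def tool_token_overhead(count_fn: Callable[..., Any], tools: list, **request: Any) -> int:
    """Return how many input tokens `tools` adds to an otherwise identical request.

    `count_fn` is assumed to take the same keyword arguments as
    client.messages.count_tokens and to return an object with `.input_tokens`.
    """
    without_tools = count_fn(**request).input_tokens
    with_tools = count_fn(tools=tools, **request).input_tokens
    return with_tools - without_tools


# Offline stand-in for the API call (hypothetical numbers):
# 10 tokens for the request plus 50 per tool definition.
def fake_count(**kwargs: Any) -> SimpleNamespace:
    return SimpleNamespace(input_tokens=10 + 50 * len(kwargs.get("tools", [])))


overhead = tool_token_overhead(
    fake_count,
    tools=[{"name": "get_weather"}, {"name": "get_time"}],
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
)
print(f"Tool overhead: {overhead} tokens")
```

With a real client you would pass `client.messages.count_tokens` as `count_fn`, at the price of two count calls per measurement.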
Context limit check before sending
MODEL_CONTEXT_LIMITS = {
    "claude-opus-4-6": 200_000,
    "claude-sonnet-4-6": 200_000,
    "claude-sonnet-4-5": 200_000,
    "claude-haiku-4-5": 200_000,
}

def fits_context(messages: list[dict], model: str, system: str | None = None,
                 max_output_tokens: int = 4096) -> bool:
    """Check if the request fits within the model's context window."""
    kwargs = {"model": model, "messages": messages}
    if system:
        kwargs["system"] = system
    count = client.messages.count_tokens(**kwargs)
    limit = MODEL_CONTEXT_LIMITS.get(model, 200_000)
    # Input and output share one window, so reserve room for the reply.
    total = count.input_tokens + max_output_tokens
    if total > limit:
        print(f"Over limit: {count.input_tokens} input + {max_output_tokens} output "
              f"= {total} > {limit}")
        return False
    return True
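The models listed above have a 200K-token standard context window. That limit covers input and output combined: with max_tokens=8192, you have 200,000 - 8,192 = 191,808 tokens left for input. Some models also offer a larger window as an opt-in beta, so confirm the limit for the model and settings you actually use.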
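The same budget arithmetic gives a simple way to truncate a conversation: compute the input budget as the context limit minus `max_tokens`, then drop the oldest turns until the counted input fits. The following is a minimal sketch under stated assumptions, not a definitive strategy. `trim_to_budget` and `fake_count` are hypothetical names, the counter is injected so the example runs offline, and the fake charges a made-up 100 tokens per message.

```python
from types import SimpleNamespace
from typing import Any, Callable


def trim_to_budget(count_fn: Callable[..., Any], messages: list, model: str,
                   context_limit: int = 200_000, max_tokens: int = 8192) -> list:
    """Drop the oldest messages until the counted input fits the input budget."""
    budget = context_limit - max_tokens  # defaults: 200_000 - 8_192 = 191_808
    trimmed = list(messages)
    while trimmed:
        # One count call per attempt; with a real client this is an API round-trip.
        if count_fn(model=model, messages=trimmed).input_tokens <= budget:
            return trimmed
        trimmed = trimmed[1:]  # discard the oldest turn
        # Keep the conversation starting on a user turn.
        while trimmed and trimmed[0]["role"] != "user":
            trimmed = trimmed[1:]
    raise ValueError("No messages fit within the input budget")


# Offline stand-in for the API call (hypothetical: 100 tokens per message).
def fake_count(**kwargs: Any) -> SimpleNamespace:
    return SimpleNamespace(input_tokens=100 * len(kwargs["messages"]))


history = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]
kept = trim_to_budget(fake_count, history, "claude-sonnet-4-5",
                      context_limit=350, max_tokens=50)
print([m["content"] for m in kept])
```

Dropping whole turns is the crudest option; summarizing old turns or truncating a long document inside one message are alternatives this sketch does not cover.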
Cost estimation
# May 2026 pricing (USD per MTok); verify against the current pricing page
PRICING = {
    "claude-opus-4-6": {"input": 5.00, "output": 25.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
}
def estimate_cost(messages: list[dict], model: str, system: str | None = None,
                  expected_output_tokens: int = 500) -> float:
    """Estimate cost in USD before sending."""
    kwargs = {"model": model, "messages": messages}
    if system:
        kwargs["system"] = system
    count = client.messages.count_tokens(**kwargs)
    # Unknown models fall back to Sonnet pricing.
    p = PRICING.get(model, PRICING["claude-sonnet-4-5"])
    input_cost = (count.input_tokens / 1_000_000) * p["input"]
    output_cost = (expected_output_tokens / 1_000_000) * p["output"]
    total = input_cost + output_cost
    print(f"Input: {count.input_tokens} tokens (${input_cost:.5f})")
    print(f"Output: ~{expected_output_tokens} tokens (${output_cost:.5f})")
    print(f"Estimated total: ${total:.5f}")
    return total

estimate_cost(
    messages=[{"role": "user", "content": "Summarize this 50-page document..."}],
    model="claude-sonnet-4-5",
    expected_output_tokens=1000,
)
Tracking actual usage
After a successful messages.create() call, the actual usage is in response.usage:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
print(f"Actual input tokens: {response.usage.input_tokens}")
print(f"Actual output tokens: {response.usage.output_tokens}")

# If prompt caching is active:
# response.usage.cache_read_input_tokens: tokens served from cache
# response.usage.cache_creation_input_tokens: tokens written to cache
Use response.usage for billing reconciliation — it's the authoritative count. Use count_tokens() for pre-flight checks and cost estimation. The two should match within a few tokens for typical requests.
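One way to check how closely the pre-flight count tracks actual usage is to compare it with the total input reported in `response.usage`. The sketch below assumes that, when prompt caching is active, total input is the sum of `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens`. `input_token_drift` is a hypothetical helper, and the `SimpleNamespace` object stands in for a real `usage` object with made-up numbers.

```python
from types import SimpleNamespace
from typing import Any


def input_token_drift(preflight_tokens: int, usage: Any) -> int:
    """Return actual total input tokens minus the pre-flight count."""
    actual = usage.input_tokens
    # The cache fields may be missing or None when caching is not in use.
    for field in ("cache_read_input_tokens", "cache_creation_input_tokens"):
        actual += getattr(usage, field, None) or 0
    return actual - preflight_tokens


# Stand-in for response.usage (hypothetical numbers).
usage = SimpleNamespace(
    input_tokens=12,
    cache_read_input_tokens=1000,
    cache_creation_input_tokens=None,
)
drift = input_token_drift(1010, usage)
print(f"Drift: {drift} tokens")
```

Logging this difference over time shows whether "within a few tokens" holds for your own request shapes.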
count_tokens vs tiktoken (OpenAI comparison)
| Aspect | Anthropic count_tokens() | OpenAI tiktoken |
|---|---|---|
| How it works | Live API call | Local library (offline) |
| Accuracy | Server-side count of the full request (documented as an estimate) | Exact for raw text; chat and tool overhead is approximate |
| Speed | ~100-300ms round-trip | Milliseconds (local) |
| Includes system/tool overhead | Yes, natively | Requires manual formula |
| Cost | Free (no charge for count calls) | Free (local) |
There is no official offline tokenizer for Claude — Anthropic uses a custom BPE tokenizer that isn't publicly released. The count_tokens() API is the recommended approach. For high-throughput pre-flight checks where latency matters, cache the token count for repeated identical prompts.
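Caching counts for repeated identical prompts can be sketched as a small memoizing wrapper keyed on a hash of the request parameters. `CachedTokenCounter` is a hypothetical helper, not part of the SDK. It assumes the request is JSON-serializable and that `count_fn` behaves like `client.messages.count_tokens`; the fake counter keeps the example runnable offline.

```python
import hashlib
import json
from types import SimpleNamespace
from typing import Any, Callable


class CachedTokenCounter:
    """Memoize token counts for identical requests."""

    def __init__(self, count_fn: Callable[..., Any]) -> None:
        self._count_fn = count_fn
        self._cache = {}  # request hash -> input token count

    def count(self, **request: Any) -> int:
        # Canonical JSON so that keyword order does not change the cache key.
        payload = json.dumps(request, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._count_fn(**request).input_tokens
        return self._cache[key]


calls = []  # records every call that reaches the underlying counter

# Offline stand-in for the API call (hypothetical fixed count).
def fake_count(**kwargs: Any) -> SimpleNamespace:
    calls.append(kwargs)
    return SimpleNamespace(input_tokens=19)


counter = CachedTokenCounter(fake_count)
request = {
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Hello"}],
}
first = counter.count(**request)
second = counter.count(**request)  # served from the cache, no second call
print(first, second, len(calls))
```

The cache is unbounded and never expires, which is fine for a fixed set of prompts; a long-running service would probably want a size limit and a key that also reflects anything else affecting the count.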
FAQ
Does count_tokens include system prompt tokens? Yes — pass all the same parameters you plan to use in messages.create() for an accurate count.
Is count_tokens free? Yes — Anthropic doesn't charge for token counting calls.
count_tokens vs tiktoken? count_tokens is a live API call that counts the full request server-side. tiktoken is a local library built for OpenAI's tokenizers, so it can only approximate Claude token counts. For Claude, count_tokens is the only accurate option since there's no public offline tokenizer.
Last updated May 28, 2026. Code examples verified against the Anthropic Python SDK and token counting documentation. API behaviour and pricing may change — confirm against the official docs before deploying to production.