Anthropic token counting API: Python guide

Anthropic provides a count_tokens() method that returns the input token count for a request before you send it. Unlike OpenAI (where you can use the offline tiktoken library), Anthropic's count is a real API call: it runs server-side and accounts for message formatting, system prompts, and tool definitions. Anthropic describes the result as an estimate that can differ slightly from the tokens billed on the actual request. Use it to check context limits, estimate costs, and truncate inputs before they exceed the model's context window (200K tokens by default).

The 30-second answer

Basic token count

import anthropic

client = anthropic.Anthropic()

messages = [
    {"role": "user", "content": "Explain transformer architecture in simple terms."}
]

count_response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages
)

print(f"Input tokens: {count_response.input_tokens}")  # e.g. 19

The count_tokens() call does not generate a response — it only counts. It's a lightweight round-trip to the API. For long documents or complex multi-turn conversations, use it before the actual messages.create() call to verify you're within limits.
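Since count_tokens() takes only the input side of a request, one way to keep the pre-flight count and the real call in sync is to build the messages.create() arguments once and filter them. This is a sketch: the helper count_kwargs_from and the COUNT_TOKENS_PARAMS set are hypothetical, and the set reflects our understanding that count_tokens() accepts model, messages, system, tools, tool_choice, and thinking but not generation settings such as max_tokens or temperature. Check the SDK reference for the current signature.

```python
from typing import Any

# Assumption: the input-side parameters that messages.count_tokens() accepts.
# Generation settings (max_tokens, temperature, stop_sequences, ...) are excluded.
COUNT_TOKENS_PARAMS = {"model", "messages", "system", "tools", "tool_choice", "thinking"}


def count_kwargs_from(create_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: keep only the arguments count_tokens() understands."""
    return {k: v for k, v in create_kwargs.items() if k in COUNT_TOKENS_PARAMS}


# Build the request once...
create_kwargs = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "temperature": 0.2,
    "system": "You are a concise assistant.",
    "messages": [{"role": "user", "content": "Explain transformer architecture in simple terms."}],
}

# ...then reuse it for the pre-flight count and the real call, e.g.:
#   count = client.messages.count_tokens(**count_kwargs_from(create_kwargs))
#   response = client.messages.create(**create_kwargs)
print(sorted(count_kwargs_from(create_kwargs)))  # ['messages', 'model', 'system']
```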

Counting with system prompt and tools

Pass the same parameters you'll use in the actual request. System prompts and tool definitions both consume input tokens:

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
]

count_response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful weather assistant. Always provide temperature in the requested unit.",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools
)

print(f"Input tokens (with system + tools): {count_response.input_tokens}")

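Tool definitions can be surprisingly token-heavy: each tool adds its name, description, and input schema as input tokens, and Anthropic also inserts a tool-use system prompt whenever tools are present. If you have many tools, count the full request to learn the fixed cost per call before deciding whether to cache the tool definitions with prompt caching.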
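To see where those tokens go, you can count the same request with and without tools and take the difference. The sketch below is illustrative only: tool_overhead is a hypothetical helper, and it takes the counter as a parameter (in real use, something like lambda **kw: client.messages.count_tokens(**kw).input_tokens) so that a made-up stand-in can replace the API here and the example runs offline.

```python
from typing import Any, Callable

# count_fn: any callable that accepts count_tokens() keyword arguments and
# returns an int, e.g. lambda **kw: client.messages.count_tokens(**kw).input_tokens
CountFn = Callable[..., int]


def tool_overhead(count_fn: CountFn, base_kwargs: dict[str, Any],
                  tools: list[dict]) -> dict[str, int]:
    """Hypothetical helper: input tokens added by all tools, and by each tool alone."""
    baseline = count_fn(**base_kwargs)
    report = {"_all_tools": count_fn(**{**base_kwargs, "tools": tools}) - baseline}
    for tool in tools:
        # Each single-tool figure also includes any fixed per-request tool
        # overhead, so the per-tool numbers can sum to more than "_all_tools".
        report[tool["name"]] = count_fn(**{**base_kwargs, "tools": [tool]}) - baseline
    return report


# Offline stand-in for the API, with made-up numbers: 10 base tokens,
# 100 fixed tokens whenever tools are present, 20 tokens per tool.
def fake_count(**kwargs: Any) -> int:
    tools = kwargs.get("tools") or []
    return 10 + (100 + 20 * len(tools) if tools else 0)


base = {"model": "claude-sonnet-4-5",
        "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}]}
demo_tools = [{"name": "get_weather"}, {"name": "get_time"}]  # stand-ins, not full tool schemas
print(tool_overhead(fake_count, base, demo_tools))
# {'_all_tools': 140, 'get_weather': 120, 'get_time': 120}
```

Diffing against a baseline isolates the tool cost from the messages and system prompt, which is the number that repeats on every call.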

Context limit check before sending

MODEL_CONTEXT_LIMITS = {
    "claude-opus-4-6": 200_000,
    "claude-sonnet-4-6": 200_000,
    "claude-sonnet-4-5": 200_000,
    "claude-haiku-4-5": 200_000,
}

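def fits_context(messages: list[dict], model: str, system: str | None = None,
                 tools: list[dict] | None = None,
                 max_output_tokens: int = 4096) -> bool:
    """Check whether input + reserved output fits the model's context window."""
    # Count with the same system prompt and tools you will send to messages.create()
    kwargs = {"model": model, "messages": messages}
    if system:
        kwargs["system"] = system
    if tools:
        kwargs["tools"] = tools

    count = client.messages.count_tokens(**kwargs)
    limit = MODEL_CONTEXT_LIMITS.get(model, 200_000)  # fall back to the default window

    # max_tokens is reserved from the same window, so add it to the input count
    total = count.input_tokens + max_output_tokens
    if total > limit:
        print(f"Over limit: {count.input_tokens} input + {max_output_tokens} output "
              f"= {total} > {limit}")
        return False
    return True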

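All models in the table above have a 200K-token context window by default (some models offer a larger window as a beta option). The limit covers input and output combined: the API returns a validation error when input tokens plus max_tokens exceed the window, so with max_tokens=8192 you have 191,808 tokens left for input.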
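Trimming a conversation to fit that input budget can be sketched as a loop that drops the oldest user/assistant pair and re-counts. input_budget and trim_oldest_turns are hypothetical helpers; the sketch assumes the history alternates user/assistant and ends with a user turn, and it takes the counter as a parameter so a word-count stand-in can replace the API in this offline example.

```python
from typing import Any, Callable

CountFn = Callable[..., int]


def input_budget(context_limit: int, max_tokens: int) -> int:
    """Tokens left for input once max_tokens is reserved from the same window."""
    return context_limit - max_tokens


def trim_oldest_turns(count_fn: CountFn, messages: list[dict], budget: int,
                      **count_kwargs: Any) -> list[dict]:
    """Hypothetical helper: drop the oldest user/assistant pair until the count fits.

    Assumes messages alternate user, assistant, user, ... and end with a user
    turn, so removing two messages from the front keeps "user" first.
    """
    trimmed = list(messages)
    while count_fn(messages=trimmed, **count_kwargs) > budget:
        if len(trimmed) <= 1:
            raise ValueError("The latest message alone exceeds the input budget")
        trimmed = trimmed[2:]
    return trimmed


print(input_budget(200_000, 8192))  # 191808


# Offline stand-in: "tokens" = whitespace-separated words. In real use, pass
# lambda **kw: client.messages.count_tokens(**kw).input_tokens plus model=...
def fake_count(messages: list[dict], **_: Any) -> int:
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "user", "content": "one two three four"},
    {"role": "assistant", "content": "five six seven"},
    {"role": "user", "content": "eight nine"},
]
print(trim_oldest_turns(fake_count, history, budget=5))
# [{'role': 'user', 'content': 'eight nine'}]
```

In real use each loop iteration is one count_tokens() call, so for very long histories you may prefer to drop several turns per step before re-counting.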

Cost estimation

# May 2026 list prices (USD per MTok, standard rates; confirm on Anthropic's pricing page)
PRICING = {
    "claude-opus-4-6":    {"input":  5.00, "output": 25.00},
    "claude-sonnet-4-6":  {"input":  3.00, "output": 15.00},
    "claude-sonnet-4-5":  {"input":  3.00, "output": 15.00},
    "claude-haiku-4-5":   {"input":  1.00, "output":  5.00},
}

def estimate_cost(messages: list[dict], model: str, system: str | None = None,
                  expected_output_tokens: int = 500) -> float:
    """Estimate cost in USD before sending."""
    kwargs = {"model": model, "messages": messages}
    if system:
        kwargs["system"] = system

    count = client.messages.count_tokens(**kwargs)
    p = PRICING.get(model, PRICING["claude-sonnet-4-5"])

    input_cost = (count.input_tokens / 1_000_000) * p["input"]
    output_cost = (expected_output_tokens / 1_000_000) * p["output"]
    total = input_cost + output_cost

    print(f"Input: {count.input_tokens} tokens (${input_cost:.5f})")
    print(f"Output: ~{expected_output_tokens} tokens (${output_cost:.5f})")
    print(f"Estimated total: ${total:.5f}")
    return total

estimate_cost(
    messages=[{"role": "user", "content": "Summarize this 50-page document..."}],
    model="claude-sonnet-4-5",
    expected_output_tokens=1000
)
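Keeping the arithmetic in a pure function makes it easy to unit-test without network calls. The sketch below (cost_usd is a hypothetical name) applies the same per-MTok formula as estimate_cost(), using the Sonnet 4.5 rates from the table above.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_per_mtok: float, output_per_mtok: float) -> float:
    """Cost in USD given token counts and per-million-token (MTok) prices."""
    return (input_tokens * input_per_mtok + output_tokens * output_per_mtok) / 1_000_000


# Sonnet 4.5 rates: $3 input, $15 output per MTok.
# 50,000 input tokens -> $0.15; 1,000 output tokens -> $0.015
print(f"${cost_usd(50_000, 1_000, 3.00, 15.00):.3f}")  # $0.165
```

With this split, estimate_cost() only has to supply count.input_tokens from the API, and the pricing logic can be tested on its own.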

Tracking actual usage

After a successful messages.create() call, the actual usage is in response.usage:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

print(f"Actual input tokens: {response.usage.input_tokens}")
print(f"Actual output tokens: {response.usage.output_tokens}")

# If prompt caching is active:
# response.usage.cache_read_input_tokens  — tokens served from cache
# response.usage.cache_creation_input_tokens  — tokens written to cache

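Use response.usage for billing reconciliation; it's the authoritative count. Use count_tokens() for pre-flight checks and cost estimation. Expect the two to be close but not always identical, since Anthropic documents the count as an estimate. With prompt caching active, usage.input_tokens covers only the uncached portion of the prompt, so compare count_tokens() against input_tokens + cache_creation_input_tokens + cache_read_input_tokens.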
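To turn response.usage into a dollar figure when prompt caching is involved, the cache fields need their own rates. This sketch assumes Anthropic's standard multipliers at the time of writing: 5-minute cache writes at 1.25x the base input price and cache reads at 0.1x (1-hour cache writes cost more and aren't modeled). actual_cost_usd is a hypothetical helper, and a SimpleNamespace stands in for the real usage object so the example runs offline.

```python
from types import SimpleNamespace


def actual_cost_usd(usage, input_per_mtok: float, output_per_mtok: float,
                    cache_write_multiplier: float = 1.25,
                    cache_read_multiplier: float = 0.10) -> float:
    """Hypothetical helper: bill-style cost from a response.usage-like object.

    Assumes 5-minute cache writes (1.25x base input) and cache reads (0.1x).
    """
    # Cache fields can be None (or absent) when prompt caching isn't used
    cache_write = getattr(usage, "cache_creation_input_tokens", None) or 0
    cache_read = getattr(usage, "cache_read_input_tokens", None) or 0
    input_cost = (usage.input_tokens
                  + cache_write * cache_write_multiplier
                  + cache_read * cache_read_multiplier) * input_per_mtok
    output_cost = usage.output_tokens * output_per_mtok
    return (input_cost + output_cost) / 1_000_000


# Stand-in for response.usage: 1,000 uncached input tokens, 20,000 read from cache
usage = SimpleNamespace(input_tokens=1_000, output_tokens=500,
                        cache_creation_input_tokens=0,
                        cache_read_input_tokens=20_000)
print(f"${actual_cost_usd(usage, 3.00, 15.00):.4f}")  # $0.0165
```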

count_tokens vs tiktoken (OpenAI comparison)

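| Aspect | Anthropic count_tokens() | OpenAI tiktoken |
|---|---|---|
| How it works | Live API call | Local library (offline) |
| Accuracy | Server-side count, documented as an estimate | Exact for raw text; chat formatting overhead must be estimated |
| Speed | One network round-trip (~100-300 ms) | Milliseconds (local) |
| Includes system/tool overhead | Yes, natively | Requires a manual formula |
| Cost | Free, but subject to requests-per-minute rate limits | Free (local) |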

There is no official offline tokenizer for current Claude models: Anthropic hasn't published the tokenizer they use, so the count_tokens() API is the recommended approach. For high-throughput pre-flight checks where latency matters, cache the token count for repeated identical requests.
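A minimal in-process cache for repeated identical requests could look like the following. CachedTokenCounter is a hypothetical helper, not part of the SDK; it keys on a sorted JSON encoding of the keyword arguments, so it assumes the request is built from plain dicts, lists, and strings.

```python
import json
from typing import Any, Callable


class CachedTokenCounter:
    """Hypothetical helper: memoize counts for identical requests.

    count_fn takes count_tokens() keyword arguments and returns an int, e.g.
    lambda **kw: client.messages.count_tokens(**kw).input_tokens
    """

    def __init__(self, count_fn: Callable[..., int]) -> None:
        self._count_fn = count_fn
        self._cache: dict[str, int] = {}

    def count(self, **kwargs: Any) -> int:
        # Dicts aren't hashable, so key on a canonical JSON encoding of the request
        key = json.dumps(kwargs, sort_keys=True, ensure_ascii=False)
        if key not in self._cache:
            self._cache[key] = self._count_fn(**kwargs)
        return self._cache[key]


# Offline stand-in that records how often the "API" is actually called
calls = []


def fake_count(**kwargs: Any) -> int:
    calls.append(kwargs)
    return 42


counter = CachedTokenCounter(fake_count)
req = {"model": "claude-sonnet-4-5", "messages": [{"role": "user", "content": "Hi"}]}
counter.count(**req)
counter.count(**req)
print(len(calls))  # 1
```

The cache is unbounded; a long-running service would want an eviction policy. Because the key is the full request, any change to the system prompt, tools, or model produces a fresh count.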

FAQ

Does count_tokens include system prompt tokens? Yes — pass all the same parameters you plan to use in messages.create() for an accurate count.

Is count_tokens free? Yes, Anthropic doesn't charge for token counting calls, but they are subject to their own requests-per-minute rate limits based on your usage tier.

count_tokens vs tiktoken? count_tokens is a live API call that returns Claude's own count (documented as an estimate). tiktoken is a local library built for OpenAI's tokenizers, so on Claude text it only approximates. For Claude, count_tokens is the only reliable option since there's no public offline tokenizer for current models.

Last updated May 28, 2026. Code examples verified against the Anthropic Python SDK and token counting documentation. API behaviour and pricing may change — confirm against the official docs before deploying to production.