How do I count tokens for Claude before sending an API request?

Use client.messages.count_tokens(). Pass the same model, messages, and system parameters you plan to use in messages.create(). The response has an input_tokens field with the server-side token count; Anthropic documents it as an estimate, so the billed figure can differ by a small amount. It is a lightweight API call: it does not generate a response, it only returns the count.

Does Anthropic's count_tokens include system prompt tokens?

Yes. If you pass a system parameter to count_tokens(), those tokens are included in the input_tokens count. Same for tools — if you include tools in the count_tokens call, the tool definition tokens are counted. Match your count_tokens call to exactly what you'll send in messages.create() for an accurate pre-flight estimate.

What is the difference between Anthropic count_tokens and OpenAI tiktoken?

Anthropic's count_tokens() is a live API call that returns the server-side token count, including message formatting overhead. OpenAI's tiktoken is a local library: it is faster and tokenizes raw text directly, but the per-message overhead of a chat request has to be added with a manual formula, which leaves a small error margin. For Claude, there is no official offline tokenizer, so count_tokens() is the recommended approach for accurate pre-flight counting.

Anthropic token counting API: Python guide

Anthropic provides a count_tokens() method that returns the token count for a request before you send it. Unlike OpenAI (where you use the offline tiktoken library), Anthropic's count is a real API call: it is computed server-side, includes all formatting overhead, and accounts for system prompts and tool definitions. Use it to check context limits, estimate costs, and truncate inputs before they hit the 200K context limit.

The 30-second answer

Method: client.messages.count_tokens(model=..., messages=[...], system=...)
Returns: a response object (MessageTokensCount in the Python SDK) with an input_tokens field holding the server-side count.
Pass the same params you'll use in messages.create() for an accurate count (system, tools, etc.).
No output tokens — count_tokens() only counts input tokens since the response isn't generated.

Basic token count

import anthropic

client = anthropic.Anthropic()

messages = [
    {"role": "user", "content": "Explain transformer architecture in simple terms."}
]

count_response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages
)

print(f"Input tokens: {count_response.input_tokens}")  # e.g. 19

The count_tokens() call does not generate a response — it only counts. It's a lightweight round-trip to the API. For long documents or complex multi-turn conversations, use it before the actual messages.create() call to verify you're within limits.

Counting with system prompt and tools

Pass the same parameters you'll use in the actual request. System prompts and tool definitions both consume input tokens:

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
]

count_response = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    system="You are a helpful weather assistant. Always provide temperature in the requested unit.",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools
)

print(f"Input tokens (with system + tools): {count_response.input_tokens}")

Tool definitions can be surprisingly token-heavy: each tool adds its name, description, and schema as input tokens. If you have many tools, count the full request before you design prompt-caching patterns, so you know the fixed cost that every call carries.
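One way to see that fixed cost is to count the same request twice, once without tools and once with them, and take the difference. The sketch below assumes `client` is an anthropic.Anthropic() instance; `tool_token_overhead` is a hypothetical helper name, not part of the SDK.

```python
from typing import Any, Optional


def tool_token_overhead(client: Any, model: str, messages: list[dict],
                        tools: list[dict], system: Optional[str] = None) -> int:
    """Return how many input tokens the tool definitions add to a request.

    Assumption: `client` is an anthropic.Anthropic() instance, or any object
    exposing the same messages.count_tokens(...) method.
    """
    base_kwargs: dict[str, Any] = {"model": model, "messages": messages}
    if system:
        base_kwargs["system"] = system

    # Count once without tools and once with them. The difference is the
    # per-call cost of carrying the tool definitions.
    without_tools = client.messages.count_tokens(**base_kwargs).input_tokens
    with_tools = client.messages.count_tokens(**base_kwargs, tools=tools).input_tokens
    return with_tools - without_tools

# Usage (makes two count_tokens round-trips):
#   overhead = tool_token_overhead(client, "claude-sonnet-4-5", messages, tools)
```

The result covers everything that enabling tools adds to the request, not only the JSON schemas, so it may be larger than the schemas alone would suggest.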

Context limit check before sending

MODEL_CONTEXT_LIMITS = {
    "claude-opus-4-6": 200_000,
    "claude-sonnet-4-6": 200_000,
    "claude-sonnet-4-5": 200_000,
    "claude-haiku-4-5": 200_000,
}

def fits_context(messages: list[dict], model: str, system: str | None = None,
                 max_output_tokens: int = 4096,
                 tools: list[dict] | None = None) -> bool:
    """Check if the request fits within the model's context window."""
    # Pass only the parameters that are set, mirroring the real request.
    kwargs = {"model": model, "messages": messages}
    if system:
        kwargs["system"] = system
    if tools:
        kwargs["tools"] = tools

    count = client.messages.count_tokens(**kwargs)
    limit = MODEL_CONTEXT_LIMITS.get(model, 200_000)

    total = count.input_tokens + max_output_tokens
    if total > limit:
        print(f"Over limit: {count.input_tokens} input + {max_output_tokens} output "
              f"= {total} > {limit}")
        return False
    return True

Every model in the table above is listed with a 200K-token context window. That limit covers input and output combined: with max_tokens=8192, 200,000 - 8,192 = 191,808 tokens remain for input.
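When a conversation does not fit, one option is to drop the oldest turns until the counted input is within that budget. The sketch below is one possible approach, not an official pattern: `input_budget` and `trim_to_fit` are hypothetical helpers, `client` is assumed to be an anthropic.Anthropic() instance, and the sketch assumes a conversation should still start with a user turn after trimming.

```python
from typing import Any, Optional


def input_budget(context_limit: int, max_tokens: int) -> int:
    """Tokens left for input once max_tokens is reserved for the output."""
    return context_limit - max_tokens


def trim_to_fit(client: Any, model: str, messages: list[dict],
                context_limit: int = 200_000, max_tokens: int = 4096,
                system: Optional[str] = None) -> list[dict]:
    """Drop the oldest turns until the counted input fits the budget.

    Assumption: `client` is an anthropic.Anthropic() instance. Each loop
    iteration costs one count_tokens round-trip.
    """
    budget = input_budget(context_limit, max_tokens)
    trimmed = list(messages)  # work on a copy, never mutate the caller's list
    while trimmed:
        kwargs: dict[str, Any] = {"model": model, "messages": trimmed}
        if system:
            kwargs["system"] = system
        if client.messages.count_tokens(**kwargs).input_tokens <= budget:
            return trimmed
        # Remove the oldest message, then any leading non-user turns so the
        # remaining conversation still starts with a user message.
        trimmed = trimmed[1:]
        while trimmed and trimmed[0]["role"] != "user":
            trimmed = trimmed[1:]
    raise ValueError("No suffix of the conversation fits within the input budget")
```

Dropping whole turns is the bluntest strategy; summarizing old turns or truncating a single long document may preserve more context, at the cost of extra logic.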

Cost estimation

# May 2026 pricing in USD per million tokens (MTok); verify against the official pricing page
PRICING = {
    "claude-opus-4-6":    {"input": 15.00, "output": 75.00},
    "claude-sonnet-4-6":  {"input":  3.00, "output": 15.00},
    "claude-sonnet-4-5":  {"input":  3.00, "output": 15.00},
    "claude-haiku-4-5":   {"input":  0.80, "output":  4.00},
}

def estimate_cost(messages: list[dict], model: str, system: str | None = None,
                  expected_output_tokens: int = 500) -> float:
    """Estimate cost in USD before sending."""
    kwargs = {"model": model, "messages": messages}
    if system:
        kwargs["system"] = system

    count = client.messages.count_tokens(**kwargs)
    p = PRICING.get(model, PRICING["claude-sonnet-4-5"])

    input_cost = (count.input_tokens / 1_000_000) * p["input"]
    output_cost = (expected_output_tokens / 1_000_000) * p["output"]
    total = input_cost + output_cost

    print(f"Input: {count.input_tokens} tokens (${input_cost:.5f})")
    print(f"Output: ~{expected_output_tokens} tokens (${output_cost:.5f})")
    print(f"Estimated total: ${total:.5f}")
    return total

estimate_cost(
    messages=[{"role": "user", "content": "Summarize this 50-page document..."}],
    model="claude-sonnet-4-5",
    expected_output_tokens=1000
)

Tracking actual usage

After a successful messages.create() call, the actual usage is in response.usage:

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

print(f"Actual input tokens: {response.usage.input_tokens}")
print(f"Actual output tokens: {response.usage.output_tokens}")

# If prompt caching is active:
# response.usage.cache_read_input_tokens  — tokens served from cache
# response.usage.cache_creation_input_tokens  — tokens written to cache

Use response.usage for billing reconciliation — it's the authoritative count. Use count_tokens() for pre-flight checks and cost estimation. The two should match within a few tokens for typical requests.
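To compare a pre-flight count with actual usage when prompt caching is active, the cache fields need to be added back in, because usage.input_tokens then covers only the uncached portion. A minimal sketch, assuming `usage` is the response.usage object shown above; `total_input_tokens` and `preflight_drift` are hypothetical helper names.

```python
def total_input_tokens(usage) -> int:
    """Sum uncached, cache-read, and cache-write input tokens from response.usage."""
    # The cache fields can be absent or None when caching is not in use.
    cache_read = getattr(usage, "cache_read_input_tokens", None) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", None) or 0
    return usage.input_tokens + cache_read + cache_write


def preflight_drift(estimated_input_tokens: int, usage) -> int:
    """Actual minus estimated input tokens; positive means the estimate was low."""
    return total_input_tokens(usage) - estimated_input_tokens

# Usage, given an earlier count_tokens() result and a messages.create() response:
#   drift = preflight_drift(count_response.input_tokens, response.usage)
```

Logging the drift over time is a cheap way to confirm that the count_tokens() call really mirrors the parameters of the production request.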

count_tokens vs tiktoken (OpenAI comparison)

Aspect	Anthropic count_tokens()	OpenAI tiktoken
How it works	Live API call	Local library (offline)
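Accuracy	Server-side count (documented as an estimate; may differ slightly from billed usage)	Raw text tokenized locally; chat-message totals approximate (±1-2 tokens)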
Speed	~100-300ms round-trip	Milliseconds (local)
Includes system/tool overhead	Yes, natively	Requires manual formula
Cost	Free (no charge for count calls; rate limits apply)	Free (local)

There is no official offline tokenizer for Claude — Anthropic uses a custom BPE tokenizer that isn't publicly released. The count_tokens() API is the recommended approach. For high-throughput pre-flight checks where latency matters, cache the token count for repeated identical prompts.
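Caching counts for repeated identical prompts can be sketched as a small in-memory memoizer keyed on a hash of the request parameters. This is an illustration under assumptions: `TokenCountCache` is a hypothetical class, `client` is assumed to be an anthropic.Anthropic() instance, and the parameters are assumed to be plain JSON-serializable dicts, lists, and strings.

```python
import hashlib
import json
from typing import Any


class TokenCountCache:
    """Memoize count_tokens() results for identical requests.

    Assumption: `client` is an anthropic.Anthropic() instance. The cache is
    in-memory and unbounded; add eviction for long-running services.
    """

    def __init__(self, client: Any) -> None:
        self._client = client
        self._counts: dict[str, int] = {}

    @staticmethod
    def _key(params: dict[str, Any]) -> str:
        # sort_keys makes the key independent of keyword-argument order.
        payload = json.dumps(params, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def count(self, **params: Any) -> int:
        """Return input_tokens, calling the API only on a cache miss."""
        key = self._key(params)
        if key not in self._counts:
            response = self._client.messages.count_tokens(**params)
            self._counts[key] = response.input_tokens
        return self._counts[key]

# Usage:
#   cache = TokenCountCache(client)
#   n = cache.count(model="claude-sonnet-4-5", messages=messages, system=system_prompt)
```

Because the model name is part of the key, counts for one model are never reused for another. A cache like this pays off mainly for a fixed system prompt and tool set that is counted on every request.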

FAQ

Does count_tokens include system prompt tokens? Yes — pass all the same parameters you plan to use in messages.create() for an accurate count.

Is count_tokens free? Yes — Anthropic doesn't charge for token counting calls.

count_tokens vs tiktoken? count_tokens is a live API call that returns the server-side count. tiktoken is a local library whose chat-message totals are approximate. For Claude, count_tokens is the only accurate option since there's no public offline tokenizer.

Last updated May 28, 2026. Code examples verified against the Anthropic Python SDK and token counting documentation. API behaviour and pricing may change — confirm against the official docs before deploying to production.