How do I count tokens for GPT-4o messages in Python?

Use tiktoken: import tiktoken; enc = tiktoken.encoding_for_model('gpt-4o'); tokens = enc.encode(text); count = len(tokens). For message arrays (not just strings), each message adds overhead tokens for the role and separators. OpenAI's cookbook formula for current chat models is 3 tokens per message (plus 1 if a name field is set) plus 3 tokens for the reply primer; the legacy gpt-3.5-turbo-0301 used 4 per message. The exact overhead can vary by model version.

Which tiktoken encoding does GPT-4o use?

GPT-4o and GPT-4o-mini use the o200k_base encoding. GPT-4, GPT-4-turbo, and GPT-3.5-turbo use cl100k_base. The older GPT-3 models (text-davinci-003 etc.) used p50k_base or r50k_base, but those models are deprecated. For any current OpenAI chat model, call tiktoken.encoding_for_model(model_name) and it returns the matching encoding, provided the installed tiktoken version knows the model.

Does tiktoken count tokens the same way OpenAI does?

For text content, tiktoken matches the tokenizer the API uses, provided you pick the model's encoding. The message formatting overhead comes from OpenAI's cookbook formula (3 tokens per message for current chat models, 4 for the legacy gpt-3.5-turbo-0301, plus 3 for the reply primer) and is approximate: the actual server-side count may differ by 1-2 tokens in edge cases. For billing purposes, the count in the API response's usage field is authoritative. Use tiktoken for pre-flight checks and cost estimates; don't rely on it for exact billing reconciliation.

OpenAI token counting with tiktoken: Python guide

tiktoken is OpenAI's official Python tokenizer library. It counts tokens before you make an API call, letting you check context limits, estimate costs, and truncate inputs programmatically. This guide covers the basic encode/count flow, message-level counting for chat endpoints, encoding selection by model, and how to truncate text to a token budget.

The 30-second answer

Install: pip install tiktoken
Count text tokens: enc = tiktoken.encoding_for_model("gpt-4o"); n = len(enc.encode(text))
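Encodings: GPT-4o and GPT-4o-mini use o200k_base; GPT-4, GPT-4-turbo, and GPT-3.5-turbo use cl100k_base.
Message overhead: about 3 tokens per message for current chat models (4 for the legacy gpt-3.5-turbo-0301) + 3 for the reply primer. Add this to your content token count for a realistic total.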

Basic token counting

import tiktoken

# Get the encoding for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")

text = "How many tokens is this sentence?"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")   # number of token IDs in the list
print(f"Token IDs: {tokens}")           # list of ints; the exact IDs depend on the encoding (o200k_base for gpt-4o)

The encode() method returns a list of integer token IDs. The length of that list is the token count. English prose averages roughly 0.75 words per token (about 4 characters per token), but the ratio varies significantly by content: source code, numbers, and non-English text usually need more tokens per character than plain English.
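When only a ballpark figure is needed, the 4-characters-per-token rule of thumb can be sketched as below. The function name rough_token_estimate and its default ratio are illustrative assumptions, not part of tiktoken, and the result is only a rough guide for English prose; use enc.encode() whenever the number matters.

```python
import math

def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Ballpark token count from character length (English-prose heuristic)."""
    if not text:
        return 0
    # Round up so short strings never estimate to zero tokens.
    return math.ceil(len(text) / chars_per_token)

sample = "How many tokens is this sentence?"
print(f"Rough estimate: {rough_token_estimate(sample)} tokens")
```

Comparing this estimate against len(enc.encode(text)) on a sample of your own data is a quick way to see how far your content drifts from the heuristic.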

Encoding selection by model

Different models use different encodings. encoding_for_model() handles this automatically:

import tiktoken

# GPT-4o models map to o200k_base; GPT-4 and GPT-3.5-turbo map to cl100k_base
for model in ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"]:
    enc = tiktoken.encoding_for_model(model)
    print(f"{model}: {enc.name}")
    # gpt-4o, gpt-4o-mini        -> o200k_base
    # gpt-4-turbo, gpt-3.5-turbo -> cl100k_base

# Or get an encoding directly by name
enc_cl100k = tiktoken.get_encoding("cl100k_base")  # GPT-4, GPT-4-turbo, GPT-3.5-turbo
enc_o200k = tiktoken.get_encoding("o200k_base")    # GPT-4o, GPT-4o-mini

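Prefer encoding_for_model() over hardcoding an encoding name, so your code follows the library's mapping as models change. It raises KeyError for models missing from tiktoken's registry, which usually means the installed tiktoken predates the model. Upgrade tiktoken first; if the model is still unknown, fall back to an explicit encoding such as o200k_base for recent OpenAI chat models.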
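The fallback for unregistered models can be sketched as a small wrapper. This is a minimal sketch under stated assumptions: encoding_name_with_fallback and demo_lookup are hypothetical helpers, the demo table is a stand-in rather than tiktoken's real registry, and the o200k_base default is an assumption that suits GPT-4o-era models (pass a different fallback if your models differ).

```python
from typing import Callable

def encoding_name_with_fallback(
    model: str,
    lookup: Callable[[str], str],
    fallback: str = "o200k_base",
) -> str:
    """Return the encoding name for `model`, or `fallback` if the lookup raises KeyError."""
    try:
        return lookup(model)
    except KeyError:
        return fallback

# Stand-in lookup for demonstration only (hypothetical, not tiktoken's real table).
_DEMO_TABLE = {"gpt-4o": "o200k_base", "gpt-4-turbo": "cl100k_base"}

def demo_lookup(model: str) -> str:
    return _DEMO_TABLE[model]  # raises KeyError for unknown models

print(encoding_name_with_fallback("gpt-4o", demo_lookup))
print(encoding_name_with_fallback("some-future-model", demo_lookup))
```

In real code, tiktoken.encoding_name_for_model can be passed as the lookup, and the returned name handed to tiktoken.get_encoding().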

Counting tokens for chat message arrays

The raw text token count is not the full story for chat endpoints: each message adds formatting overhead. OpenAI's cookbook formula for current chat models adds 3 tokens per message (plus 1 when a name field is present) and 3 for the reply primer. The legacy gpt-3.5-turbo-0301 used 4 per message and subtracted 1 for a name.

import tiktoken

def count_message_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Estimate prompt tokens for a messages array sent to chat.completions.create()."""
    enc = tiktoken.encoding_for_model(model)

    # Overhead constants from OpenAI's cookbook for current chat models.
    # The legacy gpt-3.5-turbo-0301 used 4 per message and -1 for a name.
    tokens_per_message = 3
    tokens_per_name = 1

    num_tokens = 3  # Reply primer: every reply is primed with <|start|>assistant<|message|>

    for message in messages:
        num_tokens += tokens_per_message

        for key, value in message.items():
            # str() keeps this simple; non-string content (e.g. image parts) is not counted accurately
            num_tokens += len(enc.encode(str(value)))

            if key == "name":
                num_tokens += tokens_per_name

    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

count = count_message_tokens(messages)
print(f"Total input tokens: {count}")  # about 24 with these constants

This is an estimate. The count the API returns in response.usage.prompt_tokens is authoritative for billing. For pre-flight checks and context management, the estimate is close enough to keep requests under the context limit, especially if you leave a few tokens of headroom.
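One way to use the estimate for context management is to drop the oldest turns until the conversation fits a budget. The sketch below is an illustration under assumptions, not an OpenAI-recommended policy: trim_to_budget and demo_count are hypothetical names, system messages are always kept, and the stand-in counter uses word counts instead of a real tokenizer so the example runs without tiktoken.

```python
from typing import Callable

def trim_to_budget(
    messages: list[dict],
    budget: int,
    count_tokens: Callable[[list[dict]], int],
) -> list[dict]:
    """Drop the oldest non-system messages until the estimate fits `budget`."""
    kept = list(messages)
    while count_tokens(kept) > budget:
        # Index of the oldest message that is not a system prompt, if any.
        idx = next((i for i, m in enumerate(kept) if m.get("role") != "system"), None)
        if idx is None:
            break  # only system messages remain; nothing more to drop
        del kept[idx]
    return kept

# Stand-in counter for demonstration: 3 for the primer + (3 + word count) per message.
def demo_count(msgs: list[dict]) -> int:
    return 3 + sum(3 + len(str(m["content"]).split()) for m in msgs)

history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
trimmed = trim_to_budget(history, budget=20, count_tokens=demo_count)
print([m["role"] for m in trimmed])
```

With a real tokenizer, pass something like lambda m: count_message_tokens(m, "gpt-4o") as count_tokens. Dropping single messages can leave an assistant turn without its preceding user turn; dropping in user/assistant pairs is a reasonable refinement.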

Truncating text to a token budget

Encode, slice, then decode back to a string:

def truncate_to_tokens(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Truncate text to at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    if len(tokens) <= max_tokens:
        return text

    # Slice to budget and decode
    truncated_tokens = tokens[:max_tokens]
    return enc.decode(truncated_tokens)

long_text = "..." * 10000  # some long document
truncated = truncate_to_tokens(long_text, max_tokens=4096)
print(f"Truncated to {len(tiktoken.encoding_for_model('gpt-4o').encode(truncated))} tokens")

A hard cut can split a multi-byte character across the token boundary. tiktoken's decode() replaces the resulting invalid bytes with the Unicode replacement character (U+FFFD) by default, so a truncated string may end in one or more garbled characters; strip them, or cut at a sentence boundary instead. For document summarization pipelines, truncating at sentence boundaries is cleaner than a hard token cut.
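A cleanup step after the hard cut might look like the sketch below. It assumes an encoder object with encode() and decode() methods, which a tiktoken encoding provides; ByteTokenizer is a hypothetical stand-in (one token per UTF-8 byte) used only so the example runs without tiktoken and makes the split-character case easy to reproduce. The sentence-boundary rule here is deliberately naive.

```python
class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte (not a real tiktoken encoding)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        # Mirrors the replace-on-error behaviour described above.
        return bytes(tokens).decode("utf-8", errors="replace")

def truncate_clean(text: str, max_tokens: int, enc) -> str:
    """Hard-cut to max_tokens, then tidy the tail of the result."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text

    cut = enc.decode(tokens[:max_tokens])
    # Remove replacement characters left by a split multi-byte character.
    cut = cut.rstrip("\ufffd")
    # Prefer ending on the last complete sentence, if there is one.
    last_stop = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    if last_stop != -1:
        cut = cut[: last_stop + 1]
    return cut

enc = ByteTokenizer()
print(repr(truncate_clean("café au lait", 4, enc)))
print(repr(truncate_clean("One. Two. Three.", 12, enc)))
```

Because the tail is trimmed after decoding, the result can be shorter than max_tokens; that is usually acceptable for a budget that acts as an upper bound.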

Pre-flight context limit check

Prevent context_length_exceeded errors by checking before you call:

CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-4-turbo": 128_000,
    "gpt-3.5-turbo": 16_385,
}

def check_fits_context(messages: list[dict], model: str, max_completion_tokens: int = 2048) -> bool:
    """Return True if the messages fit within the model's context window."""
    input_tokens = count_message_tokens(messages, model)
    limit = CONTEXT_LIMITS.get(model, 128_000)
    fits = (input_tokens + max_completion_tokens) <= limit

    if not fits:
        print(f"Too long: {input_tokens} input + {max_completion_tokens} completion = "
              f"{input_tokens + max_completion_tokens} > {limit} limit")
    return fits
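The same arithmetic can be turned around to ask how many completion tokens are left for a given prompt. This is a minimal sketch: max_completion_budget and its safety_margin default are illustrative assumptions, the margin exists only to absorb the small estimate drift mentioned earlier, and any separate output cap a model enforces is not modeled here.

```python
def max_completion_budget(input_tokens: int, context_limit: int, safety_margin: int = 16) -> int:
    """Completion tokens that should still fit, or 0 if the prompt already fills the window."""
    return max(0, context_limit - input_tokens - safety_margin)

# Example with the gpt-3.5-turbo limit from the table above.
print(max_completion_budget(input_tokens=1_000, context_limit=16_385))
```

The result can be passed as the completion limit of the request, after clamping it to whatever output cap the target model documents.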

Cost estimation

# May 2026 pricing (per MTok)
PRICING = {
    "gpt-4o":       {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":  {"input": 0.15, "output":  0.60},
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    p = PRICING.get(model, PRICING["gpt-4o"])
    cost = (input_tokens / 1_000_000) * p["input"]
    cost += (output_tokens / 1_000_000) * p["output"]
    return cost

cost = estimate_cost(input_tokens=5000, output_tokens=500, model="gpt-4o")
print(f"Estimated cost: ${cost:.5f}")  # ~$0.01750

Cost estimation is useful for batch jobs — before running 10,000 calls, estimate total token usage and multiply by the per-token rate to verify the job is within budget. Always check the OpenAI pricing page for current rates; they change.
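The batch estimate described above can be sketched by sampling a few prompts, averaging their token counts, and scaling up. The helper name estimate_batch_cost and its parameters are hypothetical; the prices in the example are the gpt-4o figures from the table above and should be checked against current rates.

```python
from statistics import mean

def estimate_batch_cost(
    sample_input_tokens: list[int],
    n_calls: int,
    avg_output_tokens: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Project total cost in dollars from a sample of per-call input token counts."""
    avg_input = mean(sample_input_tokens)
    total_input = avg_input * n_calls
    total_output = avg_output_tokens * n_calls
    return (
        total_input / 1_000_000 * input_price_per_mtok
        + total_output / 1_000_000 * output_price_per_mtok
    )

# Three sampled prompts, projected to 10,000 calls at the gpt-4o rates listed above.
projected = estimate_batch_cost([1_000, 2_000, 3_000], 10_000, 500, 2.50, 10.00)
print(f"Projected batch cost: ${projected:,.2f}")
```

A larger, more representative sample gives a tighter projection; output length is usually the harder number to predict, so it is worth estimating conservatively.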

FAQ

Which encoding does GPT-4o use? o200k_base, which GPT-4o-mini also uses. GPT-4, GPT-4-turbo, and GPT-3.5-turbo use cl100k_base. Use tiktoken.encoding_for_model(model_name) to get the right encoding automatically.

How accurate is tiktoken vs the actual API count? Accurate for text, as long as you use the model's encoding. The message overhead formula (3 tokens per message for current chat models, 4 for the legacy gpt-3.5-turbo-0301, plus 3 for the reply primer) is approximate: the actual API count may differ by 1–2 tokens. For billing reconciliation, use response.usage from the API response.

How do I truncate text to a token budget? Encode to token IDs, slice the list, then decode. Handle potential UTF-8 boundary issues at the cut point.

Last updated May 28, 2026. Code examples verified against tiktoken and the OpenAI Python SDK v1.x. Token counts and model context limits may change — verify against the official OpenAI documentation.