OpenAI token counting with tiktoken: Python guide

tiktoken is OpenAI's official Python tokenizer library. It counts tokens before you make an API call, letting you check context limits, estimate costs, and truncate inputs programmatically. This guide covers the basic encode/count flow, message-level counting for chat endpoints, encoding selection by model, and how to truncate text to a token budget.

The 30-second answer

Basic token counting

import tiktoken

# Get the encoding for a specific model
enc = tiktoken.encoding_for_model("gpt-4o")

text = "How many tokens is this sentence?"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")   # 7
print(f"Token IDs: {tokens}")           # a list of 7 integers; the exact IDs depend on the encoding

The encode() method returns a list of integer token IDs. The length of that list is the token count. Most English text averages about 0.75 words per token (roughly 4 characters per token), though this varies significantly by content: code, non-English text, and unusual strings such as identifiers or hashes often need more tokens per character than plain English prose.
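
To see how the ratio shifts with content type, you can measure characters per token on your own samples. The sketch below is illustrative only: chars_per_token and compare_samples are hypothetical helper names, and the encode function is passed in as an argument so any tiktoken encoding can be used.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence]) -> float:
    """Average number of characters covered by each token in `text`."""
    tokens = encode(text)
    if not tokens:
        return 0.0  # empty input: avoid dividing by zero
    return len(text) / len(tokens)


def compare_samples(samples: dict[str, str], encode: Callable[[str], Sequence]) -> dict[str, float]:
    """Map each sample label to its characters-per-token ratio, rounded to 2 places."""
    return {label: round(chars_per_token(s, encode), 2) for label, s in samples.items()}


# Usage with tiktoken (assumes the package is installed):
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   compare_samples({"prose": some_paragraph, "code": some_source_file}, enc.encode)
```

A lower ratio means the sample costs more tokens for the same number of characters, which is worth knowing before you budget for code-heavy or multilingual inputs.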

Encoding selection by model

Different models use different encodings. encoding_for_model() handles this automatically:

import tiktoken

# Encodings differ by model family
for model in ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"]:
    enc = tiktoken.encoding_for_model(model)
    print(f"{model}: {enc.name}")
    # gpt-4o and gpt-4o-mini print o200k_base;
    # gpt-4-turbo and gpt-3.5-turbo print cl100k_base

# Or get an encoding directly by name
enc = tiktoken.get_encoding("cl100k_base")       # GPT-4, GPT-4 Turbo, GPT-3.5 Turbo

# The GPT-4o family uses the newer o200k_base encoding
enc_o200k = tiktoken.get_encoding("o200k_base")

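Use encoding_for_model() rather than hardcoding the encoding name, so the code keeps working when you switch models. It raises KeyError for model names that are not in tiktoken's registry, which can happen with a new model and an older tiktoken release. In that case, upgrade tiktoken or fall back to get_encoding() with the encoding documented for that model: o200k_base for the GPT-4o family, cl100k_base for GPT-4 and GPT-3.5 Turbo.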
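
A lookup that tolerates unknown model names might look like the sketch below. The helper name encoding_for and its default fallback of o200k_base are assumptions made for illustration; pick the fallback that matches the model you are actually calling.

```python
def encoding_for(model: str, fallback: str = "o200k_base"):
    """Return the tiktoken encoding for `model`, or `fallback` if the name is unknown."""
    import tiktoken  # imported lazily so this helper can be defined before tiktoken is loaded

    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # encoding_for_model raises KeyError when it cannot map the model name.
        # The fallback is an assumption: counts are only right if it matches the model.
        return tiktoken.get_encoding(fallback)


# Usage (assumes tiktoken is installed; "my-fine-tune" is a made-up model name):
#   enc = encoding_for("my-fine-tune")
#   n = len(enc.encode("hello world"))
```

Counts from a fallback encoding are estimates, so leave some headroom when you use them for context checks.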

Counting tokens for chat message arrays

The raw text token count is not the full story for chat endpoints: each message adds formatting overhead. The counting recipe in the OpenAI Cookbook adds 3 tokens per message, 1 extra token when a message carries a name field, and 3 for the reply primer. (Only the legacy gpt-3.5-turbo-0301 snapshot used 4 per message and -1 for name.)

import tiktoken

def count_message_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Estimate tokens for a messages array as sent to chat.completions.create()."""
    enc = tiktoken.encoding_for_model(model)

    num_tokens = 3  # Reply primer: every reply is primed with <|start|>assistant<|message|>

    for message in messages:
        num_tokens += 3  # Per-message formatting overhead

        for key, value in message.items():
            # Assumes plain-string values; tool calls and image parts are not handled
            num_tokens += len(enc.encode(str(value)))

            if key == "name":
                num_tokens += 1  # A 'name' field costs 1 extra token

    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

count = count_message_tokens(messages)
print(f"Total input tokens: {count}")  # ~24

This is an approximation — the actual count from the API (in response.usage.prompt_tokens) is authoritative for billing. For pre-flight checks and context management, this formula is accurate enough to avoid hitting context limits.

Truncating text to a token budget

Encode, slice, then decode back to a string:

import tiktoken

def truncate_to_tokens(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Truncate text to at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)

    if len(tokens) <= max_tokens:
        return text

    # Slice to budget and decode
    truncated_tokens = tokens[:max_tokens]
    return enc.decode(truncated_tokens)

long_text = "..." * 10000  # some long document
truncated = truncate_to_tokens(long_text, max_tokens=4096)
print(f"Truncated to {len(tiktoken.encoding_for_model('gpt-4o').encode(truncated))} tokens")

Decoding truncated tokens can produce broken UTF-8 at boundaries — if you see garbled characters at the end of truncated text, trim the last few bytes or work at sentence boundaries instead. For document summarization pipelines, truncating at sentence boundaries is cleaner than a hard token cut.
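
One way to combine the hard cut with a sentence-boundary cleanup is sketched below. truncate_at_sentence is a hypothetical helper, the sentence detection is a deliberately simple regex (it will misfire on abbreviations and decimals), and enc is assumed to be any object with encode() and decode() methods, such as a tiktoken encoding.

```python
import re


def truncate_at_sentence(text: str, max_tokens: int, enc) -> str:
    """Cut text at a token budget, then back up to the last complete sentence."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text

    # Hard cut first. A split multi-byte character decodes to U+FFFD, so drop those.
    cut = enc.decode(tokens[:max_tokens]).rstrip("\ufffd")

    # Positions just after '.', '!' or '?' when followed by whitespace or the end of the cut.
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", cut)]

    # Fall back to the hard cut when no sentence boundary was found.
    return cut[: ends[-1]] if ends else cut


# Usage (assumes tiktoken is installed):
#   import tiktoken
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   summary_input = truncate_at_sentence(long_text, 4096, enc)
```

If the budget is strict, re-encode the result and confirm its length, since this sketch does not re-check the count after trimming.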

Pre-flight context limit check

Prevent context_length_exceeded errors by checking before you call:

CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gpt-4-turbo": 128_000,
    "gpt-3.5-turbo": 16_385,
}

def check_fits_context(messages: list[dict], model: str, max_completion_tokens: int = 2048) -> bool:
    """Return True if the messages fit within the model's context window."""
    input_tokens = count_message_tokens(messages, model)
    limit = CONTEXT_LIMITS.get(model, 128_000)
    fits = (input_tokens + max_completion_tokens) <= limit

    if not fits:
        print(f"Too long: {input_tokens} input + {max_completion_tokens} completion = "
              f"{input_tokens + max_completion_tokens} > {limit} limit")
    return fits
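
When the check above fails, one common remedy is to drop the oldest conversation turns until the request fits. The sketch below is one possible policy, not a recommendation from OpenAI: trim_to_fit is a hypothetical helper, it keeps every system message and the most recent message, and the counting function is passed in so you can plug in count_message_tokens.

```python
from typing import Callable


def trim_to_fit(messages: list[dict], budget: int,
                count: Callable[[list[dict]], int]) -> list[dict]:
    """Drop the oldest non-system messages until count(messages) <= budget."""
    trimmed = list(messages)  # work on a copy; the caller's list is left untouched
    while count(trimmed) > budget:
        # Index of the oldest message that is not a system prompt
        idx = next((i for i, m in enumerate(trimmed) if m.get("role") != "system"), None)
        if idx is None or idx == len(trimmed) - 1:
            break  # only system messages or the latest message remain; stop dropping
        del trimmed[idx]
    return trimmed


# Usage, assuming the count_message_tokens helper and limits from this guide:
#   budget = 128_000 - 2048  # context limit minus the completion allowance
#   fitted = trim_to_fit(messages, budget, lambda ms: count_message_tokens(ms, "gpt-4o"))
```

The result can still exceed the budget if the system prompt and latest message are too long on their own, so run the pre-flight check again on the trimmed list.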

Cost estimation

# May 2026 pricing (per MTok)
PRICING = {
    "gpt-4o":       {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":  {"input": 0.15, "output":  0.60},
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    p = PRICING.get(model, PRICING["gpt-4o"])
    cost = (input_tokens / 1_000_000) * p["input"]
    cost += (output_tokens / 1_000_000) * p["output"]
    return cost

cost = estimate_cost(input_tokens=5000, output_tokens=500, model="gpt-4o")
print(f"Estimated cost: ${cost:.5f}")  # ~$0.01750

Cost estimation is useful for batch jobs — before running 10,000 calls, estimate total token usage and multiply by the per-token rate to verify the job is within budget. Always check the OpenAI pricing page for current rates; they change.
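
The batch arithmetic can be sketched as follows. estimate_batch_cost is a hypothetical helper, the flat per-call output size is a simplifying assumption, and the prices in the usage example are the gpt-4o figures from the table above, which you should replace with current rates.

```python
def estimate_batch_cost(input_token_counts: list[int], output_tokens_per_call: int,
                        input_price: float, output_price: float) -> float:
    """Estimate the total USD cost of a batch; prices are USD per million tokens."""
    total_in = sum(input_token_counts)
    # Assumes every call produces the same number of output tokens
    total_out = output_tokens_per_call * len(input_token_counts)
    return (total_in / 1_000_000) * input_price + (total_out / 1_000_000) * output_price


# 10,000 calls of 5,000 input tokens and 500 output tokens each
total = estimate_batch_cost([5000] * 10_000, 500, input_price=2.50, output_price=10.00)
print(f"Estimated batch cost: ${total:.2f}")  # $175.00
```

In practice the input counts would come from running a token counter over each prompt in the batch, and the output size is usually the hardest number to guess, so treat the result as an upper or lower bound depending on how you set it.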

FAQ

Which encoding does GPT-4o use? o200k_base. GPT-4o and GPT-4o mini use o200k_base; GPT-4, GPT-4 Turbo, and GPT-3.5 Turbo use cl100k_base. Use tiktoken.encoding_for_model(model_name) to get the right encoding automatically.

How accurate is tiktoken vs the actual API count? For plain text it matches the model's tokenizer, provided you use the right encoding. The per-message overhead formula is approximate, so the actual API count may differ by 1-2 tokens, and by more when a request includes tools or images. For billing reconciliation, use response.usage from the API response.

How do I truncate text to a token budget? Encode to token IDs, slice the list, then decode. Handle potential UTF-8 boundary issues at the cut point.

Last updated May 28, 2026. Code examples verified against tiktoken and the OpenAI Python SDK v1.x. Token counts and model context limits may change — verify against the official OpenAI documentation.