OpenAI context_length_exceeded (HTTP 400): what it means and how to fix it

If the OpenAI API returned HTTP 400 with "code": "context_length_exceeded", the total number of tokens in your request exceeds what the model can hold in its context window. This is almost always an input-side problem: your prompt plus conversation history is too long, and the API rejects the request before generating a single output token. This page explains what's happening, how to count tokens accurately, and five strategies to fix it.

What the error looks like

The full error body from the API looks like this:

{
  "error": {
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 134521 tokens. Please reduce the length of the messages.",
    "type": "invalid_request_error",
    "param": "messages",
    "code": "context_length_exceeded"
  }
}

Key details in that body: the message field tells you exactly how many tokens your request used and what the model's limit is. Read it — it saves you from guessing. The type is invalid_request_error (the parent category for all malformed requests); the specific code is context_length_exceeded.
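If you want to log how far over the limit a request was, the two numbers can be pulled out of the message field. The following is a minimal sketch, not an official parser: the helper name parse_context_error is hypothetical, and the regular expressions assume the message wording shown in the example body above, which OpenAI does not guarantee to keep stable.

```python
import re
from typing import Optional, Tuple

# Assumption: the message wording matches the example body above.
# The wording is not a documented contract, so a failed match returns None.
_LIMIT_RE = re.compile(r"maximum context length is (\d+) tokens")
_USED_RE = re.compile(r"resulted in (\d+) tokens")


def parse_context_error(body: dict) -> Optional[Tuple[int, int]]:
    """Return (limit, used) for a context_length_exceeded body, else None."""
    error = body.get("error") or {}
    if error.get("code") != "context_length_exceeded":
        return None
    message = error.get("message") or ""
    limit = _LIMIT_RE.search(message)
    used = _USED_RE.search(message)
    if not (limit and used):
        return None
    return int(limit.group(1)), int(used.group(1))


example_body = {
    "error": {
        "message": "This model's maximum context length is 128000 tokens. "
                   "However, your messages resulted in 134521 tokens. "
                   "Please reduce the length of the messages.",
        "type": "invalid_request_error",
        "param": "messages",
        "code": "context_length_exceeded",
    }
}

limit, used = parse_context_error(example_body)
print(f"over the limit by {used - limit} tokens")  # over the limit by 6521 tokens
```

Branch on the code field, not on the message text; the parsed numbers are only useful for logging and for deciding how much to trim.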

Context window sizes by model

The limit that matters is the total context window — input tokens plus output tokens combined. In practice, the constraint you hit is almost always input, since output tokens are capped separately by your max_tokens setting.

| Model | Context window | Notes |
|---|---|---|
| gpt-4o | 128,000 tokens | Current flagship; 128K as of 2026 |
| gpt-4o-mini | 128,000 tokens | Smaller/cheaper; same context window |
| gpt-4-turbo | 128,000 tokens | Legacy; 128K introduced in late 2023 |
| gpt-3.5-turbo | 16,385 tokens | Legacy; significantly smaller window |

OpenAI updates model specs regularly. Always verify the current limits in the OpenAI models documentation before hardcoding these numbers in your application logic.

How to count tokens accurately (before you send)

Word count is not token count. Tokens are subword units: a long word such as "tokenization" is split into more than one token (the exact count depends on the model's encoding), punctuation often gets its own token, and code is tokenized differently from prose. The right tool is tiktoken, OpenAI's own tokenizer library:

pip install tiktoken

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# Check a single string
print(count_tokens("Hello, how can I help you today?"))  # 9 tokens

# For a chat completion, count all messages
def count_chat_tokens(messages: list, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    num_tokens = 0
    for message in messages:
        # ~4 tokens overhead per message (role, formatting)
        num_tokens += 4
        for key, value in message.items():
            num_tokens += len(enc.encode(str(value)))
    num_tokens += 2  # reply priming
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
print(count_chat_tokens(messages))  # check before sending

Run this count before every request in development. If you only occasionally come close to the limit, any of the strategies below will do. If you are consistently above 90% of the limit, treat it as a design issue and reserve at least 10–15% of the window for output tokens.
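The buffer rule can be turned into a concrete input budget. This is a small sketch under stated assumptions: the function names input_token_budget and fits_budget are hypothetical, the 15% default is simply the upper end of the range suggested above, and the window sizes passed in should come from the current model documentation.

```python
def input_token_budget(context_window: int, reserve_percent: int = 15) -> int:
    """Tokens available for input after reserving a share of the window for output."""
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be between 0 and 99")
    # Integer arithmetic avoids floating-point rounding surprises.
    return context_window * (100 - reserve_percent) // 100


def fits_budget(token_count: int, context_window: int, reserve_percent: int = 15) -> bool:
    """True if a request of token_count input tokens stays inside the budget."""
    return token_count <= input_token_budget(context_window, reserve_percent)


print(input_token_budget(128_000))    # 108800
print(input_token_budget(16_385))     # 13927
print(fits_budget(120_000, 128_000))  # False
```

Comparing the result of count_chat_tokens against this budget before each request gives an early warning, well before the API starts returning 400s.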

5 fix strategies

1. Truncate oldest messages from conversation history

The most common cause of this error in chat applications is unbounded history: every turn gets appended to the messages array, and eventually the cumulative total overflows. The fix is a sliding window that keeps only the most recent messages that fit within a token budget:

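MAX_HISTORY_TOKENS = 100_000  # budget for system prompt + history; the rest is left for output
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}

def trim_history(messages: list, model: str = "gpt-4o") -> list:
    """Return the most recent messages that fit the budget alongside the system prompt."""
    # `messages` holds only user/assistant turns; the system prompt is kept separately.
    trimmed = list(messages)  # copy so the caller's list is not mutated
    # Stop when the list is empty, so an oversized system prompt cannot cause an IndexError.
    while trimmed and count_chat_tokens([SYSTEM_PROMPT] + trimmed, model) > MAX_HISTORY_TOKENS:
        trimmed.pop(0)  # drop the oldest turn first
    return trimmed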

Always preserve the system prompt and remove from the oldest end. Dropping the most recent messages would confuse the model about the current topic.

2. Summarize earlier conversation turns

Truncation loses context. A smarter approach: periodically ask the model to summarize the conversation so far, replace the old history with the summary, and continue. This compresses many tokens of history into a few hundred while retaining the important information.

def summarize_history(messages: list, client) -> str:
    summary_request = messages + [{
        "role": "user",
        "content": "Summarize our conversation so far in 3-5 sentences, "
                   "preserving the key facts and decisions made."
    }]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # use cheaper model for summarization
        messages=summary_request,
        max_tokens=300
    )
    return response.choices[0].message.content
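The "replace the old history with the summary" step can be sketched as follows. This is an illustration, not the only reasonable design: the name compact_history, the keep_recent default of 4, the summary prefix text, and the choice to store the summary as a system message are all assumptions. The summarizer is passed in as a plain callable so the logic can be exercised without calling the API.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compact_history(
    messages: List[Message],
    summarize: Callable[[List[Message]], str],
    keep_recent: int = 4,
) -> List[Message]:
    """Replace all but the last keep_recent messages with a single summary message."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to summarize
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary_message = {
        "role": "system",  # assumption: a system message is one possible carrier for the summary
        "content": "Summary of the earlier conversation: " + summarize(older),
    }
    return [summary_message] + recent


# Stand-in summarizer so the example runs offline.
history = [{"role": "user", "content": f"message {i}"} for i in range(6)]
compacted = compact_history(history, lambda older: f"{len(older)} earlier turns", keep_recent=2)
print(len(compacted))           # 3
print(compacted[0]["content"])  # Summary of the earlier conversation: 4 earlier turns
```

In a real application the callable could wrap the function above, for example lambda older: summarize_history(older, client). Keeping the last few turns verbatim preserves the exact wording of the current topic, which a summary tends to blur.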

3. Chunk long documents instead of sending them whole

If the input is a long document (a PDF, a codebase, a research paper), don't paste it all at once. Break it into chunks and either process each independently, or use retrieval — embed the document, find the relevant sections for the query, and send only those.

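def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list:
    """Split text into overlapping chunks of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        # An overlap >= chunk_size would never advance and loop forever.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    enc = tiktoken.encoding_for_model("gpt-4o")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # last chunk reached; skip a redundant overlap-only tail
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens of context
    return chunks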
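Once a document is chunked, the retrieval option means sending only the chunks relevant to the query. The sketch below shows the selection step only, and it deliberately swaps in plain keyword overlap as a stand-in for embedding similarity so that it runs without any API call. The function name select_chunks and the scoring rule are hypothetical; a production system would typically rank chunks by embedding distance instead.

```python
import re
from typing import List


def _words(text: str) -> set:
    """Lowercased alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_chunks(chunks: List[str], query: str, max_chunks: int = 3) -> List[str]:
    """Return up to max_chunks chunks sharing the most words with the query.

    Keyword overlap is a stand-in for embedding similarity (an assumption
    made to keep the example self-contained).
    """
    query_words = _words(query)
    scored = []
    for index, chunk in enumerate(chunks):
        overlap = len(query_words & _words(chunk))
        if overlap > 0:
            scored.append((overlap, index))
    # Highest overlap first; ties go to the earlier chunk.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Restore document order so the selected text reads naturally.
    selected = sorted(index for _, index in scored[:max_chunks])
    return [chunks[i] for i in selected]


docs = [
    "Refund policy: refunds are issued within 30 days.",
    "Shipping takes 5 business days.",
    "Contact support by email.",
]
print(select_chunks(docs, "How long do refunds take?", max_chunks=1))
```

Only the selected chunks go into the prompt, so the request size depends on max_chunks and chunk_size, not on the length of the source document.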

4. Reduce your system prompt size

System prompts are included in every request and count toward the context window every time. A 2,000-token system prompt takes 2,000 tokens out of the window on each request, and over a 100-turn conversation it adds up to 200,000 input tokens processed across the session. Audit your system prompt for repetition and remove anything the model can infer from context. Aim for under 500 tokens unless the task genuinely requires more.

5. Switch to a model with a larger context window

If you're on gpt-3.5-turbo (16K tokens), switching to gpt-4o or gpt-4o-mini (both 128K) resolves the limit immediately for most use cases. Note that this changes pricing — run the numbers before assuming it's a free fix.
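One way to apply this without paying for the larger model on every request is to pick the first model whose window fits. This is a rough sketch with assumptions: the function name pick_model and the preference order are hypothetical, and the window sizes are copied from the table on this page, so they should be re-checked against the current model documentation.

```python
from typing import Optional, Sequence

# Assumption: values copied from the table on this page; verify before relying on them.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
}


def pick_model(
    required_tokens: int,
    preferred_order: Sequence[str] = ("gpt-3.5-turbo", "gpt-4o-mini", "gpt-4o"),
) -> Optional[str]:
    """Return the first preferred model whose window holds required_tokens, else None.

    required_tokens should cover the input plus the output you want to reserve.
    """
    for model in preferred_order:
        if required_tokens <= CONTEXT_WINDOWS[model]:
            return model
    return None  # nothing fits: shorten the input instead


print(pick_model(10_000))   # gpt-3.5-turbo
print(pick_model(50_000))   # gpt-4o-mini
print(pick_model(200_000))  # None
```

A None result is the signal to fall back to the other strategies, since no model in the list can take the request as it stands.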

context_length_exceeded vs. max_tokens: not the same problem

These two situations are frequently confused but have completely different causes and fixes:

| Situation | HTTP status | What happened | Fix |
|---|---|---|---|
| context_length_exceeded | 400 (error) | Your INPUT is too long; request rejected before generating | Shorten the input (strategies above) |
| Hit max_tokens cap | 200 (success) | Your OUTPUT was truncated by the cap you set; finish_reason: "length" | Raise or remove your max_tokens parameter |

context_length_exceeded is a hard failure — the API returns nothing. Hitting max_tokens is a successful (but truncated) response. Check response.choices[0].finish_reason: if it's "stop", generation finished naturally; if it's "length", you hit the max_tokens cap.
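The distinction can be expressed as a small classifier over the HTTP status and the decoded JSON body. This is a sketch for illustration: the function name classify_outcome and its return labels are hypothetical, and it works on plain dictionaries, so it applies to raw HTTP responses and not to the SDK's response objects.

```python
def classify_outcome(status_code: int, body: dict) -> str:
    """Label a chat completion result as one of four illustrative categories."""
    if status_code == 400:
        error = body.get("error") or {}
        if error.get("code") == "context_length_exceeded":
            return "input_too_long"    # shorten the input
        return "other_error"
    if status_code == 200:
        choices = body.get("choices") or []
        finish_reason = choices[0].get("finish_reason") if choices else None
        if finish_reason == "length":
            return "output_truncated"  # raise or remove max_tokens
        if finish_reason == "stop":
            return "complete"
    return "other_error"


rejected = {"error": {"code": "context_length_exceeded", "type": "invalid_request_error"}}
truncated = {"choices": [{"finish_reason": "length"}]}
finished = {"choices": [{"finish_reason": "stop"}]}

print(classify_outcome(400, rejected))   # input_too_long
print(classify_outcome(200, truncated))  # output_truncated
print(classify_outcome(200, finished))   # complete
```

Routing on these labels keeps the two fixes apart: trimming input for the first case, adjusting max_tokens for the second.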

FAQ

What does OpenAI context_length_exceeded mean? An HTTP 400 error with code context_length_exceeded means the total tokens in your request — system prompt plus all messages in the conversation history plus any tool definitions — exceeds the model's maximum context window. It has nothing to do with your API key, rate limits, or billing. You need to reduce the number of input tokens before the request will succeed.

What is the difference between context_length_exceeded and hitting max_tokens? context_length_exceeded is about INPUT tokens: your prompt is too long and the API rejects the request with a 400 error before generating anything. max_tokens is a parameter you set to cap OUTPUT length: when reached, the model stops generating but the request succeeds with a finish_reason of "length". They are different problems. context_length_exceeded = shorten your input. max_tokens truncation = raise or remove the max_tokens cap.

How do I count tokens before sending a request to the OpenAI API? Use the tiktoken library. Install it with pip install tiktoken, then call tiktoken.encoding_for_model("gpt-4o") to get the encoder, and enc.encode(your_text) to tokenize. The length of the resulting list is the token count. For chat completions, count each message's content plus a small overhead per message (roughly 4 tokens each).

Last updated May 28, 2026. Context window sizes and error codes verified against OpenAI's official API documentation and model specs. OpenAI updates model context limits over time — confirm current limits at platform.openai.com/docs/models before hardcoding them in production.