What is the cheapest OpenAI model for most tasks?

GPT-4o-mini is the best cost-to-quality choice for most tasks that don't require complex reasoning or deep world knowledge. At $0.15 per million input tokens and $0.60 per million output tokens (2026 pricing), it is roughly 15–20x cheaper than GPT-4o for typical prompts while being capable enough for classification, extraction, summarization, and simple Q&A.

How does the OpenAI Batch API reduce costs?

The OpenAI Batch API processes requests asynchronously in off-peak windows and returns results within 24 hours. In exchange for that latency, OpenAI offers a 50% discount on input and output token costs. For any workload that doesn't require a real-time response — bulk classification, dataset annotation, report generation, nightly processing — the Batch API cuts your token bill in half with no change to output quality.

Does caching work with the OpenAI API?

OpenAI applies prompt caching automatically to inputs of 1,024 tokens or more. When the same prompt prefix appears in a later request while the cache entry is still live (entries expire after a short period of inactivity), the cached tokens are charged at a 50% discount. You do not need to opt in: caching applies whenever the prompt is long enough and the prefix matches. For application-level caching of identical queries, implement it at your layer using a key-value store before the request even reaches the API.

How do I set a hard spending limit on the OpenAI API?

Go to platform.openai.com/account/limits. You can set a hard monthly usage limit that stops all API requests once the limit is reached, preventing runaway spend from bugs or abuse. Set both a soft limit (email alert) and a hard limit (API cutoff). For production systems, the hard limit should be set at 2–3x your expected monthly cost, not at your absolute maximum budget.

How to reduce your OpenAI API bill: 7 levers ranked by ROI

If your OpenAI API costs are higher than expected, there are seven distinct levers you can pull — and they're not equally effective. Model selection alone can cut costs by 90% for the right workloads. Prompt caching and the Batch API together can halve what's left. This page covers all seven in order of impact, with 2026 pricing data and a cost comparison table you can use to make the tradeoff decisions for your specific workload.

The 30-second answer

Biggest win: switch to gpt-4o-mini wherever GPT-4o isn't actually needed — it's 15–20x cheaper.
Free 50% discount: use the Batch API for any non-real-time workload.
Prompt caching: OpenAI automatically caches repeated prompt prefixes over 1,024 tokens at 50% off — structure your prompts to maximize prefix reuse.
Set a hard limit: before optimizing anything else, set a spending cap at platform.openai.com/account/limits so a bug can't run up an unlimited bill.

2026 pricing comparison

Current OpenAI pricing for the four most commonly used chat models (per million tokens):

Model	Input ($/M tokens)	Output ($/M tokens)	Context window	Best for
GPT-4o	$2.50	$10.00	128K	Complex reasoning, nuanced generation, multimodal tasks
GPT-4o-mini	$0.15	$0.60	128K	Classification, extraction, summarization, simple Q&A, code
GPT-3.5-turbo	$0.50	$1.50	16K	Legacy workloads; GPT-4o-mini is better value at lower cost
o3-mini	$1.10	$4.40	200K	Structured reasoning tasks where thinking time matters

Prices are standard (non-cached, non-batch) as of June 2026. Batch API applies a 50% discount on top of these rates. Prompt caching applies a 50% discount on cached input tokens for eligible prompts.
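The rates and discounts above can be turned into a small estimator. This is a minimal sketch, not an official calculator: `estimate_cost` and its parameters are hypothetical names, the prices are copied from the table, and because the text does not say whether the batch and caching discounts combine, the sketch conservatively applies only the batch discount when `batch=True`.

```python
# USD per million tokens, copied from the pricing table above.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

BATCH_DISCOUNT = 0.5  # Batch API: 50% off input and output tokens
CACHE_DISCOUNT = 0.5  # prompt caching: 50% off cached input tokens


def estimate_cost(model, input_tokens, output_tokens,
                  cached_input_tokens=0, batch=False):
    """Estimate the USD cost of a workload (hypothetical helper)."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    price = PRICES[model]
    output_cost = output_tokens * price["output"] / 1_000_000
    if batch:
        # Assumption: cached-token discount is ignored for batch requests.
        input_cost = input_tokens * price["input"] / 1_000_000
        return (input_cost + output_cost) * (1 - BATCH_DISCOUNT)
    uncached = input_tokens - cached_input_tokens
    billable_input = uncached + cached_input_tokens * (1 - CACHE_DISCOUNT)
    return billable_input * price["input"] / 1_000_000 + output_cost


baseline = estimate_cost("gpt-4o", 1_000_000, 1_000_000)
optimized = estimate_cost("gpt-4o-mini", 1_000_000, 1_000_000, batch=True)
print(f"GPT-4o: ${baseline:.2f}, GPT-4o-mini via Batch: ${optimized:.3f}")
```

Swapping in your own monthly token counts gives a quick before-and-after figure for any combination of model, caching, and batch.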

Lever 1: Switch to GPT-4o-mini for appropriate tasks

Potential savings: 80–95% of token cost on eligible workloads

GPT-4o-mini costs $0.15/M input and $0.60/M output tokens, about 16.7x cheaper than GPT-4o on both input ($2.50/M) and output ($10.00/M). For a huge class of tasks, the quality difference is immaterial:

Classification and labeling
Entity extraction
Summarization of structured content
Simple Q&A over provided context (RAG responses)
Code generation for well-specified problems
Translation
Form filling and data structuring

The correct process: identify which of your API calls actually need GPT-4o's capabilities, and move everything else to GPT-4o-mini. Run evals on a sample of your existing outputs to compare quality — in most product contexts, users won't notice the difference on extraction and summarization tasks.
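The eval step can start very simply. The sketch below assumes you have already collected outputs from both models for the same sample of inputs and measures how often they agree after light normalization. `agreement_rate` and the sample labels are hypothetical, and exact-match comparison only suits short outputs such as labels or extracted fields; summaries need a rubric or human review instead.

```python
def agreement_rate(reference_outputs, candidate_outputs):
    """Fraction of samples where the candidate matches the reference.

    Outputs are compared after stripping whitespace and lowercasing.
    """
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both lists must cover the same samples")
    if not reference_outputs:
        raise ValueError("need at least one sample")
    matches = sum(
        ref.strip().lower() == cand.strip().lower()
        for ref, cand in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


# Hypothetical labels from the same four inputs.
gpt4o_labels = ["Positive", "negative", "neutral", "positive"]
mini_labels = ["positive", "negative", "positive", "positive "]

rate = agreement_rate(gpt4o_labels, mini_labels)
print(f"agreement: {rate:.0%}")
```

If agreement on a representative sample is high enough for your product, that route is a candidate for migration; the disagreements are worth reading by hand before deciding.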

Note: GPT-3.5-turbo is no longer the cost-efficient default. GPT-4o-mini is cheaper on both input and output tokens ($0.15 vs. $0.50 and $0.60 vs. $1.50 per million), has a much larger context window (128K vs. 16K), and produces better output. If you're still on GPT-3.5-turbo, migrating to GPT-4o-mini is a quality upgrade and a cost reduction simultaneously.

Lever 2: Cache responses for repeated queries

Potential savings: up to 100% of cost on repeated identical queries

There are two layers of caching available:

Application-level caching (semantic or exact-match): if the same user query or prompt appears more than once, don't call the API again. Cache the response in Redis, Memcached, or a key-value store keyed on the prompt hash. For a search autocomplete feature or an FAQ bot, this can eliminate 50–80% of API calls entirely.

OpenAI prompt caching (automatic): OpenAI automatically caches prompt prefixes for prompts over 1,024 tokens. When a subsequent request shares the same prefix (e.g., a long system prompt followed by different user messages), the cached input tokens are charged at 50% off. You don't need to opt in — structure your prompts to put the stable content first (system prompt, documents, context) and the variable content at the end to maximize cache hits.

# Maximize prompt cache hits: stable content first, variable last
messages = [
    {"role": "system", "content": LONG_STABLE_SYSTEM_PROMPT},  # cached
    {"role": "user", "content": user_query}  # variable — goes last
]
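For the application-level layer, an exact-match cache can be sketched as follows. This is an illustration under stated assumptions: `ResponseCache` and `fake_api` are hypothetical names, a plain dict stands in for Redis or Memcached, and `fake_api` replaces the real API call so the example runs offline.

```python
import hashlib
import json


class ResponseCache:
    """Exact-match cache keyed on a hash of the model and messages."""

    def __init__(self):
        self._store = {}  # stand-in for Redis, Memcached, or another KV store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, messages):
        # sort_keys gives a stable serialization, so equal requests hash equally
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call_api):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_api(model, messages)  # only reached on a cache miss
        self._store[key] = result
        return result


def fake_api(model, messages):
    # Stand-in for client.chat.completions.create(...); returns a canned string
    return f"answer to: {messages[-1]['content']}"


cache = ResponseCache()
msgs = [{"role": "user", "content": "What are your opening hours?"}]
first = cache.get_or_call("gpt-4o-mini", msgs, fake_api)
second = cache.get_or_call("gpt-4o-mini", msgs, fake_api)  # served from cache
print(cache.hits, cache.misses)
```

In production you would likely add an expiry time to each entry so stale answers age out, and skip the cache for routes where users expect varied responses.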

Lever 3: Reduce max_tokens to match actual output needs

Potential savings: 20–60% of output token cost

You are charged for the tokens the model actually generates, not for the max_tokens value itself: setting max_tokens=1024 on a one-sentence answer costs only the tokens in that sentence. The model is not told what max_tokens is, so a large value does not by itself cause padding. What a generous limit does do is leave your worst case uncapped: a rambling or runaway completion is billed all the way to the limit. Output length is driven mainly by the prompt, so ask for brevity explicitly ("Answer with one word") and use max_tokens as a hard ceiling on top of that.

Set max_tokens just above the longest output your task legitimately needs, not a generous upper bound. A completion that hits the limit is cut off (finish_reason is "length"), so leave a small margin:

# Bad: allocating 1024 tokens for a yes/no classification
response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=1024,  # worst case: a runaway completion bills up to 1024 tokens
    messages=[...]
)

# Good: constrain to actual need
response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=10,  # enough for "yes", "no", or "uncertain"
    messages=[...]
)

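For structured outputs (JSON extraction, classification, etc.), use response_format={"type": "json_object"} to constrain the output to valid JSON, which also reduces token waste from conversational padding. JSON mode requires the word "JSON" to appear somewhere in your messages; the API rejects the request otherwise.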
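One way to choose a max_tokens value that matches actual need is to derive it from output lengths you have already observed (for example, logged completion_tokens values). The sketch below is an assumption about how you might do that: `suggest_max_tokens`, the percentile, and the headroom factor are all hypothetical choices, and the sample numbers are invented.

```python
import math


def suggest_max_tokens(observed_completion_tokens, percentile=0.99, headroom=1.2):
    """Suggest a max_tokens ceiling from observed output lengths.

    Uses the nearest-rank percentile of the samples, then adds headroom so
    legitimate outputs are not truncated.
    """
    if not observed_completion_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_completion_tokens)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)


# Invented completion lengths for a short-answer classification route.
samples = [3, 4, 4, 5, 5, 5, 6, 6, 7, 12]
print(suggest_max_tokens(samples))
```

Re-run this periodically: if the share of responses with finish_reason "length" rises, the ceiling is too tight for the current prompt.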

Lever 4: Trim system prompts — every token counts

Potential savings: 10–40% of input token cost per request

System prompts are sent with every request. A 500-token system prompt running at 10,000 requests/day costs 5M input tokens/day. At GPT-4o pricing ($2.50/M input), that's $12.50/day or about $4,560/year just for the system prompt. Every token you cut from it multiplies across all requests.
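The arithmetic is easy to rerun for your own traffic. A minimal sketch, assuming GPT-4o's standard input rate of $2.50 per million tokens and no caching or batch discount; `system_prompt_cost` is a hypothetical helper, and the 300-token trimmed prompt is an invented example.

```python
GPT_4O_INPUT_PRICE = 2.50  # USD per million input tokens, standard rate


def system_prompt_cost(prompt_tokens, requests_per_day,
                       price_per_million=GPT_4O_INPUT_PRICE, days=365):
    """Return (daily_cost, annual_cost) in USD for the system prompt alone."""
    daily_tokens = prompt_tokens * requests_per_day
    daily_cost = daily_tokens * price_per_million / 1_000_000
    return daily_cost, daily_cost * days


daily, annual = system_prompt_cost(500, 10_000)
print(f"${daily:.2f}/day, ${annual:,.2f}/year")

# Invented example: trimming the prompt from 500 to 300 tokens.
trimmed_daily, trimmed_annual = system_prompt_cost(300, 10_000)
print(f"saved per year: ${annual - trimmed_annual:,.2f}")
```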

Audit your system prompt for:

Redundancy: instructions repeated in multiple places
Verbose preambles: role-setting boilerplate ("You are a helpful, harmless, and honest assistant who always tries to...") that can be compressed to one line
Examples that could be removed: few-shot examples in the system prompt are expensive; move them to a retrieval system or eliminate them if the model handles the task without them
Filler language: "Please make sure to..." and "It is important that you..." are often redundant with the actual instruction

Use the OpenAI tokenizer to count your system prompt tokens before and after trimming.
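Token counts can also be measured in code. The sketch below uses the third-party tiktoken package when it is installed and its encoding data can be loaded; otherwise it falls back to a rough estimate of about four characters per token, which is an assumption that only holds loosely for English text. `count_tokens` and both prompts are hypothetical.

```python
def count_tokens(text, model="gpt-4o"):
    """Count tokens with tiktoken if available, else estimate roughly."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken missing, model unknown to it, or encoding data unavailable:
        # fall back to a crude ~4 characters per token heuristic.
        return max(1, len(text) // 4)


verbose_prompt = (
    "You are a helpful, harmless, and honest assistant who always tries to "
    "provide the best possible answer. Please make sure to classify the "
    "sentiment of the text. It is important that you respond with only one word."
)
trimmed_prompt = "Classify the sentiment of the text. Respond with one word."

before = count_tokens(verbose_prompt)
after = count_tokens(trimmed_prompt)
print(f"before: {before} tokens, after: {after} tokens")
```

Multiply the difference by your daily request count to see what the trim is worth.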

Lever 5: Use streaming to detect early stop conditions

Potential savings: 10–30% of output token cost for variable-length outputs

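With streaming enabled, you receive tokens as they're generated and can close the stream once a condition is met, for example once you've received the complete JSON object you need, or once a validity check shows the response has gone off track. Closing the connection should stop generation shortly afterwards, so you avoid paying for output you would have discarded; tokens generated before the cutoff are still billed. Non-streaming mode waits for the full completion and bills all of it.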

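import json

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[...],
    stream=True,  # yields chunks as tokens are generated
)
buffer = ""
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks carry no choices (e.g. a final usage chunk)
    buffer += chunk.choices[0].delta.content or ""
    # Stop as soon as the buffer parses as a complete JSON value
    if buffer.rstrip().endswith("}"):
        try:
            result = json.loads(buffer)
        except json.JSONDecodeError:
            continue  # only a nested object closed; keep reading
        stream.close()  # close the connection to stop further generation
        break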

This pattern is most useful for extraction tasks where the useful content appears before the end of the generated text, or for long-form generation where a validity check can short-circuit unnecessary continuation. It requires more complex client-side code than non-streaming, so reserve it for high-volume routes where the savings justify the complexity.

Lever 6: Use the Batch API for non-real-time workloads

Potential savings: 50% on both input and output tokens

OpenAI's Batch API processes requests asynchronously within a 24-hour window, returning results when complete. In exchange for this latency, every token — input and output — is charged at 50% of the standard rate. The quality is identical to the synchronous API; you're just trading response time for cost.

The Batch API is the right choice for:

Nightly data processing pipelines
Bulk content generation or annotation
Offline analysis runs
Backfill jobs on historical data
Any task where the user isn't waiting for an immediate response

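from openai import OpenAI
import json

client = OpenAI()

texts_to_process = ["first document", "second document"]  # placeholder inputs

# Build one request per input; custom_id lets you match results back later
batch_requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": text}]}}
    for i, text in enumerate(texts_to_process)
]

# Serialize as JSONL (one JSON object per line) and upload with a .jsonl filename
jsonl_bytes = "\n".join(json.dumps(r) for r in batch_requests).encode("utf-8")
batch_file = client.files.create(
    file=("batch_input.jsonl", jsonl_bytes),
    purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
# Poll client.batches.retrieve(batch.id) until its status is "completed",
# then download results with client.files.content(batch.output_file_id)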

For any workload that's currently running nightly or in a queue, switching to the Batch API cuts the token cost in half. The only added complexity is a submit, poll, and download step in place of a direct call.
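Once a batch completes, the results arrive as a JSONL file with one record per request, in no guaranteed order. The sketch below shows one way to map them back to your inputs by custom_id. The record layout (a `response` object with `status_code` and `body`, plus an `error` field) reflects my understanding of the output format and should be checked against the current documentation; `parse_batch_output` and the sample records are hypothetical.

```python
import json


def parse_batch_output(jsonl_text):
    """Map each custom_id to its completion text, or None if the request failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        message = response["body"]["choices"][0]["message"]
        results[record["custom_id"]] = message["content"]
    return results


# Invented sample: one success and one failure.
sample_output = "\n".join([
    json.dumps({
        "custom_id": "req-0",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant",
                                              "content": "positive"}}]},
        },
        "error": None,
    }),
    json.dumps({
        "custom_id": "req-1",
        "response": None,
        "error": {"code": "example_error", "message": "request failed"},
    }),
])

parsed = parse_batch_output(sample_output)
print(parsed)
```

Requests that come back as None can be collected and resubmitted in a follow-up batch.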

Lever 7: Monitor usage and set hard limits

Potential savings: prevents runaway spend from bugs, infinite loops, or abuse

The usage dashboard at platform.openai.com/usage shows token consumption by day and by model. Check it weekly when optimizing — you'll often find one endpoint or one model consuming a disproportionate share of spend that isn't obvious from application metrics alone.

Two controls to set before any other optimization:

Soft limit (email alert): triggers a notification when you reach a threshold. Set it at your expected monthly spend so surprises surface early.
Hard limit (API cutoff): stops all API requests when reached. Set it at 2–3x your expected spend — enough buffer that normal usage variation doesn't hit it, tight enough to cap a bug-driven spike.

Both are set at platform.openai.com/account/limits. A hard limit at $50 when your expected bill is $20 is not paranoid — it's the only protection against a runaway loop, a compromised API key, or an unexpected traffic spike driving an unbounded bill.

Also consider logging the usage field from every API response in your application:

response = client.chat.completions.create(...)
print(response.usage.prompt_tokens, response.usage.completion_tokens, response.usage.total_tokens)

Aggregating this in your own database gives you per-feature, per-user, or per-route cost breakdowns that the OpenAI dashboard alone can't provide.
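A per-route breakdown of this kind can be sketched with an in-memory ledger. This is illustrative only: `UsageLedger`, the route names, and the token counts are hypothetical, the prices are the standard rates from the pricing table, and a real deployment would write to a database instead of a dict.

```python
from collections import defaultdict

# (input, output) USD per million tokens, standard rates
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}


class UsageLedger:
    """Accumulate token usage and estimated cost per application route."""

    def __init__(self):
        self._totals = defaultdict(
            lambda: {"prompt_tokens": 0, "completion_tokens": 0, "cost_usd": 0.0}
        )

    def record(self, route, model, prompt_tokens, completion_tokens):
        input_price, output_price = PRICES[model]
        entry = self._totals[route]
        entry["prompt_tokens"] += prompt_tokens
        entry["completion_tokens"] += completion_tokens
        entry["cost_usd"] += (
            prompt_tokens * input_price + completion_tokens * output_price
        ) / 1_000_000

    def by_route(self):
        return {route: dict(totals) for route, totals in self._totals.items()}


ledger = UsageLedger()
ledger.record("/summarize", "gpt-4o", 1_200, 300)
ledger.record("/summarize", "gpt-4o", 800, 200)
ledger.record("/classify", "gpt-4o-mini", 400, 5)

for route, totals in ledger.by_route().items():
    print(route, totals)
```

In application code you would call `ledger.record(...)` with `response.usage.prompt_tokens` and `response.usage.completion_tokens` after each request. Key the price lookup on the model name you requested, since the `model` field in the response may carry a dated snapshot name.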

Cost reduction by use case: typical impact

Use case	Default model	Optimized approach	Estimated cost reduction
Text classification	GPT-4o	GPT-4o-mini + Batch API	~96%
RAG Q&A (with long context)	GPT-4o	GPT-4o-mini + prompt cache	~85%
Summarization	GPT-4o	GPT-4o-mini + max_tokens trim	~90%
Code generation	GPT-4o	GPT-4o-mini (simple) / GPT-4o (complex)	~50–80%
Entity extraction (JSON)	GPT-4o	GPT-4o-mini + json_object mode + Batch	~96%
Complex reasoning / analysis	GPT-4o	GPT-4o + Batch API (non-real-time)	~50%

FAQ

Is GPT-4o-mini accurate enough for production use? For well-defined tasks with clear prompts, yes. It scores within a few points of GPT-4o on standard benchmarks for classification, extraction, and summarization. The gap is most pronounced for open-ended reasoning, ambiguous instructions, and tasks that require drawing on world knowledge. Run an eval on your specific task before committing.

Does prompt caching happen automatically? Yes. OpenAI's prompt caching applies automatically to eligible requests (1,024+ token prompts). You'll see it reflected in the usage.prompt_tokens_details.cached_tokens field of the response object, and the discounted rate appears in your invoice line items.

Can I use all 7 levers together? Yes, and they stack multiplicatively. Switching from GPT-4o to GPT-4o-mini (about 16.7x cheaper) combined with the Batch API (2x cheaper) gives roughly a 33x cost reduction on eligible workloads: a $340/month bill becomes approximately $10/month.

Last updated June 2, 2026. Pricing figures are from OpenAI's published rates as of June 2026 and are subject to change. Verify current pricing at platform.openai.com/docs/pricing before using these figures for budget planning.