OpenAI Whisper API: Python transcription guide

The Whisper API is OpenAI's hosted speech-to-text service built on the open-source Whisper model. It accepts audio files and returns a transcript in the original language or translated to English. This guide covers the basic transcription call, response formats, word-level timestamps, language hints, the translation endpoint, and how to handle files larger than 25 MB.

The 30-second answer

Basic transcription

Open the audio file in binary mode and pass it directly to the API. The response is a Transcription object with a .text attribute:

from openai import OpenAI

client = OpenAI()

with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)

The model parameter is always "whisper-1" — it's the only hosted Whisper model available. Response time scales with file duration; a 1-minute file typically returns in 5–15 seconds.

Response formats

Control the output format with response_format:

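# Each call below assumes audio_file is an open binary file handle, e.g. open("interview.mp3", "rb")

# Plain text (string only)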
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)
# transcript is a plain string

# JSON with just the text (default)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="json"
)
# transcript.text is the string

# SRT subtitles
srt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)
# srt is a plain string in SRT format

# VTT subtitles
vtt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt"
)

# Verbose JSON (needed for timestamps)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json"
)
# transcript.text, transcript.segments, transcript.language, transcript.duration

For subtitle generation, srt and vtt return formatted strings ready to write to file. For programmatic post-processing, verbose_json gives you segments with start/end times.
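Because response_format changes the return type (a plain string for text, srt, and vtt; an object with a .text attribute for json and verbose_json), downstream code may want one way to get at the transcript body. A minimal sketch, in which transcript_text is a hypothetical helper name and SimpleNamespace stands in for a real API response so the example runs without a network call:

```python
from types import SimpleNamespace


def transcript_text(result) -> str:
    """Return the transcript body whether the API gave back a str or an object."""
    if isinstance(result, str):
        # "text", "srt" and "vtt" come back as plain strings
        return result
    # "json" and "verbose_json" come back as objects exposing .text
    return result.text


# Stand-ins for real responses (assumption: not actual API output)
as_string = "Hello everyone.\n"
as_object = SimpleNamespace(text="Hello everyone.")

print(transcript_text(as_string).strip())
print(transcript_text(as_object))
```

With a helper like this, the code that writes the result to disk does not need to know which response_format was requested.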

Word-level timestamps

To get per-word timestamps, use verbose_json and set timestamp_granularities:

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

# transcript.words is a list of TranscriptionWord objects (fields: word, start, end)
for word in transcript.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
# 0.00s - 0.30s: Hello
# 0.30s - 0.55s: everyone
# ...

You can request both word and segment timestamps in one call: timestamp_granularities=["word", "segment"]. The segments list gives you phrase- or sentence-sized chunks with start/end times, which group better in subtitle files than word-by-word output. Segment timestamps add no extra latency, while word timestamps make the request somewhat slower.
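If the segment boundaries are too coarse for your subtitles, the word list can be regrouped by hand. The following is only a sketch of one way to do that: words_to_srt and format_srt_time are hypothetical helpers, the fixed max_words grouping is an arbitrary choice, and the sample objects stand in for transcript.words (assumed to expose word, start, and end).

```python
from types import SimpleNamespace


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7) -> str:
    """Group word timestamps into fixed-size cues and render them as SRT."""
    cues = []
    for index in range(0, len(words), max_words):
        group = words[index : index + max_words]
        start = format_srt_time(group[0].start)   # cue starts with its first word
        end = format_srt_time(group[-1].end)      # and ends with its last word
        text = " ".join(w.word for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues
    return "\n".join(cues)


# Stand-in for transcript.words (assumption: not actual API output)
sample = [
    SimpleNamespace(word="Hello", start=0.0, end=0.3),
    SimpleNamespace(word="everyone", start=0.3, end=0.55),
    SimpleNamespace(word="welcome", start=0.8, end=1.2),
]
print(words_to_srt(sample, max_words=2))
```

A production version would more likely break cues on pauses or punctuation than on a fixed word count.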

Language hint

Providing the audio language improves accuracy and reduces latency, because Whisper does not need to auto-detect it:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="fr"   # ISO-639-1 code
)

Use ISO 639-1 two-letter codes: "en", "fr", "de", "es", "ja", "zh", etc. Whisper supports 57 languages. If you omit language, the model detects it automatically — this works well but adds a small overhead.
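Applications often hold a locale tag such as "pt-BR" or "en_US" rather than a bare two-letter code. A small sketch of reducing such a tag to the form the language parameter expects; to_iso639_1 is a hypothetical helper, and it only checks the shape of the code, not whether Whisper supports that language:

```python
def to_iso639_1(tag: str) -> str:
    """Reduce a locale tag such as 'pt-BR' or 'en_US' to its two-letter language code."""
    # Treat '_' and '-' alike, then keep only the primary language subtag
    primary = tag.replace("_", "-").split("-")[0].strip().lower()
    if len(primary) != 2 or not (primary.isascii() and primary.isalpha()):
        raise ValueError(f"not a two-letter language code: {tag!r}")
    return primary


print(to_iso639_1("pt-BR"))
print(to_iso639_1("en_US"))
```

The result can then be passed as language=to_iso639_1(user_locale) in the transcription call.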

Translation to English

The translations endpoint always outputs English, regardless of the audio language. Swap transcriptions for translations:

with open("french_interview.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
        # No 'language' parameter — output is always English
    )

print(translation.text)  # English translation

The translations endpoint takes a smaller set of parameters than transcriptions: there is no language parameter, since the output is always English, and timestamp_granularities is not available. Use this endpoint when your pipeline is English-only and the audio may be in various languages.

Handling files over 25 MB

The API rejects files over 25 MB. For longer recordings, chunk the audio before sending:

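import math
import os
import tempfile

from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def transcribe_long_audio(filepath: str, chunk_minutes: int = 10) -> str:
    audio = AudioSegment.from_file(filepath)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub measures audio length in milliseconds
    num_chunks = math.ceil(len(audio) / chunk_ms)

    full_transcript = []

    # Write chunks to a temporary directory that is removed when the block exits
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i in range(num_chunks):
            chunk = audio[i * chunk_ms : (i + 1) * chunk_ms]
            chunk_path = os.path.join(tmp_dir, f"chunk_{i}.mp3")
            # export() returns an open file object; close it before reopening for upload
            chunk.export(chunk_path, format="mp3").close()

            with open(chunk_path, "rb") as f:
                result = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f
                )
            full_transcript.append(result.text)

    return " ".join(full_transcript)

transcript = transcribe_long_audio("two_hour_meeting.mp3")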

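This requires pydub (pip install pydub) and a local ffmpeg install. The function cuts at fixed intervals, so a sentence that straddles a boundary is split between two chunks; keeping chunks long (10–15 minutes) keeps the number of such boundaries small. Each exported chunk must still stay under 25 MB: at 128 kbps, a 10-minute MP3 is about 9.6 MB. For more accurate joining, request verbose_json and rebuild the transcript from the timestamped segments.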
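When chunks are transcribed with verbose_json, each chunk's segment times restart at zero, so they need to be shifted by the chunk's position in the original recording before they can be joined. A minimal sketch under stated assumptions: merge_chunk_segments is a hypothetical helper, every chunk except possibly the last is exactly chunk_seconds long (as with the fixed slicing above), and each segment exposes start, end, and text. The sample objects stand in for real API responses.

```python
from types import SimpleNamespace


def merge_chunk_segments(chunk_results, chunk_seconds: float):
    """Shift each chunk's segment times by the chunk's start offset and flatten."""
    merged = []
    for chunk_index, result in enumerate(chunk_results):
        offset = chunk_index * chunk_seconds  # where this chunk began in the full audio
        for seg in result.segments:
            merged.append(
                {
                    "start": seg.start + offset,
                    "end": seg.end + offset,
                    "text": seg.text.strip(),
                }
            )
    return merged


# Stand-ins for two verbose_json responses from 10-minute chunks (not actual API output)
chunk_0 = SimpleNamespace(segments=[SimpleNamespace(start=0.0, end=4.2, text=" First part.")])
chunk_1 = SimpleNamespace(segments=[SimpleNamespace(start=1.0, end=3.5, text=" Second part.")])

for seg in merge_chunk_segments([chunk_0, chunk_1], chunk_seconds=600):
    print(f"{seg['start']:.1f}s - {seg['end']:.1f}s: {seg['text']}")
```

The merged list carries timestamps relative to the full recording, so it can feed a subtitle writer or a search index directly.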

FAQ

What audio formats does the Whisper API support? mp3, mp4, mpeg, mpga, m4a, wav, webm. The API reference also lists flac and ogg. Max file size: 25 MB.

How do I get word-level timestamps? Use response_format="verbose_json" and timestamp_granularities=["word"]. The response includes a words list with start/end times per word.

Transcriptions vs translations? Transcriptions returns text in the audio's original language. Translations always returns English text, regardless of the input language. Both use the same whisper-1 model.

Last updated May 28, 2026. Code examples verified against the OpenAI Python SDK v1.x and Whisper API documentation. API behaviour may change — confirm against the official docs before deploying to production.