What audio formats does the OpenAI Whisper API support?

The Whisper API supports mp3, mp4, mpeg, mpga, m4a, wav, and webm. The maximum file size is 25 MB. The limit applies to file size rather than duration, so for anything larger, split the file into chunks or re-encode it at a lower bitrate before sending. There is no streaming transcription endpoint in the standard Whisper API.

How do I get word-level timestamps from the Whisper API?

Set response_format to 'verbose_json' and set timestamp_granularities to ['word'] or ['segment'] or both. Word-level timestamps return a 'words' list in the response, each with 'word', 'start', and 'end' keys in seconds. Note that word timestamps require response_format='verbose_json' — they do not work with the plain 'json' or 'text' response formats.

What is the difference between transcriptions and translations in the Whisper API?

transcriptions.create returns the transcript in the same language as the audio. translations.create always returns an English transcript regardless of the audio language. The two endpoints share most parameters (file, model, prompt, response_format, temperature), but translations accepts neither 'language' nor 'timestamp_granularities', since the output language is always English. Use translations when you need to work with non-English audio in an English-language pipeline.

OpenAI Whisper API: Python transcription guide

The Whisper API is OpenAI's hosted speech-to-text service built on the open-source Whisper model. It accepts audio files and returns a transcript in the original language or translated to English. This guide covers the basic transcription call, response formats, word-level timestamps, language hints, the translation endpoint, and how to handle files larger than 25 MB.

The 30-second answer

Endpoint: client.audio.transcriptions.create(model="whisper-1", file=audio_file)
Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. Max 25 MB.
Word timestamps: response_format="verbose_json" + timestamp_granularities=["word"].
Translation to English: use client.audio.translations.create() instead of transcriptions.

Basic transcription

Open the audio file in binary mode and pass it directly to the API. The response is a Transcription object with a .text attribute:

from openai import OpenAI

client = OpenAI()

with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)

For Whisper transcription the model parameter is "whisper-1", the only hosted Whisper model available. Response time scales with file duration; as a rough guide, a 1-minute file typically returns in 5–15 seconds.
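Before making the call, it can help to check the file against the format and size limits listed earlier, so that an oversized or unsupported file fails locally rather than after an upload. The following is a minimal sketch: check_audio_file is a hypothetical helper, not part of the OpenAI SDK, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which should be confirmed against the current API documentation.

```python
from pathlib import Path

# Formats and size limit as stated in this guide; confirm against the API docs.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # assumption: 25 MB is interpreted as 25 MiB


def check_audio_file(path: str) -> list:
    """Return a list of problems found; an empty list means the file looks OK."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]

    problems = []
    # Only the extension is checked here, not the actual container or codec.
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"{path}: unsupported extension {p.suffix!r}")

    size = p.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"{path}: {size} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems
```

Calling check_audio_file("interview.mp3") and transcribing only when the returned list is empty keeps the error handling in one place.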

Response formats

Control the output format with response_format:

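# Each call below assumes audio_file is a freshly opened binary file handle,
# e.g. inside `with open("interview.mp3", "rb") as audio_file:`.
# A handle already consumed by an earlier call may need audio_file.seek(0) before reuse.

# Plain text (string only)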
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)
# transcript is a plain string

# JSON with just the text (default)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="json"
)
# transcript.text is the string

# SRT subtitles
srt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)
# srt is a plain string in SRT format

# VTT subtitles
vtt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt"
)

# Verbose JSON (needed for timestamps)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json"
)
# transcript.text, transcript.segments, transcript.language, transcript.duration

For subtitle generation, srt and vtt return formatted strings ready to write to file. For programmatic post-processing, verbose_json gives you segments with start/end times.
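When segments need to be filtered or edited before subtitles are written, one option is to request verbose_json and build the SRT text locally. The sketch below assumes each segment exposes .start, .end (seconds) and .text attributes, as the verbose_json segments do; format_srt_time and segments_to_srt are hypothetical helper names, and the demo data is made up for illustration.

```python
from types import SimpleNamespace


def format_srt_time(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects that have .start, .end and .text."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_srt_time(seg.start)} --> {format_srt_time(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Stand-in data shaped like verbose_json segments (illustrative only).
demo_segments = [
    SimpleNamespace(start=0.0, end=1.5, text=" Hello everyone."),
    SimpleNamespace(start=1.5, end=3.0, text=" Welcome back."),
]
print(segments_to_srt(demo_segments))
```

With a real response, segments_to_srt(transcript.segments) would be the equivalent call; if no editing is needed, response_format="srt" is simpler.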

Word-level timestamps

To get per-word timestamps, use verbose_json and set timestamp_granularities:

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

# transcript.words is a list of TranscriptionWord objects (.word, .start, .end)
for word in transcript.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
# 0.00s - 0.30s: Hello
# 0.30s - 0.55s: everyone
# ...

You can request both word and segment timestamps in one call: timestamp_granularities=["word", "segment"]. The segments list gives you phrase-level chunks (roughly a sentence or clause each) with start/end times, which is useful for building subtitle files with better grouping than word-by-word.
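If segment boundaries are too coarse for captions, word timestamps can be grouped into shorter cues. The sketch below is one possible grouping rule, not something the API provides: group_words is a hypothetical helper, the max_words and max_seconds defaults are arbitrary, and each word is assumed to expose .word, .start and .end.

```python
def group_words(words, max_words: int = 7, max_seconds: float = 3.0):
    """Group word timestamps into (start, end, text) cues.

    A new cue starts when the current one already holds max_words words,
    or when adding the next word would stretch it past max_seconds.
    """
    cues = []
    current = []
    for w in words:
        too_many = len(current) >= max_words
        too_long = bool(current) and (w.end - current[0].start > max_seconds)
        if current and (too_many or too_long):
            cues.append(_to_cue(current))
            current = []
        current.append(w)
    if current:
        cues.append(_to_cue(current))
    return cues


def _to_cue(words):
    """Collapse a run of words into one (start, end, text) tuple."""
    text = " ".join(w.word.strip() for w in words)
    return (words[0].start, words[-1].end, text)
```

With a real response this would be called as group_words(transcript.words); the resulting tuples can then be formatted as SRT or VTT cues.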

Language hint

Providing the audio language improves accuracy and reduces latency — Whisper does not need to auto-detect it:

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="fr"   # ISO-639-1 code
)

Use ISO 639-1 two-letter codes: "en", "fr", "de", "es", "ja", "zh", etc. Whisper supports 57 languages. If you omit language, the model detects it automatically — this works well but adds a small overhead.
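Applications often hold a locale tag such as "fr-FR" rather than a bare two-letter code. The sketch below reduces such a tag to its two-letter primary part and omits the language argument when that is not possible, so the model falls back to auto-detection. Both function names are hypothetical, and the check is purely syntactic: it does not verify that the code is one Whisper actually supports.

```python
import re
from typing import Optional


def to_language_hint(tag: Optional[str]) -> Optional[str]:
    """Reduce a tag such as 'fr-FR' or 'pt_BR' to a two-letter code.

    Returns None when no two-letter primary subtag can be extracted.
    """
    if not tag:
        return None
    primary = re.split(r"[-_]", tag.strip())[0].lower()
    return primary if re.fullmatch(r"[a-z]{2}", primary) else None


def language_kwargs(tag: Optional[str]) -> dict:
    """Build the optional keyword arguments for transcriptions.create."""
    hint = to_language_hint(tag)
    # An empty dict means "let the model detect the language".
    return {"language": hint} if hint else {}
```

The result can be unpacked into the call, for example client.audio.transcriptions.create(model="whisper-1", file=audio_file, **language_kwargs("fr-FR")).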

Translation to English

The translations endpoint always outputs English, regardless of the audio language. Swap transcriptions for translations:

with open("french_interview.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
        # No 'language' parameter — output is always English
    )

print(translation.text)  # English translation

The translations endpoint accepts most of the same parameters as transcriptions (prompt, response_format, temperature), but not language or timestamp_granularities, because the output is always English. Use this when your pipeline is English-only and the audio may be in various languages.
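When a pipeline needs both behaviours, the choice between the two endpoints can sit behind one function. This is a minimal sketch under stated assumptions: speech_to_text is a hypothetical wrapper, client is expected to be an openai.OpenAI instance passed in by the caller, and the language hint is forwarded only to transcriptions because translations does not take one.

```python
def speech_to_text(client, path: str, to_english: bool = False, language=None) -> str:
    """Transcribe in the source language, or translate to English.

    client:     an openai.OpenAI instance (assumed; passed in by the caller)
    to_english: when True, use the translations endpoint
    language:   optional ISO 639-1 hint, used for transcription only
    """
    with open(path, "rb") as audio_file:
        if to_english:
            # No language argument here: the output is always English.
            result = client.audio.translations.create(
                model="whisper-1", file=audio_file
            )
        else:
            kwargs = {"language": language} if language else {}
            result = client.audio.transcriptions.create(
                model="whisper-1", file=audio_file, **kwargs
            )
    return result.text
```

For example, speech_to_text(client, "french_interview.mp3", to_english=True) would return English text, while speech_to_text(client, "french_interview.mp3", language="fr") would return French.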

Handling files over 25 MB

The API rejects files over 25 MB. For longer recordings, chunk the audio before sending:

import math
import os
import tempfile

from pydub import AudioSegment

# Reuses the `client` created in the basic transcription example.

def transcribe_long_audio(filepath: str, chunk_minutes: int = 10) -> str:
    audio = AudioSegment.from_file(filepath)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub slices audio in milliseconds
    num_chunks = math.ceil(len(audio) / chunk_ms)

    full_transcript = []

    # Write chunks to a temporary directory that is deleted when the block exits
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i in range(num_chunks):
            chunk = audio[i * chunk_ms : (i + 1) * chunk_ms]
            chunk_path = os.path.join(tmp_dir, f"chunk_{i}.mp3")
            chunk.export(chunk_path, format="mp3")

            with open(chunk_path, "rb") as f:
                result = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f
                )
            full_transcript.append(result.text)

    return " ".join(full_transcript)

transcript = transcribe_long_audio("two_hour_meeting.mp3")

This requires pydub (pip install pydub) and ffmpeg installed locally. Fixed-length cuts can land mid-sentence, so the text around each chunk edge may be clipped. Chunks of 10–15 minutes keep the number of edges small, and cutting at pauses in speech reduces the problem further. For more accurate joining, use verbose_json and reconstruct from segments with timestamps.
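Reconstructing from segments mostly comes down to shifting each chunk's timestamps by that chunk's position in the full recording. The sketch below shows only that offset step, assuming every chunk except possibly the last has the same length and that each segment exposes .start, .end and .text; merge_chunk_segments is a hypothetical helper, and it does not attempt to repair sentences split across a chunk edge.

```python
def merge_chunk_segments(chunk_segments, chunk_seconds: float):
    """Place per-chunk segments on the timeline of the full recording.

    chunk_segments: one list of segments per chunk, in playback order
    chunk_seconds:  length of each chunk (the last one may be shorter)
    Returns a flat list of (start, end, text) tuples.
    """
    merged = []
    for index, segments in enumerate(chunk_segments):
        offset = index * chunk_seconds  # where this chunk begins overall
        for seg in segments:
            merged.append((seg.start + offset, seg.end + offset, seg.text.strip()))
    return merged
```

In the chunking loop this would mean requesting response_format="verbose_json", collecting result.segments for each chunk, and calling merge_chunk_segments(all_segments, chunk_minutes * 60) at the end.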

FAQ

What audio formats does the Whisper API support? mp3, mp4, mpeg, mpga, m4a, wav, webm. Max file size: 25 MB.

How do I get word-level timestamps? Use response_format="verbose_json" and timestamp_granularities=["word"]. The response includes a words list with start/end times per word.

Transcriptions vs translations? Transcriptions returns the audio in its original language. Translations always returns English output, regardless of the input language. Both use the same whisper-1 model.

Last updated May 28, 2026. Code examples verified against the OpenAI Python SDK v1.x and Whisper API documentation. API behaviour may change — confirm against the official docs before deploying to production.