OpenAI Whisper API: Python transcription guide
The Whisper API is OpenAI's hosted speech-to-text service built on the open-source Whisper model. It accepts audio files and returns a transcript in the original language or translated to English. This guide covers the basic transcription call, response formats, word-level timestamps, language hints, the translation endpoint, and how to handle files larger than 25 MB.
The 30-second answer
- Endpoint: client.audio.transcriptions.create(model="whisper-1", file=audio_file)
- Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm. Max 25 MB.
- Word timestamps: response_format="verbose_json" + timestamp_granularities=["word"].
- Translation to English: use client.audio.translations.create() instead of transcriptions.
Basic transcription
Open the audio file in binary mode and pass it directly to the API. The response is a Transcription object with a .text attribute:
from openai import OpenAI

client = OpenAI()

with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcript.text)
Set model to "whisper-1", the only hosted Whisper model. Response time scales with audio duration; a 1-minute file typically returns in 5–15 seconds.
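Since the API rejects files that are too large or in an unsupported format, a local check before uploading can save a failed request. The sketch below is one way to do that; check_audio_file, MAX_BYTES, and SUPPORTED_EXTENSIONS are hypothetical names, the limits are the ones quoted in this guide, and treating 25 MB as 25 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Limits as quoted in this guide; confirm against the current API docs.
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" means 25 * 1024 * 1024 bytes
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_audio_file(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]
    problems = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"{p.name}: extension {p.suffix!r} is not in the supported list")
    size = p.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"{p.name}: {size / 1_048_576:.1f} MB exceeds the 25 MB limit")
    return problems
```

Calling check_audio_file("interview.mp3") before opening the file lets you report every problem at once instead of waiting for the API error. The check only looks at the extension, not the actual container format.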
Response formats
Control the output format with response_format:
# Each call below assumes audio_file is a freshly opened binary file handle

# Plain text (string only)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)
# transcript is a plain string

# JSON with just the text (default)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="json"
)
# transcript.text is the string

# SRT subtitles
srt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)
# srt is a plain string in SRT format

# VTT subtitles
vtt = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="vtt"
)
# vtt is a plain string in WebVTT format

# Verbose JSON (needed for timestamps)
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json"
)
# transcript.text, transcript.segments, transcript.language, transcript.duration
For subtitle generation, srt and vtt return formatted strings ready to write to a file. For programmatic post-processing, verbose_json gives you segments with start/end times.
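As an illustration of post-processing verbose_json output, the sketch below turns segments into a timestamped text listing. The helper names (_field, format_timestamp, segments_to_lines) are hypothetical, and the demo data only stands in for transcript.segments. Because the shape of a segment (attribute-style object or plain dict) can vary with the SDK version, the sketch reads fields either way.

```python
from types import SimpleNamespace


def _field(item, name):
    """Read a field from either a dict or an attribute-style object."""
    return item[name] if isinstance(item, dict) else getattr(item, name)


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or H:MM:SS once past an hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def segments_to_lines(segments) -> list[str]:
    """Turn verbose_json segments into '[start] text' lines."""
    return [
        f"[{format_timestamp(_field(seg, 'start'))}] {_field(seg, 'text').strip()}"
        for seg in segments
    ]


# Stand-in data shaped like transcript.segments (illustrative values only).
demo = [
    SimpleNamespace(start=0.0, end=4.2, text=" Welcome to the show."),
    {"start": 64.5, "end": 70.0, "text": " Let's get started."},
]
print("\n".join(segments_to_lines(demo)))
# [00:00] Welcome to the show.
# [01:04] Let's get started.
```

With a real response you would pass transcript.segments in place of demo.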
Word-level timestamps
To get per-word timestamps, use verbose_json and set timestamp_granularities:
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

# transcript.words is a list of TranscriptionWord objects
for word in transcript.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
# 0.00s - 0.30s: Hello
# 0.30s - 0.55s: everyone
# ...
You can request both word and segment timestamps in the same call: timestamp_granularities=["word", "segment"]. The segments list gives you phrase- or sentence-length chunks with start/end times, which is useful for building subtitle files with better grouping than word-by-word.
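If you only have word timestamps and want your own grouping, one possible approach is to start a new subtitle cue whenever the line gets too long or the speaker pauses. The sketch below assumes word objects with word, start, and end fields, as in the loop above; Word, group_words, max_chars, and max_gap are hypothetical names, and the thresholds are arbitrary starting points rather than recommended values.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for the SDK's word objects (fields: word, start, end)."""
    word: str
    start: float
    end: float


def _flush(current):
    """Collapse a run of words into one (start, end, text) cue."""
    return (current[0].start, current[-1].end, " ".join(w.word for w in current))


def group_words(words, max_chars: int = 42, max_gap: float = 0.8):
    """Group words into cues.

    A new cue starts when adding the next word would exceed max_chars,
    or when the pause since the previous word is longer than max_gap seconds.
    """
    cues, current = [], []
    for w in words:
        if current:
            text_len = len(" ".join(x.word for x in current)) + 1 + len(w.word)
            gap = w.start - current[-1].end
            if text_len > max_chars or gap > max_gap:
                cues.append(_flush(current))
                current = []
        current.append(w)
    if current:
        cues.append(_flush(current))
    return cues


# Illustrative data only; real input would be transcript.words.
words = [
    Word("Hello", 0.0, 0.3), Word("everyone", 0.3, 0.55),
    Word("Thanks", 1.9, 2.2), Word("for", 2.2, 2.3), Word("joining", 2.3, 2.7),
]
for start, end, text in group_words(words):
    print(f"{start:.2f}-{end:.2f}: {text}")
# 0.00-0.55: Hello everyone
# 1.90-2.70: Thanks for joining
```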
Language hint
Providing the audio language improves accuracy and reduces latency — Whisper does not need to auto-detect it:
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="fr"  # ISO-639-1 code
)
Use ISO 639-1 two-letter codes: "en", "fr", "de", "es", "ja", "zh", etc. Whisper supports 57 languages. If you omit language, the model detects it automatically — this works well but adds a small overhead.
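Applications often store a locale tag such as "pt-BR" rather than a bare language code. A minimal sketch of reducing such a tag to the two-letter form before passing it as language might look like this; to_language_hint is a hypothetical helper, and it only checks the shape of the code, not whether Whisper supports that language.

```python
def to_language_hint(locale: str) -> str:
    """Reduce a locale tag such as 'pt-BR' or 'en_US' to a two-letter language code."""
    primary = locale.strip().replace("_", "-").split("-")[0].lower()
    if len(primary) != 2 or not (primary.isascii() and primary.isalpha()):
        raise ValueError(f"{locale!r} does not start with a two-letter language code")
    return primary


print(to_language_hint("pt-BR"))  # pt
print(to_language_hint("en_US"))  # en
```

Raising on anything that is not two letters keeps a malformed value from reaching the API; falling back to auto-detection by omitting language is an equally reasonable choice.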
Translation to English
The translations endpoint always outputs English, regardless of the audio language. Swap transcriptions for translations:
with open("french_interview.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
        # No 'language' parameter: output is always English
    )

print(translation.text)  # English translation
The translations endpoint takes the same file, model, prompt, response_format, and temperature parameters as transcriptions, but it has no language parameter (the output is always English) and does not take timestamp_granularities. Use it when your pipeline is English-only and the audio may be in various languages.
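When one code path has to serve both cases, a small wrapper can pick the endpoint and leave language out of translation requests. This is only a sketch: speech_to_text and its to_english flag are hypothetical, and client is assumed to be an OpenAI() instance like the one created earlier.

```python
from typing import Optional


def speech_to_text(client, audio_file, to_english: bool = False,
                   language: Optional[str] = None) -> str:
    """Call translations when English output is wanted, transcriptions otherwise."""
    kwargs = {"model": "whisper-1", "file": audio_file}
    if to_english:
        # No language argument is sent here; translation output is always English.
        return client.audio.translations.create(**kwargs).text
    if language is not None:
        kwargs["language"] = language  # optional hint for transcription only
    return client.audio.transcriptions.create(**kwargs).text
```

A call such as speech_to_text(client, audio_file, to_english=True) returns the English text, while speech_to_text(client, audio_file, language="fr") returns the French transcript.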
Handling files over 25 MB
The API rejects files over 25 MB. For longer recordings, chunk the audio before sending:
import math
import os
import tempfile

from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def transcribe_long_audio(filepath: str, chunk_minutes: int = 10) -> str:
    audio = AudioSegment.from_file(filepath)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub slices by milliseconds
    num_chunks = math.ceil(len(audio) / chunk_ms)
    full_transcript = []
    # Chunk files live in a temporary directory that is removed afterwards
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i in range(num_chunks):
            chunk = audio[i * chunk_ms : (i + 1) * chunk_ms]
            chunk_path = os.path.join(tmp_dir, f"chunk_{i}.mp3")
            chunk.export(chunk_path, format="mp3")
            with open(chunk_path, "rb") as f:
                result = client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f
                )
            full_transcript.append(result.text)
    return " ".join(full_transcript)

transcript = transcribe_long_audio("two_hour_meeting.mp3")
This requires pydub (pip install pydub) and a local ffmpeg install. Keep chunks to 10–15 minutes and, where possible, cut at pauses rather than at fixed offsets, so fewer sentences are split at chunk edges. For more accurate joining, use verbose_json and reconstruct from segments with timestamps.
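The timestamp-based joining mentioned above could look roughly like the sketch below. Each chunk's segment times are relative to the start of that chunk, so they need shifting by the chunk's position in the full recording. merge_chunk_segments is a hypothetical helper; it assumes fixed-length chunks (every chunk except the last is exactly chunk_seconds long) and segments represented as plain dicts, so with the SDK's objects you would read the same fields as attributes instead.

```python
def merge_chunk_segments(chunk_segments, chunk_seconds: float):
    """Shift each chunk's segment times onto the full recording's timeline.

    chunk_segments: one list per chunk, in order; each segment is a dict
    with 'start', 'end' and 'text' (times relative to the chunk start).
    """
    merged = []
    for index, segments in enumerate(chunk_segments):
        offset = index * chunk_seconds  # where this chunk begins in the recording
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"].strip(),
            })
    return merged


# Illustrative data for two 10-minute (600 s) chunks.
chunks = [
    [{"start": 0.0, "end": 3.5, "text": " First part."}],
    [{"start": 1.0, "end": 4.0, "text": " Second part."}],
]
for seg in merge_chunk_segments(chunks, chunk_seconds=600):
    print(seg)
# {'start': 0.0, 'end': 3.5, 'text': 'First part.'}
# {'start': 601.0, 'end': 604.0, 'text': 'Second part.'}
```

The merged list keeps one timeline for the whole recording, which makes it straightforward to drop or merge segments that straddle a chunk edge.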
FAQ
What audio formats does the Whisper API support? mp3, mp4, mpeg, mpga, m4a, wav, webm. Max file size: 25 MB.
How do I get word-level timestamps? Use response_format="verbose_json" and timestamp_granularities=["word"]. The response includes a words list with start/end times per word.
Transcriptions vs translations? Transcriptions returns the audio in its original language. Translations always returns English output, regardless of the input language. Both use the same whisper-1 model.
Last updated May 28, 2026. Code examples verified against the OpenAI Python SDK v1.x and Whisper API documentation. API behaviour may change — confirm against the official docs before deploying to production.