Speech to Text
Transcribe Audio
Convert speech audio to text with optional segment- or word-level timestamps. Auto-detects language.
POST
Speech-to-text with long-form support. Returns the transcript plus
optional segment timestamps (default) or word-level timestamps. Long
audio is segmented server-side and stitched in the response.
Authorization
Bearer token.
Bearer API_key.
Request Body
URL of the audio to transcribe. Any common codec accepted
(mp3 / wav / m4a / ogg / mp4); auto-converted to 16 kHz mono.
ISO 639-1 language hint. Omit for auto-detect.
segment (default) returns ~30 s buckets; word returns per-word
start/end timestamps when the backing engine supports it.
When true, returns the English translation alongside the original
transcript. Default: false.
Notes
has_speech: false indicates the engine detected no speech in the audio (e.g. instrumental music, silence). The text field will be empty in that case.
granularity: 'word' may fall back to segment-level timing when the backing engine doesn’t expose word timestamps; the response includes granularity_fallback with the reason.
translate: true keeps the original-language text and adds a translation field with the English version.
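As a rough illustration of how the request body and the notes above fit together, the sketch below builds a request and interprets a response using only the Python standard library. Several details are assumptions, not taken from this page: the endpoint URL is a placeholder (the page shows only the POST method), the body is assumed to be JSON, and the field names audio_url and language are hypothetical. The names granularity, translate, has_speech, text, granularity_fallback and translation are the ones used in the Notes.

```python
import json
import urllib.request

# Hypothetical endpoint: the page gives the method (POST) but not the path.
API_URL = "https://api.example.com/speech-to-text"


def build_payload(audio_url, language=None, granularity="segment", translate=False):
    """Assemble the request body as a dict.

    "audio_url" and "language" are assumed field names; "granularity" and
    "translate" are the names that appear in the Notes.
    """
    if granularity not in ("segment", "word"):
        raise ValueError("granularity must be 'segment' or 'word'")
    payload = {
        "audio_url": audio_url,
        "granularity": granularity,
        "translate": translate,
    }
    if language is not None:
        # ISO 639-1 hint; leaving the key out requests auto-detection.
        payload["language"] = language
    return payload


def build_request(api_key, payload):
    """Create a POST request with the Bearer token. Nothing is sent here."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),  # JSON body is an assumption
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize(response):
    """Turn a decoded response dict into a one-line summary, per the Notes."""
    if response.get("has_speech") is False:
        # The text field is empty when no speech was detected.
        return "no speech detected"
    parts = [response.get("text", "")]
    if "granularity_fallback" in response:
        # Word-level timing was requested but the engine returned segments.
        parts.append(f"(timing fell back to segments: {response['granularity_fallback']})")
    if "translation" in response:
        parts.append(f"[en] {response['translation']}")
    return " ".join(parts)
```

To send the request, pass the result of build_request to urllib.request.urlopen and decode the body with json.load before handing it to summarize. Checking has_speech first avoids treating an empty text field as a failed transcription.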