POST /v1/audio/transcribe
curl --request POST \
  --url https://geoff.ai/api/v1/audio/transcribe \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "audio_url": "https://files.geoff.ai/audio/interview.mp3",
    "granularity": "word"
  }'

{
  "data": {
    "text": "Welcome to the show. Today we're talking about...",
    "has_speech": true,
    "language": "en",
    "duration_s": 124.3,
    "granularity": "word",
    "segments": [
      { "start_s": 0.0, "end_s": 30.0, "text": "Welcome to the show. Today we're..." }
    ],
    "words": [
      { "word": "Welcome", "start_s": 0.12, "end_s": 0.58 },
      { "word": "to", "start_s": 0.61, "end_s": 0.73 }
    ]
  },
  "trace_id": "04ede0ab069fb1ba8be5156a24b1e081",
  "extra_info": {
    "audio_megabytes": 2.4
  }
}
Speech-to-text with long-form support. Returns the transcript with segment-level timestamps (the default) or word-level timestamps. Long audio is segmented server-side and the pieces are stitched back together in the response.
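
To make the timestamp fields concrete, here is a minimal sketch that turns the data object of a response into timestamped lines. It uses only the field names shown in the sample response (words, segments, start_s, end_s, word, text). The helper names format_timestamp and timed_lines are hypothetical and not part of the API.

```python
def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def timed_lines(data: dict) -> list[str]:
    """Return one line per word when word timings exist, else one per segment.

    `data` is the "data" object of a transcribe response.
    """
    words = data.get("words")
    if words:
        return [
            f"[{format_timestamp(w['start_s'])} -> {format_timestamp(w['end_s'])}] {w['word']}"
            for w in words
        ]
    return [
        f"[{format_timestamp(s['start_s'])} -> {format_timestamp(s['end_s'])}] {s['text']}"
        for s in data.get("segments", [])
    ]
```

Preferring words over segments is a choice made for this sketch; a caller that wants coarse captions could read segments directly even when word timings are present.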

Authorization

Authorization
string
required
Bearer token, in the form: Bearer API_key.

Request Body

audio_url
string
required
URL of the audio to transcribe. Any common codec is accepted (mp3 / wav / m4a / ogg / mp4); the audio is auto-converted to 16 kHz mono.
language
string
ISO 639-1 language hint. Omit for auto-detect.
granularity
string
segment (default) returns ~30 s buckets; word returns per-word start/end timestamps when the backing engine supports it.
translate
boolean
When true, returns the English translation alongside the original transcript. Default: false.
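
The request parameters above can be assembled in Python roughly as follows. This is a sketch using only the standard library, with the endpoint URL taken from the curl example. The function names build_request and transcribe are hypothetical, and rejecting unknown granularity values on the client side is an assumption of this sketch, not documented server behavior.

```python
import json
import urllib.request

# Endpoint URL as shown in the curl example.
ENDPOINT = "https://geoff.ai/api/v1/audio/transcribe"
GRANULARITIES = {"segment", "word"}


def build_request(token, audio_url, language=None, granularity="segment", translate=False):
    """Build (but do not send) a POST request for the transcribe endpoint."""
    if granularity not in GRANULARITIES:
        # Client-side check; an assumption of this sketch.
        raise ValueError(f"granularity must be one of {sorted(GRANULARITIES)}")

    payload = {"audio_url": audio_url, "granularity": granularity}
    if language is not None:
        payload["language"] = language  # ISO 639-1 hint; omitted for auto-detect
    if translate:
        payload["translate"] = True  # default is false, so only sent when set

    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(token, audio_url, **options):
    """Send the request and return the decoded JSON response."""
    request = build_request(token, audio_url, **options)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.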

Notes

  • has_speech: false indicates the engine detected no speech in the audio (e.g. instrumental music, silence). The text field will be empty in that case.
  • granularity: word may fall back to segment-level timing when the backing engine doesn’t expose word timestamps; in that case the response includes granularity_fallback with the reason.
  • translate: true keeps the original-language text and adds a translation field with the English version.
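
The three notes above suggest a few checks a client may want to run on each response. The sketch below is one way to do that. The notes do not say where granularity_fallback and translation sit in the response, so this code assumes they appear either inside data or at the top level and checks both. The function name interpret and the shape of its return value are hypothetical.

```python
def interpret(response: dict) -> dict:
    """Summarise a transcribe response according to the notes above."""
    data = response.get("data") or {}

    def lookup(key):
        # Assumption: optional fields may sit in "data" or at the top level.
        return data[key] if key in data else response.get(key)

    if data.get("has_speech") is False:
        # No speech detected (silence, instrumental music): text is empty.
        return {
            "has_speech": False,
            "text": "",
            "timing": None,
            "fallback_reason": None,
            "translation": None,
        }

    fallback_reason = lookup("granularity_fallback")
    has_words = bool(data.get("words"))
    # Treat timing as word-level only if words are present and no fallback was reported.
    timing = "word" if has_words and fallback_reason is None else "segment"

    return {
        "has_speech": True,
        "text": data.get("text", ""),
        "timing": timing,
        "fallback_reason": fallback_reason,
        "translation": lookup("translation"),  # None unless translate was true
    }
```

Timing is inferred from the presence of words rather than from the echoed granularity field, because the notes do not state whether that field reflects the requested or the effective granularity after a fallback.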