POST /v1/voice/extend
curl --request POST \
  --url https://geoff.ai/api/v1/voice/extend \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "reference_audio_url": "https://files.geoff.ai/audio/intro_segment.wav",
    "text": "And now, back to the main story.",
    "format": "wav"
  }'
{
  "data": {
    "audio_url": "https://files.geoff.ai/output/extended_abc123.wav",
    "audio_b64": "...",
    "format": "wav",
    "sample_rate": 24000,
    "duration_s": 3.4
  },
  "trace_id": "04ede0ab069fb1ba8be5156a24b1e081"
}
Continue an audio clip in the same voice: pass a short reference clip plus the new text, and the endpoint returns audio that sounds like the original speaker saying the new content. This is lower-friction than cloning and synthesizing in two calls when the goal is a single clip that feels contiguous.
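The sample response above returns the audio both as a URL (audio_url) and inline (audio_b64). As a minimal sketch of consuming the inline form, the helper below writes the decoded bytes to disk. It assumes that audio_b64 is standard base64 of the complete audio file, which the field name suggests but this page does not state; the function name save_extended_audio and the output file name are hypothetical.

```python
import base64
from pathlib import Path


def save_extended_audio(response: dict, out_dir: str = ".") -> Path:
    """Write the inline audio from a voice/extend response to disk.

    Assumption: data.audio_b64 is standard base64 of the whole audio file.
    """
    data = response["data"]
    audio_bytes = base64.b64decode(data["audio_b64"])
    # Use the reported format as the file extension (wav or mp3).
    out_path = Path(out_dir) / f"extended.{data['format']}"
    out_path.write_bytes(audio_bytes)
    return out_path
```

If the inline payload is not needed, downloading audio_url instead avoids the base64 decoding step.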

Authorization

Authorization
string
required
Bearer token, sent in the form: Bearer API_key.

Request Body

reference_audio_url
string
required
URL of the source audio that supplies the voice and style. Clips of 5–30 seconds work best.
text
string
required
New text to render in the source voice.
format
string
Output audio format. Options: wav, mp3. Default: wav.
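To illustrate the request body described above, here is a minimal Python sketch that validates the three parameters and builds the POST request with the standard library. The function name build_extend_request and the client-side validation are illustrative assumptions, not part of the API; the URL, headers, and field names are taken from the curl sample on this page.

```python
import json
import urllib.request

EXTEND_URL = "https://geoff.ai/api/v1/voice/extend"
ALLOWED_FORMATS = ("wav", "mp3")


def build_extend_request(token, reference_audio_url, text, audio_format="wav"):
    """Build (but do not send) a voice/extend request."""
    # Both fields are marked required in the parameter list.
    if not reference_audio_url:
        raise ValueError("reference_audio_url is required")
    if not text:
        raise ValueError("text is required")
    # Documented options are wav and mp3; wav is the default.
    if audio_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")

    body = {
        "reference_audio_url": reference_audio_url,
        "text": text,
        "format": audio_format,
    }
    return urllib.request.Request(
        EXTEND_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen, and the response body parsed with json.load. Keeping construction separate from sending makes the payload easy to inspect before any network call.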

When to use what

| Goal | Tool |
| --- | --- |
| Continue an existing clip in the same voice | voice_extend (this endpoint) |
| Render arbitrary text in a saved catalog voice | T2A with voice_id |
| Quick single-clip clone + render in one call | One Shot Voice |
| Persist a voice for repeated use | Voice Clone → T2A |

Tips

  • Reference length matters: longer reference clips capture more of the speaker’s prosody. 15+ seconds substantially improves long-form fidelity over 5-second references.
  • Prosody fidelity: voice extend preserves the cadence and emotional register of the reference. This is useful for podcast-style continuations, where consistent delivery matters more than an exact match to the speaker's voice.
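The reference-length guidance above can be checked before a clip is uploaded. The sketch below measures a local WAV file with Python's standard wave module and compares it with the 5–30 second sweet spot and the 15-second long-form threshold mentioned on this page. The helper names and the wording of the returned labels are hypothetical, and the check assumes a local copy of the clip that reference_audio_url will point to.

```python
import wave

SWEET_SPOT_S = (5.0, 30.0)   # range given in the parameter description
LONG_FORM_MIN_S = 15.0       # threshold given in the tips


def wav_duration_s(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def assess_reference(duration_s: float) -> str:
    """Classify a reference clip length against the documented guidance."""
    low, high = SWEET_SPOT_S
    if duration_s < low:
        return "shorter than the sweet spot"
    if duration_s > high:
        return "longer than the sweet spot"
    if duration_s < LONG_FORM_MIN_S:
        return "ok for short output"
    return "good for long-form output"
```

For example, assess_reference(wav_duration_s("intro_segment.wav")) would flag a 3-second clip as shorter than the sweet spot before any API call is made.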