Voice Cloning
One Shot Voice
Render text in the voice of a supplied reference clip in a single call. Combines voice cloning and speech synthesis with no intermediate voice_id.
POST
Internally, the endpoint clones the reference clip and synthesizes the text in that voice. This is equivalent to chaining voice clone and T2A, but without threading a voice_id between requests.
Best for one-off “say this in this voice” interactions where the
caller doesn’t need to persist the cloned voice for reuse.
Authorization
Bearer token, in the form Bearer API_key.
Request Body
Text to render in the cloned voice. Long inputs are chunked automatically.
URL of the reference clip (3–10 s recommended; mp3 or wav). Mutually exclusive with reference_audio_b64.
Base64-encoded reference audio. Use when uploading directly without a URL.
Optional human-readable name for the cloned voice. A timestamped name is generated when omitted.
ISO 639-1 language code. Default: en.
Advanced: the exact words spoken in the reference clip. Auto-detected via transcription when omitted.
Output audio format. Options: wav, pcm16. Default: wav.
Audio sample rate in Hz. Options: 16000, 22050, 24000, 44100, 48000. Default: 24000.
When to use what
| Goal | Tool |
|---|---|
| One-off “say this in this voice” | one_shot_voice (this endpoint) |
| Render many lines in the same voice | Voice Clone → T2A |
| Brand voice that needs distribution-quality consistency | Voice Model Training → T2A |
| No reference clip available; design from parameters | Voice Design |
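
For the one-off case in the first row, a request could be assembled roughly as in the sketch below. It is illustrative only. The endpoint URL is a placeholder, and every JSON key except reference_audio_b64 (text, reference_audio_url, language, output_format, sample_rate) is an assumed name, since this page does not list the field names. The sketch also assumes that exactly one reference source must be supplied. It checks the documented constraints (mutually exclusive reference sources, allowed formats and sample rates) before building the request.

```python
import base64
import json
import urllib.request
from typing import Optional

# Placeholder URL: the real endpoint path is not given on this page.
ENDPOINT = "https://api.example.com/one_shot_voice"

# Options and defaults as documented in the Request Body section.
ALLOWED_FORMATS = ("wav", "pcm16")
ALLOWED_SAMPLE_RATES = (16000, 22050, 24000, 44100, 48000)


def build_payload(
    text: str,
    reference_audio_url: Optional[str] = None,
    reference_audio: Optional[bytes] = None,
    language: str = "en",
    output_format: str = "wav",
    sample_rate: int = 24000,
) -> dict:
    """Assemble a request body.

    Only the key "reference_audio_b64" appears in the documentation;
    all other key names here are assumptions made for illustration.
    """
    # URL and base64 audio are mutually exclusive. Requiring exactly one
    # of them is an assumption (the endpoint needs some reference clip).
    if (reference_audio_url is None) == (reference_audio is None):
        raise ValueError("pass exactly one of reference_audio_url or reference_audio")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {ALLOWED_FORMATS}")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {ALLOWED_SAMPLE_RATES}")

    payload = {
        "text": text,
        "language": language,
        "output_format": output_format,
        "sample_rate": sample_rate,
    }
    if reference_audio_url is not None:
        payload["reference_audio_url"] = reference_audio_url
    else:
        # Raw clip bytes are base64-encoded into an ASCII string.
        payload["reference_audio_b64"] = base64.b64encode(reference_audio).decode("ascii")
    return payload


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Wrap the payload in a POST request with a Bearer token header."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Building the request does not send it. Passing the result to urllib.request.urlopen would perform the call. How the audio comes back (raw bytes, a URL, or a JSON wrapper) is not described on this page, so response handling is left out.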