Voice Design
Generate speech in a brand-new voice described by parameters — no reference audio required.
POST
Design a custom voice character by dialing in profile parameters
(gender, age, pitch, style, emotion, accent, dialect) and optionally
a free-form prose description, then render prompt in that voice.
Useful when the caller wants a specific voice character but has no
audio sample to clone from.
Pass save_as to persist the designed voice for reuse — subsequent
calls can render arbitrary text in the saved voice via the standard
text-to-audio endpoint.
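A minimal sketch of how a request body for this endpoint might be assembled. It assumes the JSON field names match the slot names used above (prompt, gender, age, pitch, style, emotion, accent, dialect, save_as). The helper name build_voice_design_body is hypothetical, and the allowed values are copied from the Request Body section of this page. The endpoint path is not given here, so the sketch stops at building the body.

```python
import json

# Allowed values copied from the Request Body section of this page.
ALLOWED = {
    "gender": {"auto", "male", "female"},
    "age": {"auto", "child", "teenager", "young-adult", "middle-aged", "elderly"},
    "pitch": {"auto", "very-low", "low", "moderate", "high", "very-high"},
    "style": {"auto", "neutral", "whisper", "authoritative", "excited",
              "calm", "narrator", "warm", "cheerful"},
    "emotion": {"neutral", "happy", "sad", "angry", "fearful",
                "surprised", "calm", "excited"},
}
# Accent and dialect are free-form labels, so they are passed through as-is.
FREE_FORM = {"accent", "dialect"}


def build_voice_design_body(prompt, save_as=None, **slots):
    """Hypothetical helper: validate slots and return a JSON-ready dict."""
    if not prompt:
        raise ValueError("prompt must be non-empty")
    body = {"prompt": prompt}
    for name, value in slots.items():
        if name in ALLOWED:
            if value not in ALLOWED[name]:
                raise ValueError(
                    f"{name}={value!r} is not one of {sorted(ALLOWED[name])}"
                )
        elif name not in FREE_FORM:
            raise ValueError(f"unknown slot: {name}")
        body[name] = value
    # save_as is only included when the caller wants the voice persisted.
    if save_as is not None:
        body["save_as"] = save_as
    return body


body = build_voice_design_body(
    "Welcome back to the show.",
    gender="female", style="warm", accent="british",
    save_as="radio-host",
)
print(json.dumps(body, indent=2))
```

Validating on the client side is a design choice for this sketch only; how the service itself responds to an out-of-range value is not described here.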
Authorization
Bearer token (Bearer API_key). The API key can be found in Settings > API Keys.
Request Body
Text to render in the designed voice.
Free-form prose description of the voice, e.g. "warm female radio host with a soft mid-Atlantic accent". Combined with the structured slots below to compose the engine instruction.
Voice gender. Options: auto, male, female. Default: auto.
Apparent speaker age. Options: auto, child, teenager, young-adult, middle-aged, elderly. Default: auto.
Pitch register. Options: auto, very-low, low, moderate, high, very-high. Default: auto.
Delivery style. Options: auto, neutral, whisper, authoritative, excited, calm, narrator, warm, cheerful. Default: auto.
Emotional tone. Options: neutral, happy, sad, angry, fearful, surprised, calm, excited. Default: neutral.
Free-form accent label, e.g. american, british, australian, southern, scottish, indian.
Optional regional dialect refinement, e.g. cantonese, sichuanese, andalusian, received-pronunciation.
Playback speed multiplier. Range: 0.5 to 2.0. Default: 1.0.
Sampling steps. Higher values trade latency for fidelity. Range: 8 to 64. Default: 16.
ISO 639-1 language code. Default: en.
Output audio format. Options: wav, pcm16. Default: wav.
Audio sample rate in Hz. Options: 16000, 22050, 24000, 44100, 48000. Default: 24000.
Optional human-readable name. When supplied, the designed voice is persisted to the catalog with a stable voice_id so it can be reused via the standard text-to-audio endpoint.
Non-verbal effects
Inline tokens inside prompt are recognised as non-verbal cues and
rendered as the named effect rather than spoken aloud. Use them
sparingly inside the spoken text.
| Token | Effect |
|---|---|
| [laughter] | Laughter |
| [sigh] | Sigh |
| [breath] | Audible breath |
| [gasp] | Sharp inhale |
| [chuckle] | Light chuckle |
| [clear-throat] | Throat clear |
| [question] | Inflected question lift |
| [surprise] | Surprised reaction |
| [whisper] | Whisper segment |
| [shouted] | Raised voice |
| [crying] | Tearful delivery |
Example: "Grand day! [laughter] this is so wonderful. [clear-throat] you know?"
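Because a mistyped token would presumably not be treated as a cue, a client-side pre-check can be handy. The sketch below scans a prompt for bracketed tokens and separates those listed in the table from anything else. The function names scan_tokens and spoken_text are hypothetical, and how the service treats an unrecognised bracketed word is not stated here.

```python
import re

# Tokens copied from the table above.
KNOWN_TOKENS = {
    "laughter", "sigh", "breath", "gasp", "chuckle", "clear-throat",
    "question", "surprise", "whisper", "shouted", "crying",
}

# Matches a lowercase, optionally hyphenated word in square brackets.
TOKEN_RE = re.compile(r"\[([a-z-]+)\]")


def scan_tokens(prompt):
    """Return (recognised, unrecognised) bracketed tokens in order of appearance."""
    found = TOKEN_RE.findall(prompt)
    known = [t for t in found if t in KNOWN_TOKENS]
    unknown = [t for t in found if t not in KNOWN_TOKENS]
    return known, unknown


def spoken_text(prompt):
    """Strip recognised tokens, leaving only the words to be spoken."""
    def drop(match):
        return "" if match.group(1) in KNOWN_TOKENS else match.group(0)
    # Collapse the whitespace left behind by removed tokens.
    return " ".join(TOKEN_RE.sub(drop, prompt).split())


example = "Grand day! [laughter] this is so wonderful. [clear-throat] you know?"
known, unknown = scan_tokens(example)
print(known, unknown)
print(spoken_text(example))
```

Counting the recognised tokens against the length of spoken_text is one rough way to check that cues are being used sparingly.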
Auto behaviour
Any slot left as auto (or omitted) is filled by the engine based on
the rest of the profile. Setting all four primary axes (gender,
age, pitch, style) to auto and supplying only a free-form
description is a valid pattern — the engine infers a coherent voice
from the prose.
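Since an omitted slot and a slot set to auto are treated the same way, a client can normalise a profile before sending it and see which primary axes are left to the engine. This is a minimal sketch; the function names are hypothetical, and the description key in the usage example is an assumed field name for the free-form prose description.

```python
# The four primary axes named in the paragraph above.
PRIMARY_AXES = ("gender", "age", "pitch", "style")


def strip_auto(profile):
    """Drop slots set to "auto" or None, since omitted and auto are equivalent."""
    return {k: v for k, v in profile.items() if v is not None and v != "auto"}


def engine_inferred_axes(profile):
    """List the primary axes the engine will fill in for this profile."""
    explicit = strip_auto(profile)
    return [axis for axis in PRIMARY_AXES if axis not in explicit]


# "description" is an assumed field name, used here for illustration only.
profile = {
    "description": "warm female radio host with a soft mid-Atlantic accent",
    "gender": "auto",
    "pitch": "low",
}
print(strip_auto(profile))
print(engine_inferred_axes(profile))
```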
Notes
- voice_id is only present in the response when save_as was supplied. Without it, the audio is rendered and returned, but no catalog entry is created.
- instruction echoes the composed instruction string sent to the engine, which is useful for debugging which slot combinations the model actually saw.
- fallback: true indicates the dispatcher routed through a profile-matched zero-shot path rather than the native instruct path (transparent to the caller; quality is comparable).
- Designed voices saved via save_as round-trip into the catalog and can be enumerated via the Voice List endpoint.
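The response fields described in these notes could be read along the following lines. The sketch assumes the response metadata arrives as a JSON object with voice_id, instruction, and fallback as top-level keys. The helper name and the sample payloads are made up for illustration.

```python
def summarize_response(resp):
    """Summarise the response fields described in the notes from a parsed dict."""
    return {
        # voice_id only appears when save_as was supplied in the request.
        "saved": "voice_id" in resp,
        "voice_id": resp.get("voice_id"),
        # The composed instruction string, handy when debugging slot combinations.
        "instruction": resp.get("instruction", ""),
        # True when the zero-shot fallback path was used.
        "used_fallback": bool(resp.get("fallback", False)),
    }


# Made-up sample payloads, not real service output.
saved = summarize_response(
    {"voice_id": "example-voice-id", "instruction": "female, warm", "fallback": False}
)
unsaved = summarize_response({"instruction": "male, calm", "fallback": True})
print(saved)
print(unsaved)
```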