| Endpoint | Produces | Use with |
|---|---|---|
| Image LoRA | Style or subject adapter for image generation | Image generation tools that accept lora_name |
| Video LoRA | Motion or style adapter for video generation | Video tools that accept lora_name |
| Music LoRA | Genre / style adapter for music generation | Music generation with lora_name |
| Voice Model | Cloned voice persisted to the catalog | T2A with the returned voice_id |
Image LoRA
Train a style or subject adapter from a dataset of images. Once training completes, pass lora_name: "my_brand_style"
to any image generation tool that supports LoRA stacking.
Video LoRA
Same shape as image LoRA, on a video dataset. Apply the trained adapter in generate_video_from_text or generate_video_from_image by
passing lora_name: "my_motion_style".
Music LoRA
Train a music style adapter (genre, instrumentation, or production treatment) from a corpus of reference tracks. Then pass lora_name: "lofi_2026" to music generation.
Voice Model
Voice training is the heaviest of the four: it captures a speaker’s timbre across many samples rather than a single clip. For quick single-clip cloning, use Voice Clone instead. Once training completes, pass the returned voice_id to any text-to-audio call.
Checking training status
All four endpoints return a task_id. Poll status the same way as for
other long-running jobs.
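As a rough illustration of the polling pattern, here is a minimal sketch in Python. It is not the platform's client: `fetch_status` is a placeholder for whatever call returns the job status for a task_id, and the `status` key and the terminal state names are assumptions. Only `task_id` and `cost_estimate` come from this guide.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; the real status values are not listed in this guide.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_training(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status(task_id) until it reports a terminal state.

    fetch_status is a hypothetical callable standing in for the real
    status request. The timeout counts time spent sleeping between
    polls, not time spent inside fetch_status itself.
    """
    waited = 0.0
    while True:
        response = fetch_status(task_id)
        # "status" is an assumed key; cost_estimate is the field named in the Tips.
        print(
            f"{task_id}: status={response.get('status')} "
            f"cost_estimate={response.get('cost_estimate')}"
        )
        if response.get("status") in TERMINAL_STATES:
            return response
        if waited >= timeout_s:
            raise TimeoutError(f"task {task_id} still running after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Passing the status call in as a parameter keeps the loop independent of any particular client, and the injectable `sleep` makes it easy to exercise without real delays.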
Tips
- Dataset size: 10–30 reference items is the sweet spot for style adapters. More isn’t always better — diversity matters more than count.
- Naming: choose stable snake_case names; the name you supply is what you’ll pass to every later generation call.
- Voice samples: 3–10 clips of 5–30 seconds each, clean audio, consistent room tone. Pre-process noisy field recordings with voice denoise first.
- Costs: training jobs are metered per step. Watch the status response’s cost_estimate field while the job runs.
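
The naming and voice-sample tips above can be turned into local pre-flight checks before a job is submitted. The sketch below is illustrative only: the function names are hypothetical, the regular expression is one reasonable reading of "snake_case", and the limits simply mirror the 3–10 clips and 5–30 seconds quoted in the tips rather than anything the service is known to enforce.

```python
import re
from typing import List, Sequence

# One reading of snake_case: lowercase letters and digits, single
# underscores between words, starting with a letter.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def check_adapter_name(name: str) -> bool:
    """Return True if name looks like a snake_case identifier."""
    return SNAKE_CASE.fullmatch(name) is not None


def check_voice_clips(durations_s: Sequence[float]) -> List[str]:
    """Return a list of problems with a set of voice clip durations.

    An empty list means the set matches the guideline of 3-10 clips,
    each 5-30 seconds long. Audio quality is not checked here.
    """
    problems = []
    if not 3 <= len(durations_s) <= 10:
        problems.append(f"expected 3-10 clips, got {len(durations_s)}")
    for index, duration in enumerate(durations_s):
        if not 5 <= duration <= 30:
            problems.append(f"clip {index} is {duration:g}s, outside 5-30s")
    return problems
```

For example, `check_adapter_name("my_brand_style")` passes while `check_adapter_name("My-Style")` does not, and `check_voice_clips([6, 12, 30])` returns an empty list.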