Text to speech

text → speech via voxcpm2: voice cloning, 48 kHz output, pure rust on homelab


target text:

the text to synthesize. voxcpm2 is trained on 30 languages (arabic, chinese, english, french, german, hindi, italian, japanese, korean, portuguese, russian, spanish, thai, vietnamese, and 16 more) plus 9 chinese dialects. no language pin is required, because the tokenizer is multilingual.

reference voice (3–10 s of clean speech, wav/mp3/flac/m4a/ogg):

drag the voice sample you want to clone here, or click to browse

reference transcript (optional):

empty → reference mode: the voice is cloned from the audio latent only. populated → ref_continuation mode: the transcript is prepended to the target text inside the prefill, which anchors prosody more tightly to the reference. either works; leaving it empty is simpler and matches voxcpm2's default web demo when only an audio sample is provided.
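
a minimal sketch of the mode switch described above, not the actual server code. `PromptMode`, `select_mode` and `prefill_text` are hypothetical names; treating a whitespace-only transcript as empty and joining transcript and target with a single space are both assumptions.

```rust
/// The two prompt modes described above (hypothetical type name).
#[derive(Debug, PartialEq, Eq)]
pub enum PromptMode {
    /// No transcript: voice is cloned from the audio latent only.
    Reference,
    /// Transcript given: it is prepended to the target text in the prefill.
    RefContinuation,
}

/// Pick the mode from the optional transcript field.
/// Assumption: a whitespace-only transcript counts as empty.
pub fn select_mode(transcript: Option<&str>) -> PromptMode {
    match transcript.map(str::trim) {
        Some(t) if !t.is_empty() => PromptMode::RefContinuation,
        _ => PromptMode::Reference,
    }
}

/// Build the text that would go into the prefill.
/// Assumption: transcript and target are joined with a single space.
pub fn prefill_text(transcript: Option<&str>, target: &str) -> String {
    match transcript.map(str::trim) {
        Some(t) if !t.is_empty() => format!("{} {}", t, target),
        _ => target.to_string(),
    }
}
```

calling `select_mode(None)` gives `PromptMode::Reference`; passing any non-blank transcript gives `PromptMode::RefContinuation`.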

cfg (classifier-free guidance):

higher = tighter voice-cloning match; lower = more varied prosody. upstream default is 2.0. below 1.5 the voice drifts; above 3.5 diction gets robotic.
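
for intuition, the textbook classifier-free guidance blend looks like the sketch below. this is the standard formulation, assumed here for illustration; voxcpm2's own implementation may differ in detail. `apply_cfg` is a hypothetical name, `cond` is the prediction with the reference voice attached and `uncond` the prediction without it.

```rust
/// Standard classifier-free guidance blend (illustrative, not upstream code):
/// out = uncond + cfg * (cond - uncond).
/// cfg = 1.0 returns the conditional prediction unchanged; larger values
/// push the result further toward the conditioning.
pub fn apply_cfg(cond: &[f32], uncond: &[f32], cfg: f32) -> Vec<f32> {
    assert_eq!(cond.len(), uncond.len(), "predictions must have equal length");
    cond.iter()
        .zip(uncond)
        .map(|(c, u)| u + cfg * (c - u))
        .collect()
}
```

with cfg at 2.0 the gap between the two predictions is doubled, which is one way to read "tighter voice-cloning match".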

temperature:

noise multiplier fed into each CFM step. 1.0 is unit gaussian (upstream default). 0 → deterministic but flat; 1.5+ → more expressive but can slur on long sentences.
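
a tiny sketch of what "noise multiplier" means here, assuming the temperature simply scales a unit-gaussian draw before it is used. `scale_noise` is a hypothetical helper, not an upstream function.

```rust
/// Scale a buffer of unit-gaussian noise by the temperature.
/// temperature = 1.0 leaves the noise untouched; 0.0 zeroes it out,
/// which is why the output becomes deterministic.
pub fn scale_noise(unit_noise: &[f32], temperature: f32) -> Vec<f32> {
    unit_noise.iter().map(|n| n * temperature).collect()
}
```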

min ar steps:

the ar loop ignores the model's stop flag until this many steps have elapsed. this guards against pathological 0.1 s outputs, where a well-formed prompt would otherwise be cut short.

max ar steps:

hard cap: the loop stops here even if the model hasn't emitted its stop flag. each step = 160 ms of audio at 48 kHz, so 1500 steps = a 240 s audio ceiling. the server also enforces its own cap of 2048 steps (about 328 s).
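
the min/max step rules and the duration arithmetic can be sketched as below. this is an illustration, not the server's loop: `run_ar_loop` and `max_audio_seconds` are hypothetical names, the `step` closure stands in for one model step and returns whether the stop flag was raised, and the server cap is modelled as a silent clamp (it may just as well reject the request).

```rust
/// Audio produced per ar step, from the text above.
pub const MS_PER_STEP: u32 = 160;
/// Server-side cap on ar steps, from the text above.
pub const SERVER_MAX_STEPS: u32 = 2048;

/// Longest audio a given max-steps setting can yield, in seconds.
/// Assumption: values above the server cap are clamped.
pub fn max_audio_seconds(max_steps: u32) -> f64 {
    let steps = max_steps.min(SERVER_MAX_STEPS);
    f64::from(steps) * f64::from(MS_PER_STEP) / 1000.0
}

/// Run the loop and return how many steps were executed.
/// `step(i)` is a stand-in for one model step; it returns true when the
/// model raises its stop flag on step `i`.
pub fn run_ar_loop<F>(min_steps: u32, max_steps: u32, mut step: F) -> u32
where
    F: FnMut(u32) -> bool,
{
    let cap = max_steps.min(SERVER_MAX_STEPS);
    let mut done = 0;
    while done < cap {
        let stop = step(done);
        done += 1;
        // The stop flag is only honoured once min_steps have elapsed.
        if stop && done >= min_steps {
            break;
        }
    }
    done
}
```

with a model that raises its stop flag on every step, `run_ar_loop(5, 100, |_| true)` still runs 5 steps, i.e. 0.8 s of audio instead of a single 160 ms frame.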

seed (optional):

integer seed for the gaussian-noise stream. same seed + same inputs → byte-identical output wav, useful for a/b-ing other knobs. blank = fresh getrandom seed per request.
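
why a fixed seed reproduces the output can be shown with a small seeded gaussian stream. this sketch swaps in splitmix64 plus the box-muller transform purely for illustration; the generator the server actually uses is not stated here, and `NoiseStream` and `sample_noise` are hypothetical names.

```rust
/// Illustrative seeded noise source (splitmix64 + box-muller).
/// Not the server's generator; it only demonstrates seed determinism.
pub struct NoiseStream {
    state: u64,
}

impl NoiseStream {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// One splitmix64 step.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in (0, 1], so that ln() below never sees zero.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// One standard-normal sample via the box-muller transform.
    pub fn next_gaussian(&mut self) -> f64 {
        let u1 = self.next_unit();
        let u2 = self.next_unit();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Draw `n` samples from a stream started at `seed`.
pub fn sample_noise(seed: u64, n: usize) -> Vec<f64> {
    let mut stream = NoiseStream::new(seed);
    (0..n).map(|_| stream.next_gaussian()).collect()
}
```

two calls to `sample_noise` with the same seed return the same sequence, so any difference between two renders comes from the knob that was changed and not from the noise.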
