Text to speech
text → speech via voxcpm2 — voice-cloning, 48 khz output, pure rust on homelab
the text to synthesize. voxcpm2 is trained on 30 languages (arabic, chinese, english, french, german, hindi, italian, japanese, korean, portuguese, russian, spanish, thai, vietnamese and ~16 more) plus 9 chinese dialects. no language pin required — the tokenizer is multilingual.
higher = tighter voice-cloning match; lower = more varied prosody. upstream default is 2.0. below 1.5 the voice drifts; above 3.5 diction gets robotic.
noise multiplier fed into each CFM step. 1.0 is unit gaussian (upstream default). 0 → deterministic but flat; 1.5+ → more expressive but can slur on long sentences.
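a minimal sketch of what a noise multiplier does, assuming it simply scales each standard-normal draw. the names `SplitMix64`, `scaled_noise`, `seed` and `temperature` are hypothetical, and the generator is a stand-in; this is not voxcpm2's sampler.

```rust
/// SplitMix64: a tiny deterministic PRNG, used here only as a stand-in.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in (0, 1]; never 0, so ln() below stays finite.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Draws `n` standard-normal samples (Box-Muller) and scales each one
/// by `temperature`, the hypothetical name for the noise multiplier.
fn scaled_noise(seed: u64, temperature: f64, n: usize) -> Vec<f64> {
    let mut rng = SplitMix64(seed);
    (0..n)
        .map(|_| {
            let u1 = rng.next_unit();
            let u2 = rng.next_unit();
            let z = (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos();
            temperature * z
        })
        .collect()
}
```

with `temperature = 0.0` every sample is zero, so nothing random reaches the step; with `1.0` the samples are left as unit gaussian; larger values widen the spread by the same factor.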
the ar loop won't honour the model's stop flag until this many steps have elapsed. guards against pathological 0.1 s outputs when the model raises its stop flag too early on a well-formed prompt.
hard cap: the loop stops here even if the model hasn't emitted its stop flag. each step = 160 ms of audio at 48 khz, so 1500 steps = a 240 s audio ceiling. the server also caps this at 2048 steps (≈ 328 s).
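a rough sketch of how the minimum and maximum step knobs could interact, plus the duration arithmetic. the names `min_steps`, `max_steps`, `run_ar_loop` and `max_audio_seconds` are hypothetical. it assumes the server clamps (rather than rejects) values above 2048 and that the stop flag is checked after each generated step; the real loop may differ.

```rust
/// Milliseconds of audio produced per AR step (figure quoted in the hint).
const MS_PER_STEP: u64 = 160;
/// Server-side ceiling on the step count (figure quoted in the hint).
const SERVER_MAX_STEPS: u32 = 2048;

/// Longest audio, in seconds, that a given step cap can yield.
fn max_audio_seconds(max_steps: u32) -> f64 {
    let steps = max_steps.min(SERVER_MAX_STEPS) as u64;
    (steps * MS_PER_STEP) as f64 / 1000.0
}

/// Runs a mock AR loop and returns how many steps it generated.
/// `stop_flag(step)` stands in for the model's stop flag at a 1-based step.
fn run_ar_loop(min_steps: u32, max_steps: u32, stop_flag: impl Fn(u32) -> bool) -> u32 {
    let cap = max_steps.min(SERVER_MAX_STEPS);
    let mut step = 0;
    while step < cap {
        step += 1; // pretend one 160 ms patch of audio was generated
        // The stop flag is ignored until `min_steps` steps have elapsed.
        if step >= min_steps && stop_flag(step) {
            break;
        }
    }
    step
}
```

under these assumptions `max_audio_seconds(1500)` gives 240.0, and a stop flag raised on the very first step still yields `min_steps` steps of audio.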
integer seed for the gaussian-noise stream. same seed + same inputs → byte-identical output wav, useful for a/b-ing other knobs. blank = fresh getrandom seed per request.
$ drag the voice sample you want to clone here
$ or click to browse
empty → reference mode: voice cloned from the audio latent only. populated → ref_continuation mode: the transcript is prepended to the target text inside the prefill, which anchors prosody more tightly to the reference. either works; empty is simpler and matches voxcpm2's default web demo when only an audio sample is provided.