OpenBMB releases VoxCPM: 2B voice model with 3-second cloning and 30-language support
OpenBMB released VoxCPM on GitHub with text-described voice design, 3-second cloning, 48kHz audio, and 30-language support. The Apache 2.0 release makes multilingual voice work and local self-hosting cheaper.


TL;DR
- 7_eito_7's launch thread surfaced the headline features fast: text-prompted voice generation, 3-second cloning, and support for 30 languages.
- The same thread's spec post matched the official VoxCPM 2 docs, which list a 2B-parameter model trained on 2.36 million hours of multilingual data with 48kHz output.
- According to the Hugging Face model card, Voice Design does not need reference audio, and the cloning post says style controls like pace and brightness still work after cloning.
- The repo link shared in-thread points to an Apache-2.0 release on GitHub, where OpenBMB also claims real-time streaming as low as about 0.3 RTF on an RTX 4090, or about 0.13 with Nano-VLLM acceleration.
- Aakash Gupta's reaction thread framed the bigger market angle bluntly: an open model with commercial licensing and local deployment puts direct price pressure on paid voice APIs.
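A quick sanity check on what those streaming numbers mean: real-time factor is generation time divided by audio duration, so anything under 1.0 is faster than real time. A minimal sketch of the arithmetic (the `rtf` helper is ours, not part of VoxCPM's API):

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock seconds spent per second of audio produced."""
    return generation_seconds / audio_seconds

# At the repo's quoted ~0.3 RTF, a 60-second clip takes about 18 s to generate.
print(rtf(18.0, 60.0))  # 0.3
# Nano-VLLM's quoted ~0.13 RTF cuts that to roughly 8 s for the same minute.
print(round(60.0 * 0.13, 1))  # 7.8
```

Anything above 1.0 would mean the model falls behind live playback, which is why both quoted figures matter for streaming use.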
You can browse the GitHub repo, skim the official model page, and listen through the demo gallery. The interesting part is not just that VoxCPM 2 clones voices from a short clip. It also generates new voices from text alone, skips language tags across 30 languages, and already ships with a small ecosystem that spans Hugging Face weights, ComfyUI integrations, and a C++ runtime.
Voice Design
The cleanest creative unlock is Voice Design: describe a voice in natural language, then synthesize it without a reference clip.
The official model card says those descriptions can specify gender, age, tone, emotion, and speaking pace. The repo positions that as a first-class feature, not a prompt hack bolted onto cloning.
That matters because most voice workflows still start with hunting for a usable sample. VoxCPM 2 turns that first step into plain text.
3-second cloning with style controls
The cloning pitch is short enough to fit in one line: a few seconds of audio to recover the speaker, plus extra controls to steer delivery. The evidence thread calls out adjustments like brighter tone and faster pacing, while the official Hugging Face page adds that style guidance can steer emotion, pace, and expression without throwing away timbre.
OpenBMB also splits cloning into two modes in its repo docs: a short-clip controllable clone, and a higher-fidelity path that uses both reference audio and transcript to preserve rhythm, emotion, and speaking style more closely. For creators, that is a useful distinction between “good enough to dub this” and “keep the performance intact.”
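Since the short-clip mode only needs a few seconds of reference audio, the main local prep step is trimming a clip to length. A stdlib-only sketch of that prep, assuming 48kHz WAV input (this is generic audio handling, not part of VoxCPM's own API):

```python
import wave

def trim_reference(src_path: str, dst_path: str, seconds: float = 3.0) -> float:
    """Copy the first `seconds` of a WAV file; return the trimmed duration."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()  # e.g. 48000 for VoxCPM-grade audio
        frames = min(src.getnframes(), int(rate * seconds))
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(src.getparams())
            dst.writeframes(src.readframes(frames))
    return frames / rate
```

At 48kHz, a 3-second reference clip is 144,000 frames, which is all the short-clip clone mode asks for; the higher-fidelity path would additionally want the clip's transcript.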
30 languages, 48kHz, no language tags
The official stack is unusually compact:
- VoxCPM 2 docs: 2B parameters, 48kHz output, 30 languages, March 2026 release.
- GitHub: tokenizer-free diffusion autoregressive TTS built on MiniCPM-4.
- Hugging Face: direct multilingual synthesis with no language tag needed.
- The launch thread's language post: practical examples aimed at narration, localization, and multilingual content production.
The no-language-tag detail is the sneaky good part. A lot of multilingual TTS stacks still ask users to manage language IDs or separate models. OpenBMB is selling a simpler interface: write the text in the target language and synthesize directly.
The demo page also ties VoxCPM's audio quality claims to its tokenizer-free design. The project argues that generating continuous speech representations, instead of discrete audio tokens, helps it keep more prosody and vocal detail intact.
Apache-2.0 changes the pricing math
OpenBMB released VoxCPM under Apache-2.0 on GitHub, which means commercial use is on the table without a hosted API gate. Aakash Gupta's thread pushed that point into startup-economics territory, contrasting local generation costs with ElevenLabs pricing and arguing that open voice synthesis is moving toward commodity status.
That hot take is still a hot take, but the ingredients are real. The repo says VoxCPM 2 can run with as little as 8GB of VRAM, and the installation docs list CUDA as optional rather than mandatory. That opens the door to local experimentation on much cheaper hardware than most studio-grade voice stacks imply.
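The 8GB figure passes a back-of-envelope check: 2B parameters in half precision is roughly 4GB of weights, leaving headroom for activations and audio buffers. The arithmetic below is ours, not the repo's:

```python
params = 2_000_000_000       # 2B-parameter model, per the VoxCPM 2 docs
bytes_per_param_fp16 = 2     # half-precision weights
weights_gb = params * bytes_per_param_fp16 / 1024**3

# ~3.7 GB of weights, comfortably under the quoted 8 GB VRAM floor.
print(f"fp16 weights: ~{weights_gb:.1f} GB")
```

Quantized weights would shrink that further, which is presumably how the CPU-only and Apple-silicon runtimes in the ecosystem stay practical.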
The ecosystem already looks production-minded
The most practical part of the release is how much packaging OpenBMB shipped around the core model.
According to the installation guide and model docs, VoxCPM already has:
- NanoVLLM-VoxCPM for faster streaming inference.
- VoxCPM.cpp for C++ deployment.
- VoxCPM-ONNX for portable runtime support.
- VoxCPMANE and MLX-Audio hooks for Apple-side inference.
- ComfyUI integrations and TTS WebUI support.
The GitHub release list shows 2.0.2 landed on April 8, a few days before the tweets started spreading. For a creative-tool story, that is the part worth bookmarking: this arrived less like a research teaser and more like a finished package, with model, docs, weights, runtimes, and UI adapters dropped in one shot.