AI Primer
release

Zyphra releases ZONOS2: 8B sparse MoE TTS with zero-shot voice cloning

Zyphra released ZONOS2 under Apache 2.0 with 8B total parameters, 900M active, zero-shot voice cloning, and 44.1 kHz DAC audio, alongside the ZTTS1-Eval benchmark. The release includes open weights, inference code, and eval code, so teams can run real-time multilingual TTS without a hosted-only stack.


TL;DR

  • Zyphra shipped ZONOS2 as an Apache 2.0 open-weights TTS model, and ZyphraAI's links post points to the official announcement, weights, inference repo, and eval repo.
  • According to ZyphraAI's launch thread, ZONOS2 uses an 8B sparse MoE with 900M active parameters, a rare scale point for a real-time open TTS stack.
  • ZyphraAI's thread recap says the model clones voices zero-shot, outputs 44.1 kHz audio through DAC tokens, and drops phonemizers in favor of raw UTF-8 bytes for multilingual and code-switched speech.
  • The official blog post adds two details the tweet thread only hints at: Zyphra says throughput is 4x higher than Zonos-v0.1, and the self-hosted server in the GitHub repo exposes both a native /tts/generate API and an OpenAI-style /v1/audio/speech endpoint.

You can try it in Zyphra Cloud, skim the model card for language tiers, and read the server README for the exact local launch command. The interesting bit is how much of the stack is actually here: open weights, inference code, a hosted playground, and a new benchmark repo all landed on day one.

Open release surface

This release looks more like a runnable stack than a teaser. The launch thread introduced the model, while the follow-up post linked the weights, inference code, eval code, cloud playground, and blog in one place.

For engineers evaluating self-hosting, the GitHub repo is unusually concrete:

  • Linux x86_64 only
  • NVIDIA GPU required for local inference
  • uv sync for setup
  • uv run python -m zonos2 --model-path Zyphra/ZONOS2 to start the server
  • default local endpoint at http://localhost:1919
  • browser UI plus streaming HTTP API
  • OpenAI-compatible /v1/audio/speech endpoint
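
As a rough sketch of what calling the OpenAI-compatible endpoint might look like, the snippet below builds a request against the default local address. The payload fields (model, input, voice, response_format) are assumptions borrowed from OpenAI's speech API, and the voice value is a placeholder; the server README is the authority on what ZONOS2 actually accepts.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1919"  # default local endpoint per the repo


def build_speech_request(text, voice="default", base_url=BASE_URL):
    """Build a POST request for the OpenAI-style speech endpoint.

    Field names follow OpenAI's /v1/audio/speech schema; whether ZONOS2
    honors each of them is an assumption, not something confirmed here.
    """
    payload = {
        "model": "Zyphra/ZONOS2",   # assumed model identifier
        "input": text,
        "voice": voice,             # hypothetical voice name
        "response_format": "wav",   # assumed to be supported
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav"):
    """Send the request to a running server and write the audio to disk."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the server running locally, synthesize("Hello from ZONOS2") would write the returned audio to speech.wav. Keeping request construction separate from the network call makes the payload easy to inspect or adjust once the real schema is confirmed.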

The official announcement also says Zyphra is serving ZONOS2 free for a promotional period on AMD-backed cloud infrastructure.

MoE, bytes, and DAC

Zyphra's main claim is that the usual real-time TTS tradeoff, quality versus speed, got looser here. The company describes ZONOS2 as the first open-source sparse MoE TTS release, with 8B total parameters and 900M active.

The technical choices break down cleanly:

  • Sparse MoE backbone: the blog post says the move from the earlier 1.6B model to 8B came with 4x higher real-time throughput after dropping classifier-free guidance.
  • Raw UTF-8 bytes instead of a phonemizer: ZyphraAI's thread recap says that improves lower-resource coverage, boosts Chinese, Korean, and Japanese, and allows native mid-sentence code-switching.
  • DAC token prediction: ZyphraAI's thread recap says the model generates 44.1 kHz audio through Descript Audio Codec tokens rather than a lower-fidelity codec path.
  • Zero-shot cloning: the model card and official post both tie cloning quality to an ECAPA-TDNN speaker embedding setup.
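
To make the byte-level input concrete, here is a minimal sketch of how a code-switched sentence turns into a sequence of UTF-8 byte values with no phonemizer or language-specific lexicon involved. It only illustrates the encoding step; how ZONOS2 embeds or offsets those bytes internally is not described in the source, and the helper names are hypothetical. The last lines work out the active-parameter share from the 900M and 8B figures quoted above.

```python
def text_to_byte_ids(text):
    """Map text to UTF-8 byte values (0-255), one ID per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids):
    """Invert the mapping; lossless for any valid UTF-8 input."""
    return bytes(ids).decode("utf-8")


mixed = "Hello, 世界"  # English and Chinese in one utterance
ids = text_to_byte_ids(mixed)

# ASCII characters take one byte each; each CJK character here takes three.
print(len(mixed), "characters ->", len(ids), "byte IDs")
print("round trip ok:", byte_ids_to_text(ids) == mixed)

# Share of parameters active per token, from the quoted 900M / 8B figures.
active_fraction = 900e6 / 8e9
print(f"{active_fraction:.2%} of parameters active")
```

The input vocabulary stays fixed at 256 values regardless of language, at the cost of longer sequences for non-Latin scripts.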

The Hugging Face card also publishes language support tiers, with Tier 1 covering English, Mandarin Chinese, and Japanese, and broader Tier 2 and Tier 3 coverage extending across European and Asian languages.

ZTTS1-Eval

Zyphra shipped a benchmark with the model, which is the other notable part of this launch. ZyphraAI's benchmark post says ZTTS1-Eval spans clean and in-the-wild sets across up to 17 languages, and uses Qwen3-ASR, ReDimNet, MSR-UTMOS, plus prosody metrics.

The official write-up frames that benchmark around a specific argument: many TTS evals reward speech that is unusually clean for ASR systems, even when it drifts away from the source speaker. Zyphra says ZONOS2 is tuned to hold onto speaker similarity and prosody, then exposes both a stable mode and an expressive mode to let users trade cleaner output against stricter voice fidelity.
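
The two axes in that argument can be illustrated with a small sketch: word error rate from an ASR transcript on one side, cosine similarity between speaker embeddings on the other. This is not the ZTTS1-Eval implementation; the function names and toy vectors are made up for illustration, and a real run would take transcripts from an ASR model and embeddings from a speaker-verification model.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy inputs: a transcript with one substituted word, and two made-up
# 3-dimensional "speaker embeddings" (real ones have far more dimensions).
wer = word_error_rate("the quick brown fox", "the quick brown box")
sim = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
print(f"WER={wer:.2f}  speaker similarity={sim:.3f}")
```

A system can score well on the first number while scoring poorly on the second, which is the failure mode the write-up describes.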
