Zyphra releases folded TSP with 173M tok/s on 1,024 MI300X GPUs

Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.


TL;DR

  • Zyphra says its new folded Tensor and Sequence Parallelism, or TSP, puts a weight shard and a sequence shard on the same GPU, cutting per-device memory without needing separate TP and SP axes (ZyphraAI's launch thread).
  • In Zyphra's example, a setup that would normally need 16 GPUs for 8-way TP plus 2-way SP fits into 8 GPUs with TSP, which keeps a full replica inside one node instead of spilling onto slower inter-node links (ZyphraAI's 16 GPU example and ZyphraAI's folded-axis explanation).
  • The official research post says TSP used 38.8 GB per GPU at 128K context, versus 70.0 GB for TP and 85.0 GB or 140.0 GB for the tested TP+SP layouts.
  • Zyphra's headline throughput result was 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs at 128K context and 8 GPUs per model copy (ZyphraAI's throughput claim).
  • Zyphra researcher Quentin Anthony said the design has already been extended to MoE MLPs and will be used for training and inference of larger models (QuentinAnthon15 on MoE support).

You can read Zyphra's official post, jump straight to the paper, and check the same day's Zyphra Cloud launch, where the company pitched long-context inference on MI355X for Kimi K2.6, DeepSeek V3.2, and GLM 5.1. One useful caveat sits in the paper itself: TSP saves memory by doing both weight and sequence sharding on one axis, but it pays for that with extra communication. A useful counterpoint sits in teortaxesTex's screenshot of Fig. 7, which highlights regimes where TSP can even communicate less than TP.

TP plus SP multiplies the device count

The first useful thing in the launch thread is not the benchmark, it is the topology argument. Standard TP shards weights, standard SP shards activations, and combining them the usual way consumes the product of both degrees.

According to ZyphraAI's 16 GPU example, 8-way TP plus 2-way SP needs 16 GPUs for one replica. ZyphraAI's node-topology note adds why that matters in practice: most nodes still center on 8 tightly connected GPUs, so anything larger starts talking over Ethernet or InfiniBand instead of staying on NVLink or Infinity Fabric.

The paper makes the same point more formally. Separate TP and SP axes consume T×S ranks, which means fewer ranks left for data-parallel replicas, and more chances that the model-parallel group spills outside the high-bandwidth intra-node domain.
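
To make the rank accounting concrete, here is a minimal sketch; the degrees match Zyphra's 16-GPU example and the 1,024-GPU total matches the benchmark cluster, but the replica arithmetic is an illustration, not a figure from the post.

```python
# Rank accounting for separate TP and SP axes versus a folded TSP axis.
# Degrees match Zyphra's 16-GPU example; replica counts are illustrative.

total_gpus = 1024   # e.g., 128 nodes x 8 MI300X
tp, sp = 8, 2       # separate tensor- and sequence-parallel degrees

ranks_per_replica_tp_sp = tp * sp      # 16 GPUs per model copy, spans 2 nodes
ranks_per_replica_tsp = max(tp, sp)    # 8 GPUs: one folded axis of size N

print(total_gpus // ranks_per_replica_tp_sp)  # 64 data-parallel replicas
print(total_gpus // ranks_per_replica_tsp)    # 128 replicas, each inside a node
```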

Folded axis

TSP's core move is simple enough to fit in one sentence: use one device axis of size N, where N is the larger of the TP or SP degree, and make each rank own both a weight shard and a sequence shard.

That gives TSP the memory profile of combined TP and SP without the two-dimensional layout. The official post says each GPU stores 1/D of the model state and 1/D of the activation sequence across a D-way TSP group.
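
As a way to picture the folded layout, here is a minimal sketch of what one rank would own; the array names and dimensions are hypothetical, not from Zyphra's code.

```python
# Minimal sketch of folded ownership on one rank (hypothetical layout, not
# Zyphra's actual code). Each rank in a D-way TSP group holds both a 1/D
# weight shard and a 1/D slice of the token sequence.

import numpy as np

D, rank = 8, 3                      # folded degree and this GPU's index
hidden, seq_len, batch = 4096, 131072, 1

# Weight shard: 1/D of an MLP up-projection, split on the output dimension.
w_up = np.zeros((hidden, 4 * hidden // D), dtype=np.float16)

# Sequence shard: 1/D of the activations, split on the token dimension.
tokens_per_rank = seq_len // D
x_local = np.zeros((batch, tokens_per_rank, hidden), dtype=np.float16)
```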

Quentin Anthony added one detail missing from the short thread. In QuentinAnthon15 on MoE support, he said the team has already extended the design to MoE MLPs, not just dense layers, and plans to use it for larger training and inference runs.

Attention and MLP schedules

Zyphra did not present TSP as one generic communication pattern. It uses one runtime schedule for attention and another for the MLP.

The paper and research post break it down like this:

  • Attention: one rank broadcasts packed attention weights, each GPU computes local Q, K, and V on its sequence shard, then ranks all-gather K and V so local tokens can attend over the full context.
  • Gated MLP: weight shards circulate in a ring while each GPU accumulates partial outputs locally (a toy version is sketched just after this list).
  • Why the MLP schedule matters: the ring removes the standard tensor-parallel all-reduce in the MLP, which Zyphra says is easier to overlap with GEMM compute.
  • Tradeoff: TSP reduces memory by making every GPU participate in both TP-style and SP-style communication, so the design adds communication volume even as it shrinks the replica footprint.
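
To make the ring schedule concrete, here is a toy, single-process emulation of the idea; it is a sketch under assumed sharding choices (gate and up projections split on columns, the down projection on rows), not Zyphra's implementation, and all names and dimensions are hypothetical. The attention path is omitted since its all-gather of K and V is a standard collective.

```python
# Toy emulation of a weight-circulating ring MLP. Because gate/up are
# column-split and down is row-split, each shard's contribution is a partial
# sum, so no all-reduce over activations is needed.

import numpy as np

D, hidden, inner, tokens = 4, 64, 64, 16   # inner would be 4*hidden/D in practice
rng = np.random.default_rng(0)

# Per-"rank" state: one sequence shard plus one shard of each MLP matrix.
x      = [rng.standard_normal((tokens, hidden)) for _ in range(D)]
w_gate = [rng.standard_normal((hidden, inner)) for _ in range(D)]
w_up   = [rng.standard_normal((hidden, inner)) for _ in range(D)]
w_down = [rng.standard_normal((inner, hidden)) for _ in range(D)]

def silu(z):
    return z / (1.0 + np.exp(-z))

out = [np.zeros((tokens, hidden)) for _ in range(D)]
for step in range(D):
    for rank in range(D):
        i = (rank + step) % D   # the shard visiting this rank at this step
        partial = silu(x[rank] @ w_gate[i]) * (x[rank] @ w_up[i])
        out[rank] += partial @ w_down[i]
        # In a real system each rank now passes its current shard to the next
        # rank in the ring, overlapping the transfer with the next GEMMs.
```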

That last point is the real caveat. The paper explicitly says TSP is not a replacement for every existing scheme, but another axis that becomes attractive when memory pressure or topology would otherwise force groups over slow links.

Memory at long context

The clearest result is the memory curve. TSP stays close to TP when context is short and weights dominate, then separates as activation memory takes over.

The official post gives the concrete 128K numbers:

  • TSP: 38.8 GB per GPU
  • TP: 70.0 GB per GPU
  • TP+SP layouts tested by Zyphra: 85.0 GB and 140.0 GB per GPU

At 16K, Zyphra says TSP and TP were nearly the same, 31.0 GB versus 31.5 GB. That is the expected regime where weight memory dominates and SP has less to contribute.
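
A back-of-envelope model reproduces that shape; the inputs below are illustrative stand-ins, not Zyphra's measurements, and the point is only that TP and TSP converge when weights dominate and diverge when activations do.

```python
# Illustrative memory model: TP shards weights but replicates activations,
# while TSP shards both across the same D-way group. Inputs are made up.

def per_gpu_gb(weights_gb, act_gb_at_128k, seq_len, shard_acts, d=8):
    act_gb = act_gb_at_128k * seq_len / 131_072   # activations scale with context
    return weights_gb / d + (act_gb / d if shard_acts else act_gb)

for seq in (16_384, 131_072):
    tp  = per_gpu_gb(200, 40, seq, shard_acts=False)
    tsp = per_gpu_gb(200, 40, seq, shard_acts=True)
    print(f"{seq // 1024}K: TP ~{tp:.1f} GB, TSP ~{tsp:.1f} GB")
# 16K:  TP ~30.0 GB, TSP ~25.6 GB   (close: weights dominate)
# 128K: TP ~65.0 GB, TSP ~30.0 GB   (separated: activations dominate)
```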

A good extra detail comes from teortaxesTex's Fig. 7 post, which screenshots the paper's communication heatmap. The break-even boundary in that figure is BS > 8h, and the green cells mark regimes where TSP's forward communication is lower than TP's, not just lower than TP+SP's.
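
A hedged way to read that boundary (my gloss, not the paper's derivation): TP's per-layer all-reduce moves activation-sized tensors, roughly B·S·h elements, while a weight-circulating schedule moves the MLP weights themselves, roughly 8h² elements for a 4h-wide non-gated MLP, so the weight traffic is smaller exactly when B·S > 8h.

```python
# Hedged reading of the BS > 8h boundary (an interpretation, not the paper's
# derivation): compare activation traffic ~B*S*h against weight traffic ~8*h*h.

def tsp_ring_cheaper(batch, seq_len, hidden):
    return batch * seq_len * hidden > 8 * hidden * hidden  # i.e., B*S > 8*h

print(tsp_ring_cheaper(1, 131_072, 8_192))   # True: long context favors the ring
print(tsp_ring_cheaper(1, 16_384, 8_192))    # False: short context favors TP
```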

Throughput on 1,024 MI300X GPUs

The benchmark Zyphra led with is a scale result, not a single-node microbenchmark. On 128 nodes, or 1,024 MI300X GPUs, the company says TSP beat matched TP+SP baselines across 32K to 128K contexts, with the gap widening at higher folded degrees.

The headline number from ZyphraAI's throughput claim is 173M tok/s for TSP versus 86M tok/s for matched TP+SP at 128K context and D=8. The official post describes that as more than a 2x speedup.

Two boundaries matter here:

  • Zyphra compares against matched TP+SP baselines, not every possible mix of TP, SP, PP, and EP.
  • The result sits on MI300X hardware, with the whole pitch built around keeping the folded group inside one high-bandwidth node.

That is why the post repeatedly frames TSP as hardware-aware parallelism rather than a universal winner.

Zyphra Cloud on MI355X

Zyphra paired the research drop with a product launch. A few hours later, it announced Zyphra Cloud and Zyphra Inference, an AMD-first serverless inference stack for long-horizon agentic workloads on MI355X via TensorWave (ZyphraAI's Zyphra Cloud launch).

The launch thread says Zyphra Inference starts with three open-weight models (ZyphraAI's model list):

  • DeepSeek V3.2, with V4 coming soon
  • Kimi K2.6
  • GLM 5.1

The same thread makes the hardware case in concrete numbers. ZyphraAI's MI355X specs claim says MI355X brings 288 GB of HBM3E and 8 TB/s of bandwidth per GPU, while ZyphraAI's Kimi residency example claims an 8x MI355X node can keep about 184 Kimi K2.6 sessions resident at 256K context, versus about 100 on 8x B200.
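
The residency claim is ultimately KV-cache arithmetic. A hedged sketch of that arithmetic follows; only the 288 GB per-GPU capacity comes from Zyphra's thread, and the weight and per-session figures are made-up placeholders.

```python
# Rough session-residency arithmetic behind claims like "~184 sessions per
# 8x MI355X node." Only the 288 GB/GPU capacity is from Zyphra's thread;
# everything else below is a hypothetical placeholder.

node_hbm_gb = 8 * 288          # 8x MI355X at 288 GB HBM3E each
weights_gb = 600               # hypothetical resident weights + runtime overhead
kv_gb_per_session = 9.0        # hypothetical KV cache for one 256K-token session

sessions = (node_hbm_gb - weights_gb) / kv_gb_per_session
print(int(sessions))           # ~189 with these made-up inputs
```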

Quentin Anthony framed the product side as a way to ship Zyphra's AMD systems work faster than its internal model training cycle (QuentinAnthon15 on productizing systems work). In QuentinAnthon15 on market demand, he tied that directly to multi-turn agent workloads, where long contexts and heavy KV or prefix cache pressure make HBM headroom a first-order constraint.
