Zyphra Inference launches MI355X endpoints for DeepSeek V3.2, Kimi K2.6, and GLM 5.1
Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.

TL;DR
- Zyphra launched Zyphra Inference, a serverless inference service for long-horizon agent workloads and the first public piece of Zyphra Cloud, according to ZyphraAI's launch post.
- Day-one model support covers three open-weight endpoints: DeepSeek V3.2, Kimi K2.6, and GLM 5.1, with "V4 soon" attached to DeepSeek in ZyphraAI's model list.
- Zyphra's pitch is mostly a memory story: ZyphraAI's MI355X hardware note says the card offers 288 GB of HBM3E and 8 TB/s of bandwidth, and ZyphraAI's Kimi K2.6 example claims that translates into about 184 concurrent 256K-context agents on 8x MI355X versus about 100 on 8x B200.
- The company says the service is built on years of AMD-specific systems work, including Tree Attention, custom MoE and attention kernels, and new parallelism schemes, per ZyphraAI's AMD research thread.
- Zyphra is framing inference as the first layer of a broader stack that also includes agent infrastructure, environments, and compute, according to the main launch post and the Zyphra Cloud overview.
You can read the official product post, skim the cloud overview, and see Zyphra's own parallelism thread, which landed a couple of hours before the launch. The odd bit is how explicitly the pitch centers on KV cache pressure and resident sessions instead of generic tokens per second. PR copy also makes clear this is supposed to be the first piece of a larger AMD-first platform, not a one-off endpoint drop.
What shipped
The launch is narrow and pretty clear. Zyphra Cloud is the umbrella, and Zyphra Inference is the thing you can use today.
What is live now:
- Serverless inference for open-weight models, per the launch post
- AMD MI355X-backed deployment on TensorWave infrastructure, per the launch post and the press release
- Three launch models: DeepSeek V3.2, Kimi K2.6, GLM 5.1, per ZyphraAI's model list
- A near-term model roadmap note, "DeepSeek V4 soon," in the same model list
The official post describes the service as AMD-first and aimed at developers, enterprises, and frontier AI hyperscalers building long-context agent systems rather than short request-response apps.
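Zyphra has not published API details in the launch post, so any call pattern here is speculative. Purely as a hypothetical sketch, assuming the service follows the OpenAI-compatible chat-completions convention most serverless open-weight hosts have settled on; the base URL, model slug, and bearer-token auth below are all invented placeholders:

```python
# Hypothetical sketch only: the launch post does not document Zyphra's API
# surface. The base URL, model slug, and auth scheme are assumptions,
# modeled on the OpenAI-compatible chat-completions convention.
import os
import requests

BASE_URL = "https://api.zyphra.example/v1"   # placeholder, not a real endpoint
API_KEY = os.environ["ZYPHRA_API_KEY"]       # assumed bearer-token auth

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "deepseek-v3.2",            # assumed slug for a launch model
        "messages": [
            {"role": "user", "content": "Summarize this repo's open issues."}
        ],
        "max_tokens": 512,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```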
MI355X memory economics
Zyphra's argument is that long-horizon agents are constrained by memory: large KV caches and prefix-cache reuse eat HBM headroom long before raw compute becomes the bottleneck. That is the same framing Quentin Anthony uses in QuentinAnthon15's market explanation, where he says many-turn agentic runs create heavy KV and prefix-cache pressure.
The concrete hardware claims:
- MI355X: 288 GB HBM3E per GPU and 8 TB/s of bandwidth, per ZyphraAI's hardware note
- Kimi K2.6 replica size: about 545 GB at INT4, per ZyphraAI's Kimi example
- At 256K context, 8x B200 supports about 100 agents, while 8x MI355X supports about 184, per the same example
Anthony's thread adds the market-level framing: more HBM means fewer users pushed into queueing, more cache hits on many-turn workloads, and larger batch sizes, according to QuentinAnthon15's product note and his follow-up.
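Zyphra does not show the arithmetic behind those concurrency numbers, but the published figures are enough to reconstruct them to within rounding. A minimal back-of-envelope sketch, assuming B200's standard 192 GB of HBM3E (an industry spec, not a number from Zyphra's posts) and that everything left after model weights goes to KV cache:

```python
# Back-of-envelope reconstruction of ZyphraAI's concurrency example.
# Assumptions: 192 GB HBM3E per B200 (industry spec, not in Zyphra's posts),
# one model replica per 8-GPU node, and all memory left after weights
# available for KV cache. Real servers reserve headroom, so treat these
# as ceilings, not serving numbers.
WEIGHTS_GB = 545          # Kimi K2.6 replica at INT4, per ZyphraAI
HBM_GB = {"MI355X": 288, "B200": 192}
GPUS_PER_NODE = 8

# Zyphra's claim of ~184 agents on 8x MI355X implies this much KV cache
# per 256K-context session:
kv_budget_mi355x = GPUS_PER_NODE * HBM_GB["MI355X"] - WEIGHTS_GB  # 1759 GB
kv_per_agent_gb = kv_budget_mi355x / 184                          # ~9.6 GB

for gpu, hbm in HBM_GB.items():
    kv_budget = GPUS_PER_NODE * hbm - WEIGHTS_GB
    agents = kv_budget / kv_per_agent_gb
    print(f"8x {gpu}: ~{agents:.0f} concurrent 256K-context agents")

# 8x MI355X: ~184  (by construction)
# 8x B200:   ~104  -- close to ZyphraAI's "about 100"
```

The shared quantity is the implied KV footprint of roughly 9.6 GB per 256K-token session; plug in the smaller B200 memory budget and Zyphra's "about 100" falls out.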
AMD-first systems work
Zyphra is selling more than raw hardware access. The company says the service packages up years of systems work it already did while training large models on AMD.
The pieces Zyphra names explicitly:
- Tree Attention for AMD's point-to-point intra-node fabric, per ZyphraAI's AMD research thread (the combine rule it builds on is sketched at the end of this section)
- Novel parallelism schemes for AMD topologies, per the same thread
- Optimized MoE and attention kernels, per the same thread
- Long-context inference algorithms and custom kernels, per ZyphraAI's model list
Anthony says the shift from training to inference exposed new serving bottlenecks, and that some of the fixes have already shipped while others are still in progress, according to QuentinAnthon15's serving-regimes thread.
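Tree Attention's details are in Zyphra's paper, but the primitive that makes any tree-shaped scheme possible is that exact attention over a KV cache sharded across devices reduces to an associative merge of per-shard (output, logsumexp) partials, so the merge can run as a log-depth tree reduction over the interconnect instead of a serial pass. A single-process NumPy sketch of that merge rule, with no claim that it mirrors Zyphra's actual kernels:

```python
# Sketch of the associative combine that tree-shaped attention schemes
# exploit. Each "device" computes attention against its own KV shard plus
# a logsumexp; pairs of partials merge exactly, in any order, so 8 shards
# need only log2(8) = 3 merge rounds. Single-process NumPy stand-in.
import numpy as np

def partial_attention(q, k, v):
    """Attention of q against one KV shard, plus the shard's logsumexp."""
    s = q @ k.T / np.sqrt(q.shape[-1])        # (nq, nk) scores
    lse = np.log(np.exp(s).sum(-1))           # (nq,) logsumexp per query
    out = np.exp(s - lse[:, None]) @ v        # locally normalized output
    return out, lse

def merge(a, b):
    """Exact, associative merge of two (output, logsumexp) partials."""
    out_a, lse_a = a
    out_b, lse_b = b
    lse = np.logaddexp(lse_a, lse_b)
    w_a = np.exp(lse_a - lse)[:, None]        # reweight by each shard's
    w_b = np.exp(lse_b - lse)[:, None]        # share of the softmax mass
    return w_a * out_a + w_b * out_b, lse

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))

# "Shard" the KV cache across 8 fake devices and tree-merge the partials.
partials = [partial_attention(q, ks, vs)
            for ks, vs in zip(np.split(k, 8), np.split(v, 8))]
while len(partials) > 1:
    partials = [merge(partials[i], partials[i + 1])
                for i in range(0, len(partials), 2)]

reference, _ = partial_attention(q, k, v)     # unsharded baseline
assert np.allclose(partials[0][0], reference)
```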
The rest of Zyphra Cloud
The launch post is about inference, but the cloud page is already broader. The Zyphra Cloud overview lists four product blocks: Agent (MAIA), Agent Environments, Inference, and Compute.
The page also lists features that the launch thread has not yet unpacked:
- Multiplayer general agent for teams
- Search, chat, and agentic workflows
- Model and tool orchestration
- CPU-based agent environments
- Distributed training and RL
- Large-scale simulation environments
- GPU clusters, bare metal orchestration, and dedicated capacity
That makes the interesting part of this announcement less "new inference API" than "first public endpoint of a fuller agent stack." The official blog says inference is only the first step toward a unified platform for open, sovereign AI at scale, and the product page already shows the next layers.