OpenAI open-sources Multipath Reliable Connection for 100,000-plus GPU training clusters
OpenAI and partners released Multipath Reliable Connection, an RDMA transport that spreads training traffic across multiple network paths and is already deployed on the company's largest clusters. The protocol targets congestion and failure recovery in giant GPU training runs, and teams building similar clusters should track the Open Compute Project release.

TL;DR
- OpenAI's deployment thread says Multipath Reliable Connection, or MRC, is already running on OpenAI's largest training supercomputers, including its Oracle Cloud site in Abilene and Microsoft's Fairwater systems.
- In TheRealAdamG's screenshot of OpenAI's post, OpenAI describes MRC as a new protocol built into 800 Gb/s NICs that can spread one transfer across hundreds of paths and route around failures in microseconds.
- OpenAI's launch thread and sk7037's post both frame the payoff the same way: faster training, better failure tolerance, and less wasted GPU time on giant clusters.
- According to ai_for_success's summary, OpenAI is pitching MRC for 100,000-plus GPU clusters with only two switch tiers, while kimmonismus's thread argues the bigger subtext is Ethernet taking aim at territory long dominated by InfiniBand.
You can read OpenAI's announcement, skim NVIDIA's Spectrum-X writeup, and even jump to the OpenAI Podcast episode where Mark Handley and Greg Poynting talk through why training networks now fail on synchronization, not just raw bandwidth.
MRC
OpenAI released MRC with AMD, Broadcom, Intel, Microsoft, and NVIDIA, then handed it to the Open Compute Project as an open protocol. The company says it is already in production on its biggest frontier-model training clusters.
The core change is simple: one RDMA connection no longer has to ride one path. In TheRealAdamG's screenshot of OpenAI's post, OpenAI says MRC extends RoCE with SRv6-based source routing so a single transfer can use hundreds of paths, dodge failures in microseconds, and keep the network control plane simpler.
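To make that concrete, here is a minimal sketch of the idea that one transfer no longer rides one path. All names are illustrative assumptions, not MRC's actual wire format or API: each packet of a logical transfer carries its own source route (standing in for an SRv6 segment list), so the sender can spread a single connection across many paths.

```python
# Hypothetical sketch of MRC-style multipath spraying. The "paths" here
# stand in for SRv6 segment lists through the fabric; a real deployment
# would expose hundreds of them, per OpenAI's description.

NUM_PATHS = 8

def build_paths(n):
    # Illustrative stand-ins for per-packet source routes.
    return [f"srv6-segment-list-{i}" for i in range(n)]

def spray(payload, packet_size, paths):
    """Split one transfer into packets, each assigned its own path."""
    packets = []
    for seq, offset in enumerate(range(0, len(payload), packet_size)):
        chunk = payload[offset:offset + packet_size]
        # Round-robin for simplicity; adaptive spraying would also
        # weigh per-path congestion signals.
        path = paths[seq % len(paths)]
        packets.append({"seq": seq, "path": path, "data": chunk})
    return packets

paths = build_paths(NUM_PATHS)
packets = spray(b"x" * 4096, 512, paths)
used = {p["path"] for p in packets}
print(len(packets), "packets over", len(used), "paths")
```

The point of the sketch is the invariant: because routing state travels with each packet rather than living in a per-flow table, the receiver reassembles by sequence number and does not care which path any packet took.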
Packet spraying
MRC is aimed at collective communication, the ugly part of training where huge numbers of GPUs have to exchange updates in lockstep. LLMpsycho's reply reduces the pitch to one sentence: it targets the collective bottleneck that makes large clusters inefficient.
The mechanics that surfaced across the evidence are more specific:
- Adaptive packet spraying spreads traffic across many paths instead of pinning each flow to one route, per ai_for_success's summary.
- SRv6 static source routing lets the fabric bypass failed links or switches in microseconds, according to the same summary.
- Hardware-level multipath routing means a bad path does not stall the whole GPU connection, as kimmonismus's thread and rohanpaul_ai's explanation both describe it.
- OpenAI says the design also supports simpler control planes, in TheRealAdamG's screenshot of OpenAI's post.
That combination matters because training slowdowns often come from tail latency and partial failures, not from average throughput alone.
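The failure-bypass bullet above can be sketched in a few lines. This is an assumption-laden illustration, not MRC's real interface: the key property is that when every packet carries its own route, "routing around a failure" is just the sender ceasing to pick a path it has marked unhealthy, with no per-connection stall.

```python
# Hypothetical sketch of hardware-level path bypass: a dead path is
# dropped from the candidate set, and remaining traffic keeps flowing.
# Class and method names are illustrative.

class PathSet:
    def __init__(self, paths):
        self.healthy = set(paths)

    def mark_failed(self, path):
        # In the NIC this happens in microseconds, per OpenAI's post.
        self.healthy.discard(path)

    def pick(self, seq):
        # Deterministic spread over whatever paths remain healthy.
        alive = sorted(self.healthy)
        if not alive:
            raise RuntimeError("no healthy paths left")
        return alive[seq % len(alive)]

ps = PathSet([f"path-{i}" for i in range(4)])
ps.mark_failed("path-2")          # one link or switch goes dark
routes = [ps.pick(seq) for seq in range(6)]
print(routes)                     # path-2 never appears
```

Contrast this with a flow pinned to one route by ECMP hashing, where the same failure stalls the whole GPU connection until the control plane reconverges.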
100,000 GPUs
The most concrete scale claim in the evidence is OpenAI's push for 100,000-plus GPU clusters. ai_for_success's summary says MRC is designed to support that size with only two tiers of switches, which would cut hardware complexity and power alongside congestion.
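A back-of-the-envelope check shows why the two-tier claim implies very high-radix switches. The formula below is standard leaf-spine arithmetic, not anything from OpenAI's post: in a non-blocking two-tier fabric built from radix-R switches, each leaf splits its ports half down to endpoints and half up to spines, capping the fabric at R²/2 endpoints.

```python
# Generic non-blocking leaf-spine capacity, assuming radix-R switches
# with a 1:1 split between downlinks and uplinks per leaf.

def two_tier_max_endpoints(radix: int) -> int:
    leaves = radix            # each spine port connects to one leaf
    down_ports = radix // 2   # per-leaf ports facing endpoints
    return leaves * down_ports

for radix in (64, 128, 256, 512):
    print(radix, "->", two_tier_max_endpoints(radix))
```

Radix 512 yields 131,072 endpoints, the regime where a 100,000-plus GPU cluster fits in two switch tiers; at radix 64 the same topology tops out near 2,048, which is why flat scaling at this size is notable.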
OpenAI also named real deployments instead of leaving this as a lab protocol. OpenAI's deployment thread places MRC on its Oracle Cloud Infrastructure site in Abilene, Texas, and on Microsoft's Fairwater supercomputers. In sk7037's post, the OpenAI executive adds that the project started years earlier with Intel, and positions it as part of a broader effort to improve compute systems across the stack, not just model quality.
The companion OpenAI Podcast episode reinforces the pitch: giant training systems need to stay synchronized across record chip counts, and networking failures now burn enough accelerator time to deserve their own protocol work.
Spectrum-X
The most interesting caveat comes from outside OpenAI's own copy. NVIDIA's Spectrum-X post and the commentary around it say MRC was first proven and optimized on Spectrum-X Ethernet hardware, which makes the release look like both an open standard push and a platform move.
That is the strategic wrinkle in the launch:
- OpenAI and partners published MRC through OCP, per OpenAI's deployment thread.
- NVIDIA says the protocol was optimized first for Spectrum-X, as summarized by rohanpaul_ai's explanation.
- kimmonismus's thread argues that this gives Ethernet a stronger shot at large AI fabrics while still reinforcing NVIDIA's full-stack position underneath.
OpenAI's official framing is cluster efficiency and resilience. The extra reveal is that the networking stack itself is now product surface, and the companies selling AI infrastructure want the transport layer to be part of that battle.