
Google DeepMind releases Decoupled DiLoCo with 12B Gemma training across 4 US regions

Google DeepMind introduced Decoupled DiLoCo, a distributed-training method that trained a 12B Gemma model across four US regions and mixed TPU6e/v5p hardware while tolerating failures. It matters because it targets the networking and uptime bottlenecks that make frontier training geographically rigid and operationally fragile.


TL;DR

  • In its launch thread, Google DeepMind introduces Decoupled DiLoCo as a way to keep large training runs going even when a chip or unit fails, instead of requiring near-perfect lockstep across identical hardware.
  • In its technical thread, the team says it trained a 12B Gemma model across four US regions over low-bandwidth links, which is the concrete scale claim attached to this release.
  • The same post says the system mixed TPU6e and TPUv5p hardware generations without a training slowdown, a direct shot at the usual same-cluster, same-chip assumption.
  • According to the failure-recovery thread, Decoupled DiLoCo also isolates artificial hardware failures injected during training and reintegrates offline units when they return.

You can read Google DeepMind's official blog post and jump from the launch thread to the technical details. The core idea: one learner can keep stepping while another is stalled.

Four-region Gemma run

The headline result is specific, not hand-wavy: Google DeepMind says it trained a 12B Gemma model across four US regions using low-bandwidth networks. That makes this less about a generic distributed-systems paper and more about whether frontier-style training can stop treating geography as a hard constraint.

The company frames the bottleneck in its opening thread as synchronization itself. Standard large-scale training wants identical chips moving in near lockstep, so one failure can stall the whole run.

Self-healing training

Google DeepMind says it injected artificial hardware failures during training, then kept the run operating while the system isolated the disruption and later brought offline units back in. The interesting part is not just fault tolerance, but continued training during the outage.

That claim lines up with the accompanying architecture sketch, in which one learner unit stops while another continues exchanging updates through a syncer layer.
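The learner/syncer pattern described above can be sketched in a few lines. This is a toy reconstruction, not DeepMind's implementation; the class and function names are hypothetical. The point it illustrates is that the syncer averages updates only across learners that are currently alive, so one stalled unit does not block the run, and a recovered unit is reintegrated at the next sync.

```python
# Toy sketch of a learner/syncer loop (hypothetical names, not
# DeepMind's code). Each learner takes local steps; the syncer
# averages parameters over live learners only, so a stalled
# learner neither blocks nor corrupts the others.

class Learner:
    def __init__(self, name, param=0.0):
        self.name = name
        self.param = param
        self.alive = True

    def local_step(self):
        # Stand-in for inner optimizer steps on local data.
        if self.alive:
            self.param += 1.0

def syncer_round(learners):
    """Average parameters across live learners; stalled ones are skipped."""
    live = [l for l in learners if l.alive]
    if not live:
        return
    avg = sum(l.param for l in live) / len(live)
    for l in live:
        l.param = avg

learners = [Learner("a"), Learner("b"), Learner("c")]
for step in range(6):
    if step == 2:
        learners[1].alive = False      # inject a failure: learner b stalls
    if step == 4:
        learners[1].alive = True       # b recovers...
        learners[1].param = learners[0].param  # ...and catches up from a peer
    for l in learners:
        l.local_step()
    syncer_round(learners)
```

After the loop, all three learners converge back to the same parameters even though learner `b` was offline for two sync rounds, which is the "continued training during the outage" behavior the post emphasizes.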

Mixed TPU generations

The other notable claim is hardware heterogeneity. In the same thread, Google DeepMind says Decoupled DiLoCo mixed TPU6e and TPUv5p without slowing training performance.

If that holds beyond this experiment, it chips away at a second rigidity in large training runs: not just where the hardware sits, but whether the fleet has to be perfectly uniform before training can start.

Pathways and DiLoCo lineage

Google DeepMind says Decoupled DiLoCo combines two earlier ingredients:

  • Pathways, which lets different chips share data and work at their own pace, according to Google DeepMind's thread
  • DiLoCo, which cuts the bandwidth needed across distributed data centers, again per the thread
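The DiLoCo ingredient above has a simple shape that is worth making concrete. The sketch below is my reconstruction under stated assumptions (toy gradients, hypothetical function names), not DeepMind's code: each worker runs H local inner steps, and only the parameter drift ("pseudo-gradient") crosses the slow link, once per round instead of once per step.

```python
# Minimal DiLoCo-style two-level optimization sketch (reconstruction,
# hypothetical names). Workers run H inner steps between syncs, so
# cross-site communication happens once per H steps, not every step.

def inner_steps(param, grads):
    # Local SGD on a worker's own data shard (toy gradients).
    lr = 0.1
    for g in grads:
        param -= lr * g
    return param

def outer_sync(global_param, worker_params):
    """Average each worker's parameter drift (the pseudo-gradient) --
    the only quantity sent over the low-bandwidth link."""
    deltas = [global_param - p for p in worker_params]
    outer_lr = 1.0
    return global_param - outer_lr * sum(deltas) / len(deltas)

H = 8                       # inner steps per communication round
global_param = 10.0
for _round in range(3):
    # Each worker starts the round from the shared global parameters.
    worker_params = [
        inner_steps(global_param, [1.0] * H),   # worker 0's toy gradients
        inner_steps(global_param, [0.5] * H),   # worker 1's toy gradients
    ]
    global_param = outer_sync(global_param, worker_params)
```

The design choice to communicate pseudo-gradients every H steps, rather than raw gradients every step, is what makes the low-bandwidth, multi-region claim plausible.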

That lineage explains why the launch emphasizes both resilience and low-bandwidth operation. As osanseviero's reaction put it, the attraction here is global distributed training that does not depend on fat interconnects between every site.
