AI Primer
TOPIC · 30 stories

GPU Infrastructure

GPU allocation, infra topology, and compute platform design.

NEWS · 12th May
Perplexity benchmarks Qwen3 235B on GB200 NVL72: NVLS latency drops from 586 µs to 313 µs

Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.
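Numbers like these come from collective microbenchmarks. Below is a minimal sketch of how one might measure all-reduce latency with torch.distributed on the NCCL backend; the tensor size, iteration counts, and launch command are illustrative assumptions, not Perplexity's harness:

```python
import os
import time
import torch
import torch.distributed as dist

def benchmark_all_reduce(numel=1 << 20, warmup=20, iters=100):
    """Mean NCCL all-reduce latency in microseconds for a bf16 tensor."""
    x = torch.ones(numel, dtype=torch.bfloat16, device="cuda")
    for _ in range(warmup):          # warm up NCCL channels and algorithm selection
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()         # block until all queued collectives finish
    return (time.perf_counter() - start) / iters * 1e6

if __name__ == "__main__":
    # launch: torchrun --nproc-per-node=8 bench_allreduce.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    us = benchmark_all_reduce()
    if dist.get_rank() == 0:
        print(f"mean all-reduce latency: {us:.1f} µs")
    dist.destroy_process_group()
```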

NEWS · 1w ago
Anthropic doubles Claude Code 5-hour limits after SpaceX Colossus 1 compute deal

Anthropic said a SpaceX compute deal will add 300+ MW and 220,000+ NVIDIA GPUs, and it doubled Claude Code 5-hour limits across paid plans. It also raised Opus API ceilings; users should still watch the unchanged weekly caps.

RELEASE · 1w ago
Zyphra releases ZAYA1-8B with <1B active params and Markovian RSA reasoning

Zyphra released ZAYA1-8B, an Apache-2.0 reasoning MoE with compressed-convolutional attention and bounded-context Markovian RSA test-time compute. The model targets math and coding workloads while keeping the active parameter count below 1B.

RELEASE · 1w ago
OpenAI opens Multipath Reliable Connection for 100,000-plus GPU training clusters

OpenAI and partners released Multipath Reliable Connection, an RDMA transport that spreads training traffic across multiple network paths and is already deployed on the company's largest clusters. The protocol targets congestion and failure recovery in very large GPU training runs, and teams building similar clusters should track the Open Compute Project release.
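The announcement doesn't include wire-level details, but the core multipath idea, spraying sequenced chunks across several paths and resequencing at the receiver, can be shown with a toy model. Everything below (path count, chunk size, the shuffle standing in for cross-path reordering) is an illustrative simplification, not the actual protocol:

```python
import random

def spray(message: bytes, num_paths: int = 4, chunk: int = 1024):
    """Split a message into sequenced chunks spread round-robin across paths."""
    chunks = [message[i:i + chunk] for i in range(0, len(message), chunk)]
    paths = [[] for _ in range(num_paths)]
    for seq, payload in enumerate(chunks):
        paths[seq % num_paths].append((seq, payload))
    return paths

def deliver(paths):
    """Model the network: paths have different latencies, so arrivals interleave."""
    arrivals = [pkt for path in paths for pkt in path]
    random.shuffle(arrivals)               # cross-path reordering
    return arrivals

def reassemble(arrivals):
    """Receiver resequences by chunk number, independent of arrival order."""
    return b"".join(payload for _, payload in sorted(arrivals))

msg = bytes(range(256)) * 64               # 16 KiB test message
assert reassemble(deliver(spray(msg))) == msg
```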

RELEASE · 1w ago
Zyphra releases folded TSP with 173M tok/s on 1,024 MI300X GPUs

Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.
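Zyphra's full design isn't summarized here, but the claimed benefit of keeping parallel groups inside a node is easy to quantify in the abstract. A back-of-envelope sketch using the standard ring all-reduce cost of 2(p-1)/p of the message per link, with all sizes illustrative:

```python
def ring_all_reduce_per_link_bytes(msg_bytes: float, p: int) -> float:
    """Bytes crossing each ring link during an all-reduce: 2*(p-1)/p of the message."""
    return 2 * (p - 1) / p * msg_bytes

GPUS_PER_NODE = 8
msg = 256e6                     # 256 MB gradient/activation message, illustrative

# 32-way group laid out across 4 nodes: 4 of the 32 ring links are inter-node.
p = 32
inter_node_links = p // GPUS_PER_NODE
cross_node = inter_node_links * ring_all_reduce_per_link_bytes(msg, p)

print(f"4-node ring: {cross_node/1e9:.2f} GB crosses node boundaries per all-reduce")
# The same all-reduce folded into an 8-way in-node group stays entirely on the
# intra-node fabric (NVLink/xGMI): zero bytes on the cross-node network.
print("folded 8-way in-node ring: 0 GB crosses node boundaries")
```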

RELEASE · 1w ago
Zyphra Inference launches MI355X endpoints for DeepSeek V3.2, Kimi K2.6, and GLM 5.1

Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.

RELEASE · 1w ago
Moondream releases Photon 1.2.0 with Apple Silicon, native Windows CUDA, and 23 ms B200 latency

Moondream shipped Photon 1.2.0, expanding its inference engine to Apple Silicon, Windows CUDA, Blackwell, and Jetson Thor, then outlined how custom Metal kernels and fused ops made local vision practical without MLX. That broadens deployment options for edge and on-device vision workloads while keeping server-class latency on B200 systems.

RELEASE · 2w ago
FlashQLA releases TileLang linear-attention kernels with 2–3x forward speedups

Alibaba Qwen introduced FlashQLA, a TileLang-based linear-attention kernel stack that reports 2–3x faster forward passes and 2x faster backward passes. The release gives edge and long-context deployments a new optimization lever below the model layer itself.
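FlashQLA's kernels themselves aren't reproduced in the release, but the linear-attention math they accelerate is standard: replace the quadratic softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), which is linear in sequence length. A minimal non-causal PyTorch reference using the common elu+1 feature map; FlashQLA's actual formulation may differ:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(n) attention: phi(Q) @ (phi(K)^T @ V), with phi = elu + 1.

    q, k: (batch, heads, seq, d_k); v: (batch, heads, seq, d_v).
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # d_k x d_v summary, O(n * d^2)
    z = 1 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

q = torch.randn(2, 8, 1024, 64)
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 8, 1024, 64])
```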

NEWS · 2w ago
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.
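ShadowRadix specifics beyond the name aren't in the writeup, but SGLang-style radix caching follows a known pattern: index cached KV state by token prefix so requests sharing a prompt prefix skip recomputation. A toy longest-prefix-match sketch (the real structure holds GPU KV pages and compresses token runs rather than storing one node per token):

```python
class RadixNode:
    def __init__(self):
        self.children = {}          # token id -> RadixNode
        self.kv_handle = None       # stand-in for cached KV pages

class PrefixCache:
    """Toy radix-style prefix cache: longest-prefix match over token IDs."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens, kv_handle):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_handle) for the longest cached prefix."""
        node, best = self.root, (0, None)
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.kv_handle is not None:
                best = (i + 1, node.kv_handle)
        return best

cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_handle="kv-A")       # shared system prompt
print(cache.longest_prefix([1, 2, 3, 4, 9, 9]))    # (4, 'kv-A'): reuse 4 tokens
```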

RELEASE · 3w ago
Google DeepMind releases Decoupled DiLoCo with 12B Gemma training across 4 US regions

Google DeepMind introduced Decoupled DiLoCo, a distributed-training method that trained a 12B Gemma model across four US regions and mixed TPU v6e/v5p hardware while tolerating failures. It matters because it targets the networking and uptime bottlenecks that make frontier training geographically rigid and operationally fragile.
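The announcement doesn't spell out what "Decoupled" changes, but the original DiLoCo recipe it builds on is published: each site runs many local optimizer steps, then sites exchange parameter deltas as pseudo-gradients for an outer Nesterov-momentum step, so cross-region traffic happens once per round. A self-contained sketch of that loop with illustrative hyperparameters:

```python
import copy
import torch
import torch.nn as nn

def diloco_round(global_model, data_shards, inner_steps=50):
    """One DiLoCo round: independent local training per region, one sync at the end."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    deltas = []
    for shard in data_shards:                         # each shard = one region
        local = copy.deepcopy(global_model)
        opt = torch.optim.AdamW(local.parameters(), lr=1e-3)
        for x, y in shard[:inner_steps]:              # H local steps, no cross-region traffic
            loss = nn.functional.mse_loss(local(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        deltas.append([g - p.detach() for g, p in
                       zip(global_params, local.parameters())])
    # Average the per-region parameter deltas into one pseudo-gradient per tensor.
    return [torch.stack(ds).mean(0) for ds in zip(*deltas)]

model = nn.Linear(16, 1)
outer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
shards = [[(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(50)]
          for _ in range(4)]

for _ in range(3):
    pseudo_grad = diloco_round(model, shards)
    outer.zero_grad()
    for p, g in zip(model.parameters(), pseudo_grad):
        p.grad = g                                    # outer optimizer consumes the delta
    outer.step()
```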

RELEASE · 3w ago
DeepSeek releases Tile Kernels with Engram, mHC, and FP4/FP8 ops for SM90 and SM100 GPUs

DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.

NEWS · 3w ago
Google launches TPU 8t and TPU 8i with 3x pod compute and 1,152-chip inference pods

Google unveiled eighth-generation TPUs split into TPU 8t for training and TPU 8i for inference, saying 8t delivers nearly 3x per-pod compute over Ironwood while 8i links 1,152 chips in a pod. Google is tuning its hardware stack for larger training runs and lower-latency agent inference at cloud scale.

RELEASE · 4w ago
Hugging Face Hub launches Kernels with 1.7x-2.5x PyTorch speedups

Hugging Face introduced Kernels on the Hub to publish pre-compiled GPU kernels matched to GPU, PyTorch version, and OS. The packaging makes kernel optimizations shareable and claims 1.7x to 2.5x speedups over PyTorch baselines with torch.compile compatibility.
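Usage follows Hugging Face's published example for the kernels package; the repo name and function below come from their docs, but treat exact signatures as something to verify on the Hub:

```python
# pip install kernels
import torch
from kernels import get_kernel

# Downloads a pre-compiled kernel build matched to your GPU, PyTorch, and OS.
activation = get_kernel("kernels-community/activation")

x = torch.randn(4, 1024, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)
activation.gelu_fast(out, x)     # fused GELU kernel, writes into `out`
print(out.shape)
```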

RELEASE · 1mo ago
tinybox ships red v2 with 4x 9070 XT and 64 GB GPU RAM for $12,000

tiny corp is shipping tinybox red v2 at $12,000 with four 9070 XT GPUs and 64 GB of GPU memory, alongside higher-end Blackwell systems. Buyers are weighing the bundled tinygrad stack against DIY rigs, model-fit limits, and cloud economics.

RELEASE · 1mo ago
Arm launches AGI CPU with 136 Neoverse V3 cores and 272-core blade

Arm introduced its first production server chip under its own banner, with up to 136 Neoverse V3 cores and a 272-core dual-node reference blade. The launch pushes Arm deeper into direct datacenter silicon for agentic AI workloads, not just IP licensing.

NEWS · 1mo ago
Artificial Analysis launches AA-AgentPerf for 200-turn, 100K-token coding traces

Artificial Analysis introduced AA-AgentPerf to benchmark hardware on real coding-agent traces instead of synthetic chat prompts. The benchmark reports users per accelerator, per kW, per dollar, and per rack, so teams can compare production cost and throughput more realistically.
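The per-accelerator, per-kW, per-dollar framing is unit conversion once you fix a demand profile. A worked example with entirely made-up numbers (none of these figures are AA-AgentPerf results):

```python
# All inputs are hypothetical; AA-AgentPerf publishes its own measured values.
tokens_per_sec_per_gpu = 12_000   # sustained prefill+decode throughput per accelerator
tokens_per_user_sec = 150         # one agent's average token demand on long traces
gpu_power_kw = 1.0                # board plus a share of host and cooling
gpu_price_per_hour = 3.50         # rental or amortized cost
gpus_per_rack = 72

users_per_gpu = tokens_per_sec_per_gpu / tokens_per_user_sec
print(f"users/accelerator: {users_per_gpu:.0f}")
print(f"users/kW:          {users_per_gpu / gpu_power_kw:.0f}")
print(f"users/$ (hourly):  {users_per_gpu / gpu_price_per_hour:.1f}")
print(f"users/rack:        {users_per_gpu * gpus_per_rack:.0f}")
```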

NEWS · 1mo ago
Google Research launches TurboQuant: 6x KV-cache compression, 8x faster H100 attention

TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.
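The blurb doesn't describe TurboQuant's scheme, so here is only the generic shape of KV-cache quantization: store K/V in low-bit integers with per-group scales and dequantize on read. A minimal 4-bit roundtrip sketch; TurboQuant's actual method may differ substantially:

```python
import torch

def quantize_kv(x: torch.Tensor, bits: int = 4):
    """Asymmetric quantization with groups along the last dim. x: (seq, heads, d)."""
    qmax = 2**bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    scale = (x.amax(dim=-1, keepdim=True) - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo        # real kernels pack two 4-bit values per byte

def dequantize_kv(q, scale, lo):
    return q.to(scale.dtype) * scale + lo

k = torch.randn(4096, 8, 128)                 # a long-context K cache slice
q, s, z = quantize_kv(k)
err = (dequantize_kv(q, s, z) - k).abs().mean()
print(f"mean abs roundtrip error: {err:.4f}")
```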

RELEASE · 1mo ago
Miles adds ROCm support on AMD Instinct and raises AIME to 0.729

Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.

RELEASE · 1mo ago
Google Colab releases MCP server for notebook GPUs via uvx install

Google open-sourced a Colab MCP server that exposes code execution, runtime connection, and notebook editing to MCP-compatible agents. It gives local coding agents a direct bridge to cloud GPUs without hand-rolled notebook automation.

NEWS · 1mo ago
Meta raises AI capacity with up to $27B Nebius infrastructure deal

Meta agreed to buy up to $27 billion of AI infrastructure from Nebius over five years, including $12 billion of dedicated capacity and optional overflow tied to Vera Rubin deployments. Plan for tighter next-generation GPU supply as hyperscalers lock in capacity years ahead of spot demand.

NEWS · 1mo ago
NVIDIA launches Nemotron Coalition with Mistral, LangChain, and Perplexity

NVIDIA introduced a coalition of labs and platform vendors, including Mistral, LangChain, Perplexity, Cursor, Reflection, Sarvam, and Black Forest Labs, to co-develop open frontier models. Watch it if you want open-model efforts tied to DGX Cloud, NIM, and production tooling instead of weights alone.

NEWS · 1mo ago
DistCA claims 1.35x long-context training gains with disaggregated core attention

Researchers released DistCA, a training system that offloads stateless core attention to dedicated servers and reports up to 1.35x throughput gains on long-context workloads. Evaluate it for very long-sequence training where attention imbalance strands GPUs and creates pipeline stalls.

NEWS · 2mo ago
Researchers report US data centers may need 697–1,451 MGD of new water capacity by 2030

Researchers report US data centers may need 697–1,451 million gallons per day of new peak water capacity by 2030 in a baseline scenario, even if national totals stay small. Model local peak-day water constraints, not just annual averages, when planning new clusters.

NEWS · 2mo ago
FlashAttention-4 benchmarks 1613 TFLOPs/s on B200, 1.3x over cuDNN 9.13

FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1613 TFLOPs/s on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.
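Before chasing new kernels, measure your current baseline. A minimal sketch timing PyTorch's scaled_dot_product_attention with the flash backend forced; shapes, dtype, and the causal-FLOP convention are placeholders for your workload:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def time_attention(b=4, h=32, s=8192, d=128, iters=50):
    q, k, v = (torch.randn(b, h, s, d, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):      # force the flash backend
        for _ in range(5):                             # warmup
            torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        end.record()
        torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    # Causal attention does ~2*b*h*s^2*d FLOPs (QK^T plus PV, halved by the mask).
    tflops = 2 * b * h * s * s * d / (ms * 1e-3) / 1e12
    print(f"{ms:.2f} ms/iter, ~{tflops:.0f} TFLOPs/s")

time_attention()
```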

NEWS · 2mo ago
Ollama updates cloud to NVIDIA B300 for Kimi K2.5 and GLM-5 on $0, $20, and $100 plans

Ollama says its cloud now runs Kimi K2.5 and GLM-5 on NVIDIA B300 hardware while keeping fixed $0, $20, and $100 plans. Try it if you want hosted open models with more predictable spend for always-on agent workloads.

RELEASE · 2mo ago
FastVideo claims 5-second 1080p generation in 4.55s on one GPU

FastVideo published an LTX-2.3 inference stack that claims 5-second 1080p text-image-to-audio-video generation in 4.55 seconds on a single GPU. If the results hold up, test it for lower-cost interactive video generation and faster iteration loops.

NEWS · 2mo ago
Epoch AI reports top chip designers used about 90% of HBM and CoWoS supply in 2025

Epoch AI estimates that NVIDIA, Google, AMD, and Amazon consumed nearly all high-bandwidth memory and advanced packaging tied to frontier AI chips in 2025. Track this if you are planning compute, custom silicon, or open-weight infrastructure strategy.

NEWS · 2mo ago
Thinking Machines Lab launches 1GW Vera Rubin partnership with NVIDIA

Thinking Machines and NVIDIA announced a multi-year plan to deploy at least 1 gigawatt of Vera Rubin systems for training and customizable AI platforms. Watch it as a marker of how frontier training capacity is concentrating into a few very large infrastructure bets.

RELEASE · 2mo ago
Together GPU Clusters adds autoscaling, RBAC, observability, and self-healing

Together GPU Clusters added autoscaling, RBAC, observability, and self-healing controls to its managed cluster product. Use it if your team is moving from ad hoc GPU pools to production training or inference and needs more platform controls out of the box.

NEWS · 2mo ago
Oracle says Abilene AI data center stays on schedule with 200MW operational

Oracle disputed reports of delays at the Abilene site, said 200MW is already operational, and reiterated that the campus supports liquid cooling and multiple hardware generations. Infra teams tracking capacity and supplier signals should treat the recent delay narrative as disputed.
