Serving stacks and runtime systems for model inference.
New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.
Arm introduced its first production server chip under its own banner, with up to 136 Neoverse V3 cores and a 272-core dual-node reference blade. The launch pushes Arm deeper into direct datacenter silicon for agentic AI workloads, not just IP licensing.
tiny corp is shipping tinybox red v2 at $12,000 with four 9070 XT GPUs and 64 GB of GPU memory, alongside higher-end Blackwell systems. Buyers are weighing the bundled tinygrad stack against DIY rigs, model-fit limits, and cloud economics.
Compromised LiteLLM 1.82.7 and 1.82.8 wheels executed a malicious .pth file at install time to exfiltrate credentials, and PyPI quarantined the releases. Treat fresh-package installs and AI infra dependencies as supply-chain risk, and check startup hooks on affected systems.
Google Research said TurboQuant can shrink KV cache storage to 3 bits with roughly 6x less memory, and early implementations already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.
Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.
TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.
A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.
OpenAI says Responses API requests can reuse warm containers for skills, shell, and code interpreter, cutting startup times by about 10x. Faster execution matters more now that Codex is spreading to free users, students, and subagent-heavy workflows.
New write-ups on Mamba-3 add more detail on its MIMO decode path, discretization changes, and complex-valued state updates. That gives infra teams a clearer basis for testing state-space models as inference-efficient alternatives in long-sequence or agent-heavy systems.
Unsloth Studio launched as an open-source web UI to run, fine-tune, compare, and export local models, with file-to-dataset workflows and sandboxed code execution. Try it if you want to move prototype training and evaluation off cloud notebooks and onto local or rented boxes.
Dreamverse paired Hao AI Lab's FastVideo stack with an interface for editing video scenes in a faster-than-playback loop, using quantization and fused kernels to keep latency below viewing time. The stack is interesting if you are building real-time multimodal generation or multi-user video serving.
Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN, and Llama 3.2 1B at 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.
Ollama 0.18.1 added OpenClaw web search and fetch plugins plus non-interactive launch flows for CI, scripts, and container jobs. Pair it with Pi and Nemotron 3 Nano 4B if you want unattended agent jobs on constrained hardware.
Mistral shipped Mistral Small 4, a 119B MoE model with 6.5B active parameters, multimodal input, configurable reasoning, and Apache 2.0 weights. Deploy it quickly in existing stacks if you use SGLang or vLLM, which added day-one support.
NVIDIA introduced a coalition of labs and platform vendors to co-develop open frontier models, including Mistral, LangChain, Perplexity, Cursor, Reflection, Sarvam, and Black Forest Labs. Watch it if you want open-model efforts tied to DGX Cloud, NIM, and production tooling instead of weights alone.
Researchers released DistCA, a training system that offloads stateless core attention to dedicated servers and reports up to 1.35x throughput gains on long-context workloads. Evaluate it for very long-sequence training where attention imbalance strands GPUs and creates pipeline stalls.
OpenClaw added Ollama as an official provider through openclaw onboard --auth-choice ollama, alongside documented OpenAI-compatible self-hosted backends such as vLLM. Use it to run Claw workflows against local or custom models instead of a single hosted stack.
FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1613 TFLOPs/s on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.
Tiiny claims its pocket-sized local AI server can run open models up to 120B and expose an OpenAI-compatible local API without token fees. Privacy-sensitive teams should validate throughput and model quality before deploying always-on local agents.
Ollama says its cloud now runs Kimi K2.5 and GLM-5 on NVIDIA B300 hardware while keeping fixed $0, $20, and $100 plans. Try it if you want hosted open models with more predictable spend for always-on agent workloads.