Google LiteRT-LM PR adds Gemma4 NPU support ahead of an expected release
A Google bot-authored LiteRT-LM pull request references Gemma4 and AIcore NPU support, while multiple posts claim the largest variant will land around 120B total parameters with 15B active. Engineers targeting on-device inference should wait for a formal model card before locking in plans.

TL;DR
- A GitHub pull request in Google's google-ai-edge/LiteRT-LM repo explicitly says "Add NPU support for AIcore for Gemma4 model," and the screenshots shared by the first sighting and a second capture show it coming from copybara-service[bot], Google's internal sync bot.
- Separate posts are circulating a parameter rumor: one widely shared claim says Gemma 4's largest variant could be "~120B total" with "15B active" parameters, and a follow-on post repeats the same mixture-of-experts-style sizing, but there is no official model card in the evidence.
- The repo reference matters because it ties Gemma4 to LiteRT-LM and "AIcore" NPU support in the PR text from the GitHub screenshot, which points to on-device or edge inference work rather than just a model name leak.
- Release timing still looks unofficial: a launch-week post quotes Logan Kilpatrick saying it's "going to be a fun week of launches," but the available evidence stops short of a formal announcement, a weights release, or an API spec.
What the PR actually shows
The concrete signal is narrow but real. The screenshots shared in the original post and a second post show an open PR titled "Add NPU support for AIcore for Gemma4 model" in google-ai-edge/LiteRT-LM, with a comment from copybara-service[bot] repeating the same text. The OCR'd text in both screenshots describes Copybara-Service as "an helper app for Google Copybara, synchronizing repositories maintained by Google," which makes this look like an internal-to-public repo sync rather than a random third-party fork.
For engineers, the interesting part is not just the string "Gemma4." It is the coupling of Gemma4 with LiteRT-LM, NPU support, and AIcore in the PR title itself. That suggests Google is plumbing runtime support for a new model family into its lightweight inference stack before, or alongside, a public release.
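Because all of this comes from screenshots, the PR claim is easy to check against the live repo. Here is a minimal sketch, assuming the `requests` package, that queries the public GitHub search API for pull requests in google-ai-edge/LiteRT-LM with "Gemma4" in the title; the repo name and search term come from the posts above, and the script is illustrative rather than part of any LiteRT-LM tooling:

```python
import requests

# Search the public GitHub API for PRs in google-ai-edge/LiteRT-LM whose
# titles mention "Gemma4". Unauthenticated requests are rate-limited but
# fine for a one-off check.
resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": "repo:google-ai-edge/LiteRT-LM is:pr in:title Gemma4"},
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
resp.raise_for_status()
for pr in resp.json()["items"]:
    print(f'#{pr["number"]} [{pr["state"]}] {pr["title"]} by {pr["user"]["login"]}')
```

If the search returns the "Add NPU support for AIcore for Gemma4 model" PR from copybara-service[bot], the screenshots check out; if it returns nothing, the PR may have been renamed or the captures may be stale.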
What the size rumors say
The parameter details are still rumor, not announcement. In one supporting post, the claim is that Gemma 4's biggest model will be "around 120B total" with "15B active" parameters; another post repeats "120b in total, 15b active parameters." If accurate, that would imply an MoE-style architecture where only a subset of parameters is active per token.
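If that sizing is right, the practical consequence is an asymmetry worth planning around: memory footprint scales with total parameters, while per-token compute scales with active parameters. A quick back-of-envelope sketch in Python, using only the rumored figures (nothing here is a confirmed spec):

```python
# Rumored sizing only: ~120B total parameters, ~15B active per token.
TOTAL_PARAMS = 120e9
ACTIVE_PARAMS = 15e9

# Resident weight memory at common quantization widths.
for label, bytes_per_param in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    weights_gib = TOTAL_PARAMS * bytes_per_param / 2**30
    print(f"{label}: ~{weights_gib:.0f} GiB of weights must stay resident")

# Forward-pass compute is roughly 2 FLOPs per *active* parameter per token,
# so per-token cost looks like a 15B dense model even though memory looks
# like a 120B one.
print(f"~{2 * ACTIVE_PARAMS / 1e9:.0f} GFLOPs per generated token")
```

Even at int4 that is roughly 56 GiB of resident weights, far beyond phone-class NPUs, so if the rumor holds, the LiteRT-LM/AIcore work presumably targets smaller Gemma4 variants rather than the flagship.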
What is missing matters just as much. None of the evidence includes an official Google post, model card, context window, tokenizer details, benchmark table, quantization guidance, license update, or API availability. So the sizing rumor is useful as an early planning signal, but it does not yet answer deployability questions.
Why this still isn't a release
The strongest timing hint comes from the launch-week comment, which says Logan Kilpatrick called it "going to be a fun week of launches." Read together with the LiteRT-LM PR, that makes an imminent Gemma 4 reveal plausible.
But the evidence still describes a pre-release state. There are no published weights, no serving endpoints, and no reproducible evals attached to the leak. Right now the actionable facts are limited to a Google-linked LiteRT-LM PR mentioning "Gemma4" and "AIcore" NPU support, plus an unverified large-model sizing claim circulating in social posts.