Moondream releases Photon 1.2.0 with Apple Silicon, native Windows CUDA, and 23 ms B200 latency
Moondream shipped Photon 1.2.0, expanding its inference engine to Apple Silicon, Windows CUDA, Blackwell, and Jetson Thor, then outlined how custom Metal kernels and fused ops made local vision practical without MLX. That broadens deployment options for edge and on-device vision workloads while keeping server-class latency on B200 systems.

TL;DR
- Moondream shipped Photon 1.2.0, with moondreamai's launch post adding Apple Silicon, native Windows CUDA, NVIDIA Blackwell, and Jetson Thor support and framing the release as a broader push to run vision inference locally across laptops, edge devices, and server GPUs.
- On Blackwell, moondreamai's B200 benchmark post put single-request latency at about 23 ms for Moondream 2 and 30 ms for Moondream 3, and the launch post says batch-64 throughput on B200 is 1.49x and 1.23x faster than H100 for those two models.
- Apple Silicon support did not come through MLX: according to vikhyatk's architecture note, Photon is too entangled with its roughly 15,000-line PyTorch-and-Rust runtime for an MLX port to be practical, so the team wrote custom Metal kernels rather than maintain a second runtime.
- vikhyatk's token-sampling example gives the clearest datapoint on that port, cutting one decode path from 687 microseconds to 130 microseconds by fusing 14 torch ops into one Metal kernel.
- The practical pitch is edge-first vision, with vikhyatk's edge-inference post arguing that image upload time and privacy constraints often matter more than raw accelerator speed, and martinbowling's on-prem report describing pilot deployments on Apple Silicon.
You can inspect the product page, the launch post, and the local runtime docs. The weirdly useful reveal came a few hours after launch, when vikhyatk's Apple Silicon thread explained why the Mac port avoided MLX, and vikhyatk's RTX PRO 6000 screenshot added a workstation datapoint the main launch thread barely mentioned.
Hardware matrix
Photon 1.2.0 changes the deployment map more than the raw version number suggests. The launch post adds five concrete buckets:
- Apple Silicon: native inference on M-series Macs
- Windows x86_64: native CUDA inference, no WSL required
- NVIDIA Blackwell: B200 and RTX PRO 6000
- Jetson Thor: JetPack 7 and CUDA 13
- Existing NVIDIA GPUs: faster prefill, MoE, and dispatch, plus improved tail latency
The install path stayed simple. Both the launch post and moondreamai's blog-link post use the same `pip install moondream` plus `local=True` flow, with Photon running inference on the local machine instead of the hosted API.
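For orientation, here is a minimal sketch of that flow, assuming the shipped client mirrors the hosted moondream API; `local=True` is quoted from the posts, while the `vl()` entry point and `query()` call are assumptions here, not confirmed signatures.

```python
# Sketch of the local Photon flow described in the launch post.
# local=True comes from the posts; vl() and query() mirror the hosted
# moondream client and may differ in the shipped API.
import moondream as md
from PIL import Image

model = md.vl(local=True)            # Photon runs on this machine, not the hosted API
image = Image.open("warehouse.jpg")  # any local image
result = model.query(image, "How many boxes are on the pallet?")
print(result["answer"])
```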
Apple Silicon
The Mac port is a custom Metal job, not an MLX integration. vikhyatk's MPS note says plain PyTorch-on-Metal was too slow because each op paid about 100 microseconds of host overhead, while the launch post says Photon now uses native Metal kernels across paged attention, rotary embeddings, KV cache management, MoE routing, sampling, and layer norm.
The implementation detail that matters is fusion. In vikhyatk's token-sampling example, a 14-op sampling path became one Metal kernel, dropping per-token time from 687 microseconds to 130 microseconds.
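To see why fusion pays off, consider an illustrative eager-mode sampling chain of the kind vikhyatk describes; this is not Photon's code, just a reminder of how quickly small dispatches accumulate when each op costs about 100 microseconds of host overhead on MPS.

```python
# Illustrative only: a typical unfused sampling chain. Every call below
# is a separate kernel dispatch on PyTorch's MPS backend, each paying
# host-side launch overhead; Photon collapses the whole path into one
# Metal kernel.
import torch

def sample_eager(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> torch.Tensor:
    logits = logits / temperature                     # dispatch 1
    values, indices = torch.topk(logits, top_k)       # dispatch 2
    probs = torch.softmax(values, dim=-1)             # dispatch 3
    choice = torch.multinomial(probs, num_samples=1)  # dispatch 4
    return indices.gather(-1, choice)                 # dispatch 5
    # A real decode path also applies repetition penalties, masking,
    # and dtype casts, which is how it grows to 14 separate ops.
```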
The reference Mac numbers are modest but usable. The launch post reports 7.26 requests/sec for Moondream 2 and 4.58 requests/sec for Moondream 3 on an M5 Max MacBook Pro at batch size 4, plus 0.79 and 0.55 requests/sec on an M2 Mac mini. vikhyatk's latency argument says that tradeoff is fine for interactive vision workloads where network round trips dominate wall-clock latency.
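A rough sanity check of that argument, using assumed numbers rather than figures from the posts:

```python
# Back-of-envelope version of vikhyatk's latency argument. The 2 MB
# image and 20 Mbit/s uplink are assumptions; 23 ms is the B200
# single-request figure from the launch post.
image_bytes = 2 * 1024 * 1024
uplink_bits_per_s = 20e6
upload_ms = image_bytes * 8 / uplink_bits_per_s * 1000  # ~840 ms just moving the image
hosted_infer_ms = 23
print(f"upload ~{upload_ms:.0f} ms vs hosted inference {hosted_infer_ms} ms")
# Even a much slower local decode wins once the image never leaves the machine.
```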
Blackwell and Jetson Thor
The server and edge ends of the release both got hard numbers.
- B200, Moondream 2: about 23 ms single-request latency, 93.61 requests/sec at batch 64, per the launch post
- B200, Moondream 3: about 30 ms single-request latency, 71.27 requests/sec at batch 64, per the launch post
- B200 vs H100: 1.49x faster on Moondream 2 and 1.23x faster on Moondream 3 at batch 64, per the launch post
- RTX PRO 6000: 39.3 requests/sec on Moondream 2 and 39.7 requests/sec on Moondream 3 at batch 64, according to vikhyatk's RTX PRO 6000 screenshot
- Jetson AGX Thor: about 152 ms latency for Moondream 2 and 147 ms for Moondream 3, with 14.53 and 12.05 requests/sec at batch 64, per the launch post
Under the hood, the launch post attributes the Blackwell gains to Blackwell-specific MoE kernels and dedicated flash-attention kernels for decode and prefill. On Jetson, the quieter detail is packaging: Photon now ships a multi-CUDA Linux aarch64 wheel that auto-selects CUDA 13 for Thor and CUDA 12 for Orin and GH200.
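The posts do not say how that wheel picks a backend. One common pattern, sketched here purely as an assumption, is to query the CUDA driver version at import time and load the matching kernel set:

```python
# Hypothetical sketch of multi-CUDA backend selection; Photon's actual
# mechanism is not published. cuDriverGetVersion is a real CUDA driver
# API call; everything else here is invented for illustration.
import ctypes

def detect_cuda_major() -> int:
    """Ask the CUDA driver for its version, e.g. 13000 -> 13."""
    cuda = ctypes.CDLL("libcuda.so.1")
    version = ctypes.c_int()
    cuda.cuInit(0)
    cuda.cuDriverGetVersion(ctypes.byref(version))
    return version.value // 1000

# CUDA 13 kernels on Thor, CUDA 12 on Orin and GH200.
backend = "cu13" if detect_cuda_major() >= 13 else "cu12"
```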
Engine internals
Photon's speed story is not just kernels. vikhyatk's engine-level post and the Photon page both describe a bigger inference stack built around request scheduling, native image processing, prefix caching, automatic batching, and a paged KV cache.
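To ground the prefix-caching piece of that list, here is a toy trie-style sketch of the idea (a true radix tree would compress runs of tokens into single edges); all names and structures are invented here, not Photon's:

```python
# Toy prefix cache: map previously prefilled token sequences to handles
# into a paged KV cache, so a new request can skip recomputing a shared
# prefix. Illustrative only.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict[int, Node] = field(default_factory=dict)
    kv_block: int | None = None  # handle into the paged KV cache

class PrefixCache:
    def __init__(self) -> None:
        self.root = Node()

    def longest_prefix(self, tokens: list[int]) -> int:
        """Count how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, matched = node.children[t], matched + 1
        return matched

    def insert(self, tokens: list[int], blocks: list[int]) -> None:
        """Record the KV blocks produced while prefilling these tokens."""
        node = self.root
        for t, b in zip(tokens, blocks):
            node = node.children.setdefault(t, Node())
            node.kv_block = b
```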
That architecture explains why the Apple Silicon port went the way it did. vikhyatk's architecture note says the existing runtime already bundled the scheduler, KV manager, radix-tree prefix caching, LoRA support, image pipeline, and skill state machines, which made an MLX rewrite look like permanent double maintenance.
The result is a local runtime aimed at more than demos. martinbowling's on-prem report mentions pilot projects for self-checkout and warehouse delivery tracking on Apple Silicon, which is about as concrete as early hands-on feedback gets for a vision stack.
Docs drift
One small mismatch is still visible in public docs. The local runtime docs that Exa indexed still describe Photon as an NVIDIA GPU runtime for Ampere-or-newer hardware, while moondreamai's launch post and the May 1 launch post clearly add Apple Silicon support, with macOS 13+ and Python 3.12 as the stated requirements.
That leaves the blog post as the canonical source for the new matrix today, and the docs page as a lagging snapshot of the pre-1.2.0 world.