OpenBMB releases MiniCPM-V 4.6 1.3B with 75.7 ms TTFT and 19x token efficiency

OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8% lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5-0.8B. It targets edge deployment across mobile platforms and common inference stacks.


TL;DR

You can grab the Hugging Face weights, browse the GitHub repo, compare the model on Artificial Analysis' tiny open-source leaderboard, and spin up the web demo. The interesting bit is how much of the launch is framed around compression knobs and deployment surfaces, not just raw scores.
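If you want to kick the tires, loading should follow the same remote-code pattern as earlier MiniCPM-V releases. A minimal sketch, assuming the repo id and the custom `.chat()` interface carry over from the 2.6-era model cards (neither is confirmed here for 4.6):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Repo id extrapolated from OpenBMB's naming for earlier releases; check the
# actual Hugging Face page before using.
MODEL_ID = "openbmb/MiniCPM-V-4_6"

model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("receipt.png").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is the total on this receipt?"]}]

# `.chat()` mirrors the custom interface earlier MiniCPM-V model cards document.
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```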

LLaVA-UHD v4

OpenBMB's pitch is mostly an inference story. The LLaVA-UHD v4 note says MiniCPM-V 4.6 cuts vision-encoding FLOPs by 55.8% without degrading performance, using two specific changes:

  • Intra-ViT early compression
  • Hybrid 4x and 16x visual compression in one model

That second knob matters: the benchmark table in mervenoyann's screenshot shows separate 4x and 16x results for the same checkpoint, so users aren't locked into a single visual-token ratio.
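To see what the two ratios mean in tokens, here's a back-of-the-envelope count. The patch size (14) and the simple divide-by-compression model are illustrative assumptions, not MiniCPM-V 4.6's actual pipeline; the point is only the 4x-vs-16x gap at the 3136² resolution OpenBMB uses for its TTFT claim:

```python
def visual_tokens(height: int, width: int, patch: int = 14, compression: int = 1) -> int:
    """Rough visual-token estimate: ViT patch count divided by a compression factor.

    patch=14 and the simple division are assumptions for illustration only.
    """
    return ((height // patch) * (width // patch)) // compression

# At the 3136x3136 resolution from OpenBMB's TTFT claim:
for c in (4, 16):
    print(f"{c}x compression -> ~{visual_tokens(3136, 3136, compression=c):,} visual tokens")
# 4x  -> ~12,544 visual tokens
# 16x -> ~3,136 visual tokens
```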

Latency and token budget

OpenBMB's own numbers center on three efficiency claims:

  1. 75.7 ms TTFT on a 4090 at 3136² resolution
  2. About 1.5x token throughput over Qwen3.5-0.8B on a 4090
  3. 19x fewer tokens than Qwen3.5-0.8B, and 43x fewer than Qwen3.5-0.8B-Thinking, on the Artificial Analysis benchmark

The outside validation is directionally similar. In ArtificialAnlys' model summary, the model scores 13 on the Intelligence Index, uses 5.4M output tokens, and lands as the highest-scoring open-weights model under 2B parameters in that benchmark set. The same thread also flags the tradeoff: knowledge recall is weak, with an AA-Omniscience score of -85, which ArtificialAnlys' full-results post says is in line with other sub-2B non-reasoning models.
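Composing those multipliers with the 5.4M figure gives a rough sense of the budget gap. The arithmetic below assumes the 19x and 43x ratios and the 5.4M output-token count all come from the same Artificial Analysis run, which the posts imply but don't state outright:

```python
minicpm_output_tokens = 5.4e6  # Artificial Analysis figure for MiniCPM-V 4.6

# Implied budgets if the quoted ratios apply to the same benchmark set:
qwen = 19 * minicpm_output_tokens           # ~102.6M tokens
qwen_thinking = 43 * minicpm_output_tokens  # ~232.2M tokens

print(f"Qwen3.5-0.8B:          ~{qwen / 1e6:.0f}M output tokens")
print(f"Qwen3.5-0.8B-Thinking: ~{qwen_thinking / 1e6:.0f}M output tokens")
```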

Edge deployment

The launch thread is unusually concrete about where the model is meant to run. OpenBMB's developer support post lists:

  • iOS, Android, and HarmonyOS deployment
  • Fine-tuning on consumer GPUs via SWIFT and LLaMA-Factory
  • Compatibility with SGLang, vLLM, llama.cpp, and Ollama
  • Quantized releases in GGUF, BNB, AWQ, and GPTQ

That package makes this feel more like a shipping edge stack than a research checkpoint. The MiniCPM-V GitHub repo and MiniCPM-V-Apps repo are both linked directly from the launch thread.
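For the server-side stacks, the obvious path is vLLM. A serving sketch, with the caveat that the repo id is extrapolated and the prompt below is vLLM's generic multimodal placeholder format rather than the model's own chat template (which should come from its tokenizer):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Repo id extrapolated from OpenBMB's naming; trust_remote_code pulls in the
# model's custom processing code, as with earlier MiniCPM-V releases.
llm = LLM(model="openbmb/MiniCPM-V-4_6", trust_remote_code=True, max_model_len=4096)

image = Image.open("chart.png").convert("RGB")
outputs = llm.generate(
    {
        # Generic placeholder prompt; the real format should come from the
        # model's chat template.
        "prompt": "USER: <image>\nSummarize this chart. ASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=256, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```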

Day-0 runtimes

The ecosystem response was fast. lmsysorg's SGLang post says SGLang added day-0 support and includes a deployment screenshot with generated serve flags, including --tool-call-parser qwen and a Mamba radix cache option.

OpenBMB also highlighted Apple Silicon throughput. In OpenBMB's MLX-VLM thank-you post, the team cites 125 tok/s at full precision on an M3 Max via MLX-VLM, which is the most concrete non-NVIDIA runtime number in the evidence set.
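A sketch of that path with MLX-VLM's `load` and `generate` helpers. The converted-weights id is hypothetical, and mlx-vlm's generate signature has shifted across versions, so treat this as the shape of the call rather than exact syntax:

```python
from mlx_vlm import load, generate

# Hypothetical mlx-community conversion id; the real one may differ.
model, processor = load("mlx-community/MiniCPM-V-4_6-bf16")

# Keyword args sidestep the argument-order changes across mlx-vlm versions.
out = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="photo.jpg",
    max_tokens=128,
)
print(out)
```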
