OpenBMB releases MiniCPM-V 4.6 1.3B with 75.7 ms TTFT and 19x token efficiency
OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8% lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5-0.8B. It targets edge deployment across mobile platforms and common inference stacks.

TL;DR
- OpenBMB shipped MiniCPM-V 4.6 as a 1.3B vision-language model aimed at edge deployment; the launch thread says it handles high-resolution images while staying small enough for mobile and consumer hardware.
- According to OpenBMB's LLaVA-UHD v4 note, the main architectural change is a 55.8% cut in vision-encoding FLOPs through early compression plus switchable 4x and 16x visual compression.
- In OpenBMB's latency post, the team claims 75.7 ms TTFT on a single RTX 4090 with 3136² images, plus roughly 1.5x token throughput over Qwen3.5-0.8B.
- OpenBMB's token-efficiency post says MiniCPM-V 4.6 used 19x fewer tokens than Qwen3.5-0.8B, while ArtificialAnlys' model summary puts it at 5.4M output tokens for an Intelligence Index score of 13.
- Deployment is already spreading across common local stacks: OpenBMB's developer support post lists vLLM, llama.cpp, Ollama, and quantized formats, while lmsysorg's SGLang post and OpenBMB's MLX-VLM thank-you post add day-0 SGLang and MLX-VLM support.
You can grab the Hugging Face weights, browse the GitHub repo, compare the model on Artificial Analysis' tiny open-source leaderboard, and spin up the web demo. The interesting bit is how much of the launch is framed around compression knobs and deployment surfaces, not just raw scores.
LLaVA-UHD v4
OpenBMB's pitch is mostly an inference story. OpenBMB's LLaVA-UHD v4 note says MiniCPM-V 4.6 cuts vision-encoding FLOPs by 55.8% without degrading performance, using two specific changes:
- Intra-ViT early compression
- Hybrid 4x and 16x visual compression in one model
That second knob matters: the model card and the benchmark table in mervenoyann's screenshot show separate 4x and 16x results for the same checkpoint, so users aren't locked into a single visual-token ratio.
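If the 4x/16x switch is exposed at inference time, usage could look something like the sketch below. The repo id, the model.chat() call (borrowed from earlier MiniCPM-V checkpoints), and the compression_ratio argument are all assumptions rather than confirmed API; check the model card for the real knob.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Sketch only: the repo id and the compression knob are hypothetical, not confirmed API.
MODEL_ID = "openbmb/MiniCPM-V-4_6"  # hypothetical Hugging Face repo id

model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("receipt.png").convert("RGB")
msgs = [{"role": "user", "content": "Read the total amount on this receipt."}]

# Earlier MiniCPM-V checkpoints expose generation through model.chat(); the
# compression_ratio argument is a hypothetical name for the 4x/16x switch.
answer = model.chat(
    image=image, msgs=msgs, tokenizer=tokenizer,
    compression_ratio=16,  # hypothetical: 16x = fewer visual tokens, 4x = more detail
)
print(answer)
```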
Latency and token budget
OpenBMB's own numbers center on three efficiency claims:
- 75.7 ms TTFT on a 4090 at 3136² resolution
- About 1.5x token throughput over Qwen3.5-0.8B on a 4090
- 19x fewer tokens than Qwen3.5-0.8B, and 43x fewer than Qwen3.5-0.8B-Thinking, on the Artificial Analysis benchmark
The outside validation is directionally similar. In ArtificialAnlys' model summary, Artificial Analysis says the model scores 13 on its Intelligence Index, uses 5.4M output tokens, and lands as the highest-scoring open-weights model under 2B parameters in that benchmark set. The same thread also flags the tradeoff: knowledge recall is weak, with AA-Omniscience at -85, which ArtificialAnlys' full-results post says is in line with other sub-2B non-reasoning models.
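A quick back-of-envelope puts those multipliers on the same scale. It assumes the 19x and 43x ratios refer to the same Artificial Analysis run that produced the 5.4M output-token figure, which the posts don't state explicitly:

```python
# Implied output-token budgets, assuming the 19x / 43x multipliers apply to the
# same Artificial Analysis run that measured MiniCPM-V 4.6 at 5.4M output tokens.
minicpm_tokens = 5.4e6

qwen_tokens = minicpm_tokens * 19           # ~103M for Qwen3.5-0.8B
qwen_thinking_tokens = minicpm_tokens * 43  # ~232M for Qwen3.5-0.8B-Thinking

print(f"Qwen3.5-0.8B:          ~{qwen_tokens / 1e6:.0f}M output tokens")
print(f"Qwen3.5-0.8B-Thinking: ~{qwen_thinking_tokens / 1e6:.0f}M output tokens")
```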
Edge deployment
The launch thread is unusually concrete about where the model is meant to run. OpenBMB's developer support post lists:
- iOS, Android, and HarmonyOS deployment
- Fine-tuning on consumer GPUs via SWIFT and LLaMA-Factory
- Compatibility with SGLang, vLLM, llama.cpp, and Ollama
- Quantized releases in GGUF, BNB, AWQ, and GPTQ
That package makes this feel more like a shipping edge stack than a research checkpoint. The MiniCPM-V GitHub repo and MiniCPM-V-Apps repo are both linked directly from the launch thread.
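For a sense of what the vLLM path might look like, here is a minimal offline sketch. The repo id is a placeholder, MiniCPM-V 4.6 support in vLLM is taken from the launch thread rather than verified here, and the image-token placeholder in the prompt depends on the model's actual chat template:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Offline vLLM sketch; the repo id is a placeholder and the prompt format is assumed,
# since the exact image token comes from the model's own chat template.
llm = LLM(model="openbmb/MiniCPM-V-4_6", trust_remote_code=True, max_model_len=4096)

image = Image.open("chart.png").convert("RGB")
prompt = "USER: <image>\nSummarize this chart in one sentence.\nASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```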
Day-0 runtimes
The ecosystem response was fast. lmsysorg's SGLang post says SGLang added day-0 support and includes a deployment screenshot with generated serve flags, including --tool-call-parser qwen and a Mamba radix cache option.
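Mapped onto SGLang's offline Engine, those serve flags would look roughly like the sketch below; the model path is a placeholder and the kwargs are assumed to mirror the CLI flags in the screenshot:

```python
import sglang as sgl

# Rough sketch: model_path is a hypothetical repo id; tool_call_parser mirrors the
# --tool-call-parser qwen flag shown in the deployment screenshot.
llm = sgl.Engine(
    model_path="openbmb/MiniCPM-V-4_6",
    tool_call_parser="qwen",
    trust_remote_code=True,
)

print(llm.generate("Describe the attached image in one sentence.",
                   {"max_new_tokens": 64}))
```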
OpenBMB also highlighted Apple Silicon throughput. In OpenBMB's MLX-VLM thank-you post, the team cites 125 tok/s at full precision on an M3 Max via MLX-VLM, which is the most concrete non-NVIDIA runtime number in the evidence set.