AI Primer
TOPIC 50 stories

Multimodal

Systems that combine text, image, audio, video, or UI inputs.

NEWS 26th June
Chandra reports Mistral OCR 4 scores are not reproducible and publishes repro scripts

Chandra's developer said the Mistral OCR 4 launch numbers, for both Chandra and OCR 4, could not be reproduced with public code, and published scripts showing the gaps. The dispute matters because Mistral OCR 4 launched on leaderboard claims, and benchmark settings now directly affect model selection.

RELEASE 26th June
Perceptron adds video_frames to Mk1 and cuts 1080p time-to-first-token from ~42s to ~4s

Perceptron launched a video_frames input for Mk1 that accepts pre-decoded frames with timestamps instead of forcing clip re-encoding. The change matters for edge and sparse-footage pipelines because a 10-minute 1080p video can start returning tokens in about 4 seconds instead of about 42, roughly ten times sooner.

RELEASE 25th June
Seedance 2.0 Mini launches on Venice, ComfyUI, and Pika MCP with 15s 720p video

A day after Seedance 2.0's 4K rollout story, partners began shipping the cheaper Seedance 2.0 Mini across Venice, ComfyUI, and Pika MCP. The 15-second 720p variant with native audio gives video workflows a lower-cost path than the flagship model.

NEWS 24th June
Seedance 2.0 adds native 4K as fal, Replicate, Pika MCP, and ComfyUI ship support

Seedance 2.0 rolled out native 4K generation while Seedance 2.0 Mini landed on fal, Replicate, Pika MCP, and ComfyUI. That matters because engineers can now reach the same video model family through APIs, MCP workflows, and local graph tooling instead of a single web surface.

RELEASE 24th June
Baidu releases Unlimited OCR with 3B params for single-pass long documents

Baidu released Unlimited OCR as an open-source long-document OCR model with 3B total parameters and 500M active at inference. Early ParseBench testing suggests it is strong on tables and reading order but weaker on semantic formatting and charts, giving teams a new open-weight OCR option with clear tradeoffs.

RELEASE 24th June
OpenRouter launches Image API with typed capabilities and exact USD cost

OpenRouter released a dedicated Image API that normalizes request shapes across 30-plus models from eight providers. Agents can inspect limits, passthrough options, streaming, and exact per-call cost without hardcoding vendor quirks.

RELEASE 23rd June
Mistral releases OCR 4 with bounding boxes and 85.20 OlmOCRBench

Mistral OCR 4 adds layout-aware extraction with bounding boxes, block typing, and inline confidence across 170 languages. Use it through the API or self-hosted deployments when document pipelines need structure, citations, redaction, and chunking.

RELEASE 23rd June
Perceptron releases Files API with reusable upload IDs

Perceptron’s Files API lets developers upload an image or video once and reference it by ID across later requests instead of resending base64 or URLs. That simplifies repeated multimodal workflows and cuts transfer overhead for video-heavy pipelines.

RELEASE 22nd June
Google ships Interactions API in GA as Gemini default with background agents

Google put the Interactions API into GA as the new default for Gemini, adding background execution, managed agents, remote sandboxes, and multimodal tools. Builders now get one stateful interface for models, long-running jobs, and future Gemini Omni support.

RELEASE 1w ago
lift-pdf releases 9B extractor with 90.2% accuracy and 9.5s p50

lift-pdf released an open-source 9B model for schema-constrained document extraction, with code, pip install, playground access, and a 90.2% score on the team's 225-document bench. It matters because the model claims near-Gemini 3.5 Flash accuracy at 9.5s p50, though coverage is still skewed toward Latin-language docs and commercial-use limits remain.

RELEASE 1w ago
Moonshot releases Kimi K2.7 Code HighSpeed at 180 tok/s with 2x API pricing

Moonshot rolled out HighSpeed for Kimi K2.7 Code, claiming about 180 tok/s on coding tasks, up to 260 tok/s on shorter contexts, and roughly 6x speedups. Watch the tight capacity limits and mixed benchmark results, and budget for the 2x pricing if you want the faster mode.

RELEASE 1w ago
ElevenLabs launches Music v2 on ElevenAPI with inpainting and 15¢-per-minute pricing

ElevenLabs launched Music v2 on ElevenAPI with track generation, reference matching, inpainting, and multilingual output. It gives developers a priced API for commercial music creation and section-level editing.

RELEASE 2w ago
Zyphra releases ZONOS2: 8B sparse MoE TTS with zero-shot voice cloning

Zyphra released ZONOS2 under Apache 2.0 with 8B total parameters, 900M active, zero-shot voice cloning, 44.1 kHz DAC audio, and ZTTS1-Eval. The release includes open weights, inference code, and eval code, so teams can run real-time multilingual TTS without a hosted-only stack.

RELEASE 2w ago
MiniMax opens M3 weights: 428B total, 23B active, 1M context

MiniMax published M3 weights on Hugging Face with 428B total parameters, 23B active parameters, 1M context, and multimodal support. Unsloth quickly added local GGUF builds, so teams can try 2-bit runs at 138GB RAM or VRAM and 3-bit at 165GB.

RELEASE 2w ago
Google launches Gemini 3.5 Live Translate for 70+ languages

Google released Gemini 3.5 Live Translate for low-latency speech translation across 70+ languages in the Gemini Live API, AI Studio, and Google Translate. The same model is also heading to Google Meet in private preview for Workspace customers.

NEWS 2w ago
Apple Intelligence adds Gemini-backed Siri beta with visual and on-screen understanding

Posts from WWDC say Apple Intelligence now combines Apple Foundation and Gemini models, and Siri gains visual, on-screen, and app-level actions. Watch for the beta rollout later this year; multiple posts say it will not ship in the EU at launch.

RELEASE 3w ago
Gemma 4 12B ships encoder-free multimodal local model with 16GB target and 256K context

Google released Gemma 4 12B, an Apache 2.0 encoder-free multimodal model with native audio and vision for 16GB-class laptops. Day-zero support in llama.cpp, vLLM, Ollama, MLX, and SGLang should make local agents and on-device apps easier to deploy immediately.

NEWS 3w ago
Hyper, OpenCode, Kilo, and Vals add Qwen 3.7 Plus support within 72 hours

Two days after Qwen 3.7 Plus launched, Hyper, OpenCode, Kilo, and Vals shipped support or rankings around the 1M-context multimodal model. The rapid pickup shows Alibaba’s new model landing quickly in coding-agent tools and public eval stacks outside its own platform.

RELEASE 3w ago
Microsoft launches MAI-Thinking-1 and six companion models with 97.0% AIME 2025

Microsoft introduced MAI-Thinking-1, MAI-Code-1-Flash, and five other MAI models across code, image, voice, and speech. The launch puts Microsoft back into the frontier-model race and starts landing pieces of the stack in Copilot and partner runtimes.

RELEASE 3w ago
H Company launches Holo 3.1 with local computer use and 79.3% AndroidWorld

H Company released Holo 3.1, a local computer-use VLM family with function calling and AndroidWorld scores of up to 79.3% on the 35B model. The update pushes computer-use agents toward local and mobile deployment instead of cloud-only runtimes.

RELEASE 3w ago
NVIDIA launches Cosmos 3 open 16B and 64B omnimodels with datasets and SGLang support

NVIDIA released Cosmos 3 as an open omnimodel family with 16B and 64B variants, plus code, datasets, and a coalition around physical AI. The release matters because it ships with serving support and top open-weight image and video rankings, so teams can deploy it rather than treat it as a research teaser.

RELEASE 3w ago
Qwen releases Qwen 3.7 Plus with multimodal agent mode and browser demos

Alibaba released Qwen 3.7 Plus as a multimodal agent model for GUI, CLI, coding, and browser tasks. It ships with browser demos and immediate Cline support, giving teams another frontier-style agent model to compare against M3 and closed-source tools.

RELEASE 4w ago
MiniMax M3 launches with 1M context and 59.0 SWE-Bench Pro

MiniMax shipped M3 with a 1M-token context window, native multimodal input, and frontier coding claims across SWE-Bench Pro, Terminal Bench, and MCP Atlas. It also appeared on OpenRouter, Ollama Cloud, Venice, Hermes, Cline, Together, and Arena on day one.

NEWS 4w ago
Grok Imagine Video 1.5 adds fal and Venice API access after xAI rollout

Grok Imagine Video 1.5 moved from arena ranking to usable APIs, with xAI docs live and third-party access on fal and Venice. That matters because developers can now script against the model through standard providers, though early #1 arena claims are already being challenged by side-by-side testers.

RELEASE 4w ago
Step 3.7 Flash opens 30-day free access for Hermes users via Nous Portal

A day after launch, Nous made Step 3.7 Flash free for 30 days to Hermes users through Nous Portal. The access window landed alongside fresh vLLM/NIM and MLX-VLM support, making the model easier to test in both local and production stacks.

RELEASE 4w ago
Grok Imagine Video 1.5 Preview ranks #1 in Image-to-Video Arena at $0.14 for 720p

Grok Imagine Video 1.5 Preview took the top 720p Image-to-Video Arena slot with a reported 52-point gain over the previous Grok video model. xAI docs and shared console pricing put the model at $0.08 for 480p and $0.14 for 720p, giving developers a concrete new API option for video generation.

NEWS 4w ago
Step 3.7 Flash launches with day-one support in Kilo, Modal, SGLang, Hermes, and DesignArena

Step 3.7 Flash, a 198B MoE with 256K context, landed immediately across Kilo, Modal, SGLang, Hermes-linked tooling, and DesignArena. The breadth of day-one support gives engineers multiple ways to serve, benchmark, and wire the new open-weight multimodal model into agents.

RELEASE 4w ago
Google makes Nano Banana 2 and Nano Banana Pro GA with video input and $0.045/$0.134 pricing

Google moved Nano Banana 2 and Nano Banana Pro to GA in AI Studio and the Gemini Enterprise Agent Platform. Nano Banana 2 also takes video as input, and the GA release gives image pipelines published per-image pricing and a production API.

NEWS 4w ago
SynthID adds OpenAI, ElevenLabs, and Kakao partners as Search and Chrome gain verification

Google expanded SynthID with new model partners and pushed verification into Search, Chrome, and Pixel video provenance flows. That matters because AI-content authentication is moving from isolated model outputs into mainstream browser and distribution surfaces.

RELEASE 1mo ago
Cohere releases Command A+ under Apache 2.0 with 25B active params and 2x H100 deployment

Cohere open-sourced Command A+, a 218B MoE multimodal model with 25B active parameters, 48-language support, and deployment starting at two H100s. Artificial Analysis put it at 37 on its Intelligence Index and 281 tok/s, and vLLM plus Transformers added support.

RELEASE 1mo ago
Gemini Omni Flash launches video-to-video edits and Google Flow rollout

Google launched Gemini Omni Flash as its first shipping any-input-to-video model, with character consistency, physics-aware scenes, and conversational video editing. Use it in Gemini, Flow, and YouTube surfaces first, and wait for API access if you need programmatic integration.

NEWS 1mo ago
Gemini desktop leaks Stream to Cursor, Spark local files, and Omni ahead of I/O

Leak videos and tester reports pointed to a larger Gemini desktop app with Stream to Cursor, Spark local-file access, Live, and Omni ahead of I/O. Independent testers also reported faster 3.2 and 3.5 Flash checkpoints, but Google had not announced the features publicly.

NEWS 1mo ago
Qwen opens 3.7 Max Preview and Plus Preview on Arena with a #10 coding rank

Alibaba put Qwen3.7 Max Preview and Qwen3.7 Plus Preview live on Arena and the Qwen site, with Arena placing Max Preview #13 overall and #10 for coding. That gives engineers an early read on the next Qwen generation before any broader API or open-weight release.

RELEASE 1mo ago
Perceptron releases Mk1 with 2 FPS video reasoning, 32K context, and $0.15 per 1M input

Perceptron launched Mk1, a multimodal model for video and embodied reasoning with native 2 FPS video, 32K context, and structured spatial outputs. OpenRouter access and the low input price make it usable for deployment, not just demos.

NEWS 1mo ago
Google introduces Gemini Intelligence on Android with browser use, AppFunctions, and Rambler

Google unveiled Gemini Intelligence at the Android Show with cross-app task automation, Gemini in Chrome, Rambler voice cleanup, custom widgets, and AppFunctions. The rollout moves Gemini into core Android workflows on Pixel and Galaxy devices this summer.

RELEASE 1mo ago
Diffusers 0.38.0 adds Ace-Step 1.5 pipelines and Flash Attention 4 support

Hugging Face released Diffusers 0.38.0 with new audio and image pipelines, Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Use the new profiling guidance to tune diffusion performance as you adopt the added model coverage.

RELEASE 1mo ago
Thinking Machines introduces interaction models with 200 ms full-duplex audio, video, and tool use

Thinking Machines previewed interaction models that process audio, video, and text in 200 ms micro-turns, letting the system listen, speak, and react at the same time. The demos matter because the interaction loop is trained into the model instead of stitched together from separate speech and tool layers.

RELEASE 1mo ago
OpenBMB releases MiniCPM-V 4.6 1.3B with 75.7 ms TTFT and 19x token efficiency

OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8% lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.

RELEASE 1mo ago
Zyphra releases ZAYA1-VL-8B with 700M active params and Apache 2.0

Zyphra released its first vision-language model, an 8B MoE with 700M active parameters and visual LoRA adapters. The model matters because it targets OCR, document reasoning, GUI interaction, and computer-use workloads under an Apache 2.0 license.

RELEASE 1mo ago
Google releases Gemini 3.1 Flash Lite GA with 1M context and $0.25 input pricing

Google moved Gemini 3.1 Flash Lite from preview to GA, and OpenRouter added the model with 1 million context and low-cost multimodal pricing. The preview endpoint now has a shutdown schedule, and users should verify whether the GA model differs from the March preview.

NEWS 1mo ago
AI Studio adds edit mode and Nano Banana image assets

Google added a redesigned edit mode to AI Studio Build with component selection, on-canvas annotation, and Nano Banana-generated image assets. The update makes AI Studio a more interactive app editor, so try it for iterative app tweaks instead of one-shot generation.

NEWS 1mo ago
Gemini API adds multimodal File Search with page citations

Google expanded Gemini API File Search to index text and images together, add custom metadata filtering, and return page-level citations. RAG builders can use it for tighter retrieval control and more auditable answers.

RELEASE 1mo ago
Moondream releases Photon 1.2.0 with Apple Silicon, native Windows CUDA, and 23 ms B200 latency

Moondream shipped Photon 1.2.0, expanding its inference engine to Apple Silicon, Windows CUDA, Blackwell, and Jetson Thor. It also outlined how custom Metal kernels and fused ops made local vision practical without MLX. The release broadens deployment options for edge and on-device vision workloads while keeping server-class latency on B200 systems.

RELEASE 1mo ago
DeepSeek removes visual-primitives repo after 90-KV vision details

DeepSeek briefly published a paper and threads on point-and-bbox reasoning, about 90 KV entries per 800² image, and RL-trained vision experts, then removed the repo and related mentions. The technique looked like a low-token path to computer use and multimodal reasoning in V4-Flash, but availability and reproducibility are now unclear.

RELEASE 2mo ago
Mistral releases Medium 3.5 with 128B weights, 256K context, and Work Mode

Mistral shipped Medium 3.5 as a 128B dense model with 256K context, configurable reasoning, remote agents in Vibe, and Work Mode in Le Chat. The release broadens Mistral’s agent stack, though early comparisons question its price-performance against newer open rivals.

WORKFLOW 2mo ago
Hermes Agent adds ComfyUI skill with `/comfyui` workflow installs and local/cloud control

Nous added a built-in ComfyUI skill to Hermes Agent, letting the agent install, launch, and run Comfy workflows on demand through a `/comfyui` command. The integration turns the wider Comfy ecosystem into a callable agent surface instead of a separate manual pipeline.

RELEASE 2mo ago
DeepSeek releases Vision beta for image understanding in DeepSeek Chat

DeepSeek began rolling out Vision beta as a new image-understanding mode in Chat, and early testers reported fast OCR and strong object recognition. The rollout appears limited or staggered, so watch for broader access and formal docs before relying on it.

RELEASE 2mo ago
Nemotron 3 Nano Omni launches 30B-A3B multimodal model with 256K context

NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.

RELEASE 2mo ago
MiMo-V2.5 opens under MIT with 1M context and SGLang and vLLM support

Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.

RELEASE 2mo ago
Qwen-Image-2.0-Pro launches at #9 on Arena with multilingual text rendering

Alibaba launched Qwen-Image-2.0-Pro on ModelScope and API with better prompt adherence, multilingual typography, and steadier style quality. The model is aimed at text-heavy jobs like UI mockups and posters, so test it for layout-heavy generation.

© 2026 AI Primer. All rights reserved.