Multimodal
Systems that combine text, image, audio, video, or UI inputs.
Stories
Perceptron launched Mk1, a multimodal model for video and embodied reasoning with native 2 FPS video, 32K context, and structured spatial outputs. OpenRouter access and the low input price make it usable for deployment, not just demos.
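Since the model is reachable through OpenRouter, a minimal sketch looks like any OpenAI-compatible call with the base URL swapped; the model slug `perceptron/mk1` is an assumption, so check OpenRouter's model list for the real identifier.

```python
# Minimal sketch: calling a video-capable model through OpenRouter's
# OpenAI-compatible endpoint. The slug "perceptron/mk1" is hypothetical.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="perceptron/mk1",  # hypothetical slug; verify on OpenRouter
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the spatial layout of this scene."},
            {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```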
Google unveiled Gemini Intelligence at the Android Show with cross-app task automation, Gemini in Chrome, Rambler voice cleanup, custom widgets, and AppFunctions. The rollout moves Gemini into core Android workflows on Pixel and Galaxy devices this summer.
Hugging Face released Diffusers 0.38.0 with new audio and image pipelines, Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Use the new profiling guidance to tune diffusion performance as you adopt the added model coverage.
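For orientation, standard Diffusers pipeline loading looks like the sketch below; the 0.38.0-specific pieces (FlashPack loading, Flash Attention 4 flags) are not shown because their exact options should be checked against the release notes.

```python
# Minimal sketch of standard Diffusers usage with any supported checkpoint.
# Version-specific loaders and attention backends are omitted on purpose.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe("a poster with large, legible typography", num_inference_steps=30).images[0]
image.save("out.png")
```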
Thinking Machines previewed interaction models that process audio, video, and text in 200 ms micro-turns, letting the system listen, speak, and react at the same time. The demos matter because the interaction loop is trained into the model instead of stitched together from separate speech and tool layers.
OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8 percent lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.
Zyphra released its first vision-language model, an 8B MoE with 700M active parameters and visual LoRA adapters. The model matters because it targets OCR, document reasoning, GUI interaction, and computer-use workloads under an Apache 2.0 license.
Google moved Gemini 3.1 Flash Lite from preview to GA, and OpenRouter added the model with 1 million context and low-cost multimodal pricing. The preview endpoint now has a shutdown schedule, and users should verify whether the GA model differs from the March preview.
Google added a redesigned edit mode to AI Studio Build with component selection, on-canvas annotation, and Nano Banana-generated image assets. The update makes AI Studio a more interactive app editor, so try it for iterative app tweaks instead of one-shot generation.
Google expanded Gemini API File Search to index text and images together, add custom metadata filtering, and return page-level citations. RAG builders can use it for tighter retrieval control and more auditable answers.
Moondream shipped Photon 1.2.0, expanding its inference engine to Apple Silicon, Windows CUDA, Blackwell, and Jetson Thor, then outlined how custom Metal kernels and fused ops made local vision practical without MLX. That broadens deployment options for edge and on-device vision workloads while keeping server-class latency on B200 systems.
DeepSeek briefly published a paper and threads on point-and-bbox reasoning, about 90 KV entries per 800² image, and RL-trained vision experts, then removed the repo and related mentions. The technique looked like a low-token path to computer use and multimodal reasoning in V4-Flash, but availability and reproducibility are now unclear.
Mistral shipped Medium 3.5 as a 128B dense model with 256K context, configurable reasoning, remote agents in Vibe, and Work Mode in Le Chat. The release broadens Mistral’s agent stack, though early comparisons question its price-performance against newer open rivals.
DeepSeek began rolling out Vision beta as a new image-understanding mode in Chat, and early testers reported fast OCR and strong object recognition. The rollout appears limited or staggered, so watch for broader access and formal docs before relying on it.
Nous added a built-in ComfyUI skill to Hermes Agent, letting the agent install, launch, and run Comfy workflows on demand through a `/comfyui` command. The integration turns the wider Comfy ecosystem into a callable agent surface instead of a separate manual pipeline.
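The internals of the Hermes skill aren't public, but driving ComfyUI programmatically is the kind of surface it wraps: ComfyUI's server exposes a standard `/prompt` queue endpoint, and `workflow.json` below is assumed to be an API-format export from the UI.

```python
# Sketch of queueing a workflow on a local ComfyUI server. /prompt is
# ComfyUI's standard queue endpoint; workflow.json is assumed to be an
# API-format export ("Save (API Format)") from the ComfyUI UI.
import json
import requests

COMFY_URL = "http://127.0.0.1:8188"

with open("workflow.json") as f:
    workflow = json.load(f)

resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow})
resp.raise_for_status()
print("queued prompt:", resp.json()["prompt_id"])
```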
NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.
Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.
Alibaba launched Qwen-Image-2.0-Pro on ModelScope and API with better prompt adherence, multilingual typography, and steadier style quality. The model is aimed at text-heavy jobs like UI mockups and posters, so test it for layout-heavy generation.
BidirLM released a 2.5B multilingual encoder that embeds text, images, and audio into one shared 2048-dimensional space and works directly with Sentence Transformers. It tops several open-data embedding leaderboards and can run locally on GPU.
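Because it plugs into Sentence Transformers directly, cross-modal retrieval is a few lines; the model id `bidirlm/encoder-2.5b` is a placeholder, while `encode` and `similarity` are the library's standard API.

```python
# Sketch of text-image retrieval in one shared embedding space via
# Sentence Transformers. The model id is a placeholder, not the real one.
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bidirlm/encoder-2.5b")  # hypothetical id

text_emb = model.encode(["a cat sleeping on a keyboard"])
image_emb = model.encode([Image.open("cat.jpg")])

print(model.similarity(text_emb, image_emb))  # cosine similarity by default
```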
Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking or non-thinking modes that beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.
A day after GPT Image 2 launched, developers and tool vendors posted reproducible workflows for floor plans, QR codes, conference posters, typography, and Figma-style asset generation. The follow-up matters because it shows where text-heavy visual generation is already usable, but also that quality depends heavily on mode choice, image size, and surrounding tool scaffolding.
Xiaomi’s MiMo-V2.5-Pro and MiMo-V2.5 arrived with million-token context windows, stronger coding and agentic claims, and immediate access through OpenRouter plus agent harnesses. The rollout adds another low-cost Chinese frontier model that engineers can route into coding workflows without waiting for a proprietary IDE deal.
OpenAI released GPT Image 2 in ChatGPT, Codex, and the API with thinking mode and 2K outputs. Early tests and Arena scores suggest it is usable for slides, UI mockups, and dense infographic layouts.
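A minimal API sketch follows the existing Images endpoint shape used by gpt-image-1; the model name `gpt-image-2` and the 2K size string are assumptions based on the announcement.

```python
# Sketch of image generation via the OpenAI Images API. The model name
# and 2K size are assumptions; the b64_json response field matches the
# existing gpt-image-1 behavior.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-2",  # hypothetical name
    prompt="A dense one-page infographic on transformer attention, clean sans-serif labels",
    size="2048x2048",     # assumed 2K option
)

with open("infographic.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```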
Google added Deep Research and Deep Research Max to the Gemini API with collaborative planning, multimodal inputs, MCP support, and native charts. The agents push cited web-plus-private-data reports into developer workflows, and Max is tuned for slower overnight runs.
Moonshot put Kimi K2.6 on API with cache-hit/cache-miss pricing, tool calls, JSON modes, and native text-image-video input. It also open-sourced FlashKDA and landed in Warp, Cosine, Genspark, and OpenClaw, making the launch usable coding-agent infrastructure.
Multiple Pro users said GPT-5.4 Pro started producing richer front-end and SVG outputs with much faster runtimes, despite no formal OpenAI announcement. The reports matter because they affect whether long visual and code-generation tasks are practical inside ChatGPT.
A weekend of Gemma 4 demos spanned YC hackathon projects, offline iPhone runs, and HN reports of strong local coding and SQL-agent performance. Gemma 4 is increasingly showing up as a practical edge model for tool use and multimodal apps, not just a release benchmark.
Anthropic launched Claude Design in research preview, turning prompts, files, and codebase context into prototypes, slides, and one-pagers. It can infer a team design system and export to Canva, PDF, or PPTX, or hand off to Claude Code.
Tencent released HY-World 2.0, a multimodal world model that turns text, images, or video into editable 3D worlds, and open-sourced WorldMirror 2.0 inference code and weights. Its four-stage pipeline targets reusable scene assets rather than single-view video clips.
Claude Opus 4.7 is now generally available across Claude, the API, and major clouds with xhigh effort, higher-resolution vision, and Claude Code review upgrades. Prompt behavior, tokenization, and effort defaults changed enough that existing harnesses may need retuning.
Alibaba open-sourced Qwen3.6-35B-A3B, a 35B multimodal sparse MoE with only 3B active parameters under Apache 2.0. Same-day support from vLLM, Ollama, SGLang, and GGUF builders makes it immediately usable for local and production coding workloads.
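Given the same-day vLLM support, local evaluation is a short script; the repo id below mirrors Qwen's naming convention and is an assumption, while `LLM` and `SamplingParams` are vLLM's standard offline entry points.

```python
# Sketch of offline inference with vLLM. The repo id is hypothetical;
# swap in the actual Hugging Face id from the release.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.6-35B-A3B")  # hypothetical repo id
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    ["Write a Python function that merges two sorted lists."], params
)
print(outputs[0].outputs[0].text)
```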
Google released Gemini 3.1 Flash TTS with inline Audio Tags, multi-speaker control and 70+ languages, and opened preview access through the Gemini API and AI Studio with rollout to Vertex AI and Google Vids. Independent evals ranked it near the top of current speech leaderboards, but it runs slower and costs more than the leading system.
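A request sketch via the google-genai SDK follows the SDK's existing TTS shape (AUDIO response modality plus a speech config); the model name is an assumption, and the inline Audio Tags would go in the `contents` string.

```python
# Sketch of TTS through the google-genai SDK. The model name is
# hypothetical; the AUDIO modality and speech_config follow the SDK's
# existing TTS request pattern.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

resp = client.models.generate_content(
    model="gemini-3.1-flash-tts",  # hypothetical name
    contents="Say warmly: welcome back, let's pick up where we left off.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
audio_bytes = resp.candidates[0].content.parts[0].inline_data.data
open("welcome.pcm", "wb").write(audio_bytes)
```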
Google DeepMind shipped Gemini Robotics-ER 1.6 to the Gemini API and AI Studio with better visual-spatial reasoning, multi-view success detection, and gauge reading. The model's 93% instrument-reading score targets robots that need to reason over cluttered scenes and physical constraints.
Sentence Transformers v5.4 adds one encode API for text, image, audio, and video, plus multimodal reranking and a modular CrossEncoder stack. It also flattens Flash Attention 2 inputs for text workloads, reducing padding waste and VRAM use.
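The reranking path is worth a concrete look; the sketch below uses a known text reranker since the v5.4 multimodal reranker ids aren't listed here, and `CrossEncoder.rank` is the library's standard reranking helper.

```python
# Sketch of CrossEncoder reranking with a known text reranker checkpoint;
# the new multimodal reranker model ids are not shown here.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I reduce padding waste with flash attention?"
docs = [
    "Flash Attention 2 supports variable-length inputs without padding.",
    "Cosine schedulers decay the learning rate smoothly.",
    "Packing sequences removes pad tokens and cuts VRAM use.",
]
for hit in model.rank(query, docs, return_documents=True):
    print(f"{hit['score']:.3f}  {hit['text']}")
```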
Meta released Muse Spark, the first model from Meta Superintelligence Labs, with multimodal reasoning, tool use, and a parallel-agent Contemplating mode. Access stays limited to Meta AI and private API preview, so watch for broader availability before planning production use.
Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.
Google DeepMind released Gemma 4 in four open models with up to 256K context, multimodal inputs, and native tool-calling for local agent workflows. Day-0 support across serving stacks and benchmark wins make it ready for phones, laptops, and server GPUs.
Alibaba launched Qwen3.6-Plus with a 1M default context window, stronger coding and multimodal performance, and rollout across chat, API, and routing partners. Benchmarks and partner availability make it a new high-end option for agentic coding and web tasks.
Z.ai released GLM-5V-Turbo, a multimodal coding model for screenshots, video, design drafts, and GUI-agent tasks. It keeps text-coding performance steady while adding native vision support, so teams can test visual workflows without swapping models.
Google released Veo 3.1 Lite in Gemini API and AI Studio with 720p and 1080p output, 4-8 second clips, and text-to-video plus image-to-video support. Watch the April 7 Veo 3.1 Fast pricing drop if you need lower video generation costs.
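Video generation in the Gemini API runs as a long-running operation; the sketch below follows the SDK's existing generate-and-poll pattern, with the model name `veo-3.1-lite` as an assumption.

```python
# Sketch of text-to-video via google-genai's long-running operation flow.
# The model name is hypothetical; the polling pattern matches the SDK.
import time
from google import genai

client = genai.Client()

operation = client.models.generate_videos(
    model="veo-3.1-lite",  # hypothetical name
    prompt="A slow dolly shot over a foggy harbor at dawn",
)
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("harbor.mp4")
```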
Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction, so watch the understanding-focused limits if you need media generation.
Voxtral TTS uses separate semantic and acoustic token models, a 2.14 kbps codec, and 3-25 second reference audio for cloning across nine languages. Try it if you want a hybrid speech pipeline with more control and faster acoustic synthesis than all-autoregressive generation.
Sora's web and mobile access ends on Apr. 26, with API access ending on Sep. 24. Teams now have a fixed migration window, but bulk export still appears unavailable.
SAM 3.1 is a drop-in update that shares video computation across up to 16 tracked objects instead of rerunning most of the model per object. Meta's H100 numbers show roughly 30 FPS at 16 objects versus under 10 FPS for SAM 3, which cuts multi-object video tracking cost.
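SAM 3.1's exact interface isn't spelled out here, so the sketch below assumes a SAM 2-style video predictor API carries over, with placeholder config and checkpoint paths; the point of the change is that one propagation pass serves every registered object id.

```python
# Sketch assuming a SAM 2-style video predictor API; config/checkpoint
# paths are placeholders. One propagate pass yields masks for all objects,
# which is where the shared-computation speedup lands.
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam3.1_hiera_l.yaml", "sam3.1.pt")  # placeholders

boxes = [(10, 20, 120, 200), (300, 40, 420, 260)]  # placeholder detections, up to 16

with torch.inference_mode():
    # init_state takes a frame directory (or video, depending on version)
    state = predictor.init_state(video_path="frames_dir/")
    for obj_id, box in enumerate(boxes):
        predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)
    # Shared computation: a single pass returns logits for every object id.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()
```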
Google launched Gemini 3.1 Flash Live in AI Studio, the API, and Gemini Live with stronger audio tool use, lower latency, and 128K context. Voice-agent teams should benchmark quality, latency, and thinking settings before switching.
Lyria 3 Pro and Lyria 3 Clip are now in Gemini API and AI Studio, with Lyria 3 Pro priced at $0.08 per song and able to structure tracks into verses and choruses. That gives developers a clearer path to longer-form music features, with watermarking and prompt design built in.
MiniMax introduced a flat-rate Token Plan that covers text, speech, music, video, and image APIs under one subscription. It gives teams one predictable bill across modalities and can be used in third-party harnesses, not just MiniMax apps.
Skyler Miao said MiniMax M2.7 open weights are due in roughly two weeks, with updates tuned for agent tasks. Separate replies also confirm multimodal M3, so local-stack builders should watch both the drop and the benchmark setup.
Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.
Mistral Small 4 combines reasoning and non-reasoning modes in one 119B MoE, adds native image input, and expands context to 256K at $0.15/$0.60 per million input/output tokens. It improves sharply over Small 3.2, but still trails similarly sized open peers on several evals.
Google extended its OpenAI compatibility layer so existing OpenAI SDK code can call Veo 3.1 video generation and Gemini image models with only base URL and model changes. It lowers migration cost for teams that want multimodal fallbacks without rewriting client code.
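The base-URL swap looks like the sketch below, pointing the OpenAI SDK at Google's documented OpenAI-compatible endpoint; whether the specific Veo and image models are live on it is the announcement's claim to verify.

```python
# Sketch of the base-URL swap: unmodified OpenAI SDK code against
# Google's documented OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

resp = client.chat.completions.create(
    model="gemini-2.0-flash",  # any Gemini model exposed on the endpoint
    messages=[{"role": "user", "content": "One-line summary of this compatibility layer?"}],
)
print(resp.choices[0].message.content)
```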