Systems that combine text, image, audio, video, or UI inputs.
Mistral's Voxtral TTS splits speech into semantic and acoustic tokens, uses a low-bitrate codec, and claims a 68.4% win rate over ElevenLabs Flash v2.5 on voice cloning with about 3-25 seconds of reference audio. The architecture targets multilingual cloning and higher-quality speech without a fully autoregressive audio stack, so voice teams should compare it against current TTS pipelines.
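The blurb doesn't give Voxtral's codec parameters, but the "low-bitrate" claim follows standard residual-vector-quantization arithmetic, which this sketch computes with hypothetical frame-rate and codebook settings (not Mistral's actual numbers):

```python
import math

def codec_bitrate_kbps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a residual-vector-quantized audio codec: each frame emits
    one code per codebook, and each code costs log2(codebook_size) bits."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame / 1000.0

# Hypothetical settings: 12.5 Hz frames, 4 codebooks of 2048 entries
# -> 12.5 * 4 * 11 = 550 bits/s = 0.55 kbps
print(codec_bitrate_kbps(12.5, 4, 2048))
```

Swapping in the real frame rate and codebook count from the model card reproduces any published bitrate figure.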
Sora's web and mobile access ends on Apr. 26, with API access ending on Sep. 24. Teams now have a fixed migration window, but bulk export still appears unavailable.
SAM 3.1 is a drop-in update that shares video computation across up to 16 tracked objects instead of rerunning most of the model per object. Meta's H100 numbers show roughly 30 FPS at 16 objects versus under 10 FPS for SAM 3, which cuts multi-object video tracking cost.
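A toy cost model shows why sharing computation changes the FPS scaling; the millisecond constants below are invented to roughly match the reported figures (about 30 FPS at 16 objects versus under 10 for per-object reruns), not measured values:

```python
def fps_per_object_model(frame_ms: float, n_objects: int) -> float:
    # Old scheme: the full model runs once per tracked object,
    # so cost grows linearly with object count.
    return 1000.0 / (frame_ms * n_objects)

def fps_shared_backbone(backbone_ms: float, head_ms: float, n_objects: int) -> float:
    # New scheme: one shared backbone pass per frame, plus a
    # lightweight per-object head.
    return 1000.0 / (backbone_ms + head_ms * n_objects)

# Hypothetical per-frame costs: full model 7 ms; shared backbone 25 ms;
# per-object head 0.5 ms.
for n in (1, 4, 16):
    print(n, round(fps_per_object_model(7.0, n), 1),
          round(fps_shared_backbone(25.0, 0.5, n), 1))
```

The crossover favors the shared design exactly when many objects are tracked, which is the multi-object case the update targets.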
Google launched Gemini 3.1 Flash Live in AI Studio, the API, and Gemini Live with stronger audio tool use, lower latency, and 128K context. Voice-agent teams should benchmark quality, latency, and thinking settings before switching.
Lyria 3 Pro and Lyria 3 Clip are now available in the Gemini API and AI Studio, with Lyria 3 Pro priced at $0.08 per song and able to structure tracks into verses and choruses. That gives developers a clearer path to longer-form music features, with watermarking and prompt design built in.
MiniMax introduced a flat-rate Token Plan that covers text, speech, music, video, and image APIs under one subscription. It gives teams one predictable bill across modalities and can be used in third-party harnesses, not just MiniMax apps.
Skyler Miao said MiniMax M2.7 open weights are due in roughly two weeks, with updates tuned for agent tasks. Separate replies also confirm multimodal M3, so local-stack builders should watch both the drop and the benchmark setup.
Physical Intelligence says its RL token compresses VLA state into a lightweight signal that an on-robot actor-critic can adapt in minutes. This matters for last-millimeter manipulation, where full-size models are often too slow or too coarse to tune online.
Mistral Small 4 combines reasoning and non-reasoning modes in one 119B MoE, adds native image input, and expands context to 256K at $0.15/$0.60 per million input/output tokens. It improves sharply over Small 3.2, but still trails similarly sized open peers on several evals.
Google extended its OpenAI compatibility layer so existing OpenAI SDK code can call Veo 3.1 video generation and Gemini image models with only base URL and model changes. It lowers migration cost for teams that want multimodal fallbacks without rewriting client code.
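To make the "only base URL and model change" point concrete, here is a stdlib-only sketch that builds (but does not send) an OpenAI-style request against Google's documented OpenAI-compatible base URL; the model name is illustrative, so check the current model list before using it:

```python
import json
import urllib.request

# Google's documented OpenAI-compatible endpoint for the Gemini API.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai"

def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request pointed at the
    compatibility endpoint. Only the URL and model differ from a
    request aimed at OpenAI itself."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("YOUR_API_KEY", "gemini-3.1-flash", "Describe this scene.")
print(req.full_url)
```

With the official `openai` SDK the same switch is just constructing the client with `base_url=BASE_URL` and your Gemini API key; no other client code changes.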
Datalab-to open-sourced Chandra OCR 2, a 4B document model with repo, weights, demo, and CLI quickstart, and claims state-of-the-art 85.9 on olmOCR Bench. It gives document pipelines a practical multilingual OCR option that can run with local tooling instead of only hosted APIs.
H Company launched Holotron-12B, an open multimodal model for computer-use agents built on a hybrid SSM-attention stack that targets KV-cache bottlenecks. Benchmark it if you need high-concurrency browser agents and want better throughput without giving up web-task accuracy.
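The KV-cache bottleneck the hybrid stack targets is easy to quantify; this sketch uses a hypothetical 12B-class attention config (layer count, KV heads, and head size are assumptions, not Holotron's published numbers):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values (factor of 2), per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 40 layers, 8 KV heads, head_dim 128, fp16 cache.
per_session = kv_cache_bytes(40, 8, 128, seq_len=32_768)
print(per_session / 2**30, "GiB per 32K-token session")
```

The cache grows linearly with sequence length and session count, so at high concurrency it, not the weights, caps throughput; an SSM layer's state is constant in sequence length, which is the bottleneck a hybrid stack attacks.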
Dreamverse paired Hao AI Lab's FastVideo stack with an interface for editing video scenes in a faster-than-playback loop, using quantization and fused kernels to keep latency below viewing time. The stack is interesting if you are building real-time multimodal generation or multi-user video serving.
Mistral shipped Mistral Small 4, a 119B MoE model with 6.5B active parameters, multimodal input, configurable reasoning, and Apache 2.0 weights. Deploy it quickly in existing stacks if you use SGLang or vLLM, which added day-one support.
UT Austin researchers report that simple sequential fine-tuning with LoRA and on-policy RL can retain prior skills while learning new VLA tasks. Try this baseline before reaching for more complex continual-learning methods.
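The key property such a baseline leans on is that a LoRA adapter leaves the frozen base weights untouched and, with the standard zero-initialized up-projection, starts as an exact no-op. A minimal sketch (not the paper's code):

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x): base weights W stay frozen;
    only the low-rank factors A and B are trained per task."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2x2 base weight
A = [[0.5, -0.5]]              # rank-1 down-projection
B = [[0.0], [0.0]]             # zero-init up-projection (standard LoRA init)
x = [1.0, 1.0]

# With B at zero the adapter contributes nothing, so sequential
# fine-tuning starts each new task exactly at the previous policy.
print(lora_forward(W, A, B, x))   # equals matvec(W, x) = [3.0, 7.0]
```

Because each task only trains the small A/B factors, earlier behavior encoded in W is never overwritten directly, which is one plausible reason the simple sequential recipe retains prior skills.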
Markov AI released Computer Use Large on Hugging Face with 48,478 screen recordings spanning about 12,300 hours across six professional apps. Use it to train and evaluate GUI agents on real software workflows with a large CC-BY dataset.
FastVideo published an LTX-2.3 inference stack that claims to generate a 5-second 1080p audio-video clip from text and image input in 4.55 seconds on a single GPU. If the results hold up, test it for lower-cost interactive video generation and faster iteration loops.
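The claim amounts to a real-time factor below 1.0, the usual threshold for interactive generation; a one-liner makes the arithmetic explicit:

```python
def real_time_factor(gen_seconds: float, clip_seconds: float) -> float:
    """RTF < 1.0 means the clip generates faster than it plays back."""
    return gen_seconds / clip_seconds

print(real_time_factor(4.55, 5.0))  # ~0.91, i.e. faster than real time
```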
Vals published a benchmark pass for Grok 4.20 Beta showing gains on coding, math, multimodal, and Terminal Bench 2, alongside weaker legal-task results. Check task-level results before adopting it, especially if legal workflows matter more than headline benchmark gains.
Mixedbread introduced Wholembed v3 as a retrieval model for text, image, video, audio, and multilingual search. Benchmark it on fine-grained retrieval tasks if single-vector embeddings have been collapsing distinct items together in your pipeline.
Google put Gemini Embedding 2 into public preview, unifying text, images, video, audio, PDFs, and 100+ languages in one embedding space, with 3072, 1536, and 768 output dimensions. Use it to replace multi-model retrieval pipelines with one API for cross-modal RAG and search, but compare it with late-interaction systems before committing.
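The 3072/1536/768 output sizes come from the announcement; the usual way such flexible dimensions work is Matryoshka-style truncate-and-renormalize, sketched here on toy vectors (verify the exact recipe against Google's docs before relying on it):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize: the usual
    Matryoshka-style trade of quality for storage."""
    head = vec[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head]

def cosine(a, b):
    # Assumes unit-norm inputs, as produced by truncate_embedding.
    return sum(x * y for x, y in zip(a, b))

# Toy 4-d "full" embeddings standing in for 3072-d vectors.
a = truncate_embedding([0.5, 0.5, 0.5, 0.5], 2)
b = truncate_embedding([0.5, 0.5, -0.5, -0.5], 2)
print(cosine(a, b))  # identical in the kept dims -> similarity 1.0
```

Storing the 768-d prefix cuts index size 4x versus 3072-d vectors, at some retrieval-quality cost you should measure on your own corpus.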
OpenAI rolled out interactive visual explanations for more than 70 math and science concepts in ChatGPT. Try it for education products or internal learning workflows that benefit from manipulable models instead of static tutoring.