Multimodal
Stories, products, and related signals connected to this tag in Explore.
Stories
A sponsored explainer thread described Speech Engine as a WebSocket layer that adds speech-to-text, turn detection, interruption handling, and text-to-speech to existing LLM agents. The pitch is that teams can keep their current model stack and add voice without rebuilding the whole agent.
Posts report that SenseTime open-sourced SenseNova U1, a unified text-image model with interleaved generation, an 8-step distilled LoRA, and ComfyUI workflows. They cite generation times of around 15 seconds for 2K images, with H100 inference cut to about 2 seconds, so it is worth comparing against your current image pipeline.
Tencent released HY-World 2.0 with WorldMirror 2.0 code and weights for turning text, images, or video into persistent 3D scenes. The output includes navigable geometry and camera data instead of disposable video frames.