BidirLM-Omni-2.5B-Embedding launches 2048-dim text-image-audio vectors
BidirLM released a 2.5B multilingual encoder that embeds text, images, and audio into one shared 2048-dimensional space and works directly with Sentence Transformers. It tops several open-data embedding leaderboards and can run locally on GPU.

TL;DR
- tomaarsen's launch thread introduced BidirLM-Omni-2.5B-Embedding as a single encoder for text, images, and audio, with all three modalities mapped into one shared 2048-dimensional space.
- According to tomaarsen's benchmark summary, the model ranks as the top open-data system on MTEB Multilingual V2, the top model at any size on MIEB, and the top sub-7B model on MAEB.
- tomaarsen's specs post says the model has 2.5B parameters, supports 119 languages, uses a 1k token max sequence length, is built on Qwen3, and ships under Apache 2.0.
- tomaarsen's Sentence Transformers example shows the practical bit: one `model.encode()` interface handles strings, PIL images, and audio arrays out of the box.
- The linked Hugging Face model card, release writeup, and MTEB leaderboard make this feel more like a real retrieval component than a research teaser, alongside the launch specs.
You can jump straight to the model card, the longer release writeup, and the public MTEB leaderboard. The Sentence Transformers snippet is unusually clean for a multimodal release, and the launch chart puts text, image, and audio results on the same size-versus-score view instead of hiding each benchmark in a separate appendix.
Shared embedding space
The headline feature is simple: one bidirectional encoder emits cosine-comparable vectors for text, images, and audio in the same 2048-dimensional space, according to the launch thread and the specs post. That is the useful part for retrieval systems, because cross-modal search usually gets messier when each modality lives behind its own encoder family.
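In practice, cosine comparability in one shared space means retrieval reduces to a single dot product over normalized vectors, regardless of which modality produced them. A minimal NumPy sketch of that ranking step, using random placeholder vectors in place of real model outputs:

```python
import numpy as np

DIM = 2048  # shared embedding dimension from the specs post
rng = np.random.default_rng(0)

# Placeholder embeddings standing in for model outputs:
# one text query vector and a mixed text/image/audio corpus.
query = rng.normal(size=DIM)
corpus = rng.normal(size=(3, DIM))  # e.g. [text_doc, image_doc, audio_doc]

def normalize(x):
    """L2-normalize so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_n = normalize(query)
corpus_n = normalize(corpus)

# One matrix-vector product ranks all three modalities together.
scores = corpus_n @ query_n
best = int(np.argmax(scores))
```

The design win is that the index side never needs to know which encoder produced a vector; that is exactly what separate per-modality encoder families make hard.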
The attached chart in tomaarsen's launch post positions the model against open-data and closed-data baselines across MTEB Multilingual V2, MIEB, and MAEB. The point is not just that it covers three modalities, but that the same 2.5B model shows up on all three plots.
Benchmark claims
The benchmark pitch breaks into three concrete claims:
- Text: #1 open-data model on MTEB Multilingual V2, #15 overall, per tomaarsen's summary.
- Image: #1 model at any size on MIEB, per the same summary.
- Audio: #1 sub-7B model on MAEB, and #2 overall, per the same summary.
Those claims line up with the public MTEB leaderboard linked via antoine_chaffin's repost. The open-data framing matters here, because the launch chart explicitly compares against both open-data and closed-data systems.
One encode call
The most developer-friendly reveal is that the model plugs straight into Sentence Transformers. In the code example, the same `SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)` instance encodes:
- strings
- PIL images
- audio dictionaries with `array` and `sampling_rate` keys
That removes a lot of the usual ceremony around multimodal embeddings. Instead of separate wrappers or endpoint-specific payloads, the example shows one API surface with modality-specific input types.
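That single-surface design can be sketched as plain dispatch on input type. The following is an illustration of the idea only, not the library's actual implementation; the `modality_of` helper and its rules (PIL images stood in for by H×W×C pixel arrays) are hypothetical:

```python
import numpy as np

def modality_of(item):
    """Classify an input the way a multimodal encode() surface might:
    plain strings are text, dicts carrying 'array' and 'sampling_rate'
    are audio, and H x W x C pixel data is an image."""
    if isinstance(item, str):
        return "text"
    if isinstance(item, dict) and {"array", "sampling_rate"} <= item.keys():
        return "audio"
    if isinstance(item, np.ndarray) and item.ndim == 3:
        return "image"
    raise TypeError(f"unsupported input type: {type(item)!r}")

# Inputs mirroring the three kinds shown in the launch example.
inputs = [
    "a dog playing in the snow",                         # text
    np.zeros((224, 224, 3), dtype=np.uint8),             # image pixels
    {"array": np.zeros(16000), "sampling_rate": 16000},  # 1s of audio
]

modalities = [modality_of(x) for x in inputs]
```

The point of the sketch is that callers hand over heterogeneous Python objects and the routing happens behind one method, which is what makes the Sentence Transformers snippet so compact.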
Specs and lineup
The published specs in tomaarsen's post are compact:
- 2.5B parameters
- local GPU inference
- 2048-dimensional shared embeddings
- 119 supported languages, with 87 reinforced through contrastive training
- 1k token maximum sequence length
- Qwen3-based architecture
- Apache 2.0 license
The Omni model also sits inside a broader BidirLM-Embedding family. tomaarsen's lineup post lists text-only variants at 270M, 0.6B, 1B, and 1.7B parameters, collected on the Hugging Face collection page.
Local deployment notes
There is one extra deployment detail tucked outside the main thread: tomaarsen's MPS note points to faster safetensors loading on Apple MPS devices. That is not a benchmark claim, but it is a concrete sign the release is being shaped for local use rather than just leaderboard screenshots.
For the canonical package pages, tomaarsen's follow-up links the full release writeup, while another follow-up points to the model card.