AI Primer

BidirLM-Omni-2.5B-Embedding launches 2048-dim text-image-audio vectors

BidirLM released a 2.5B multilingual encoder that embeds text, images, and audio into one shared 2048-dimensional space and works directly with Sentence Transformers. It tops several open-data embedding leaderboards and can run locally on GPU.

4 min read

TL;DR

You can jump straight to the model card, the longer release writeup, and the public MTEB leaderboard. The Sentence Transformers snippet is unusually clean for a multimodal release, and the launch chart puts text, image, and audio results on the same size-versus-score view instead of hiding each benchmark in a separate appendix.

Shared embedding space

The headline feature is simple: one bidirectional encoder emits cosine-comparable vectors for text, images, and audio in the same 2048-dimensional space, according to the launch thread and the specs post. That is the useful part for retrieval systems, because cross-modal search usually gets messier when each modality lives behind its own encoder family.
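Because every modality lands in the same 2048-dimensional space, cross-modal retrieval collapses into a single cosine ranking. A minimal sketch with NumPy, using random stand-in vectors in place of real model outputs:

```python
import numpy as np

def cosine_rank(query_vec, candidate_matrix):
    """Rank candidate rows by cosine similarity to a query in the shared space."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_matrix / np.linalg.norm(candidate_matrix, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores), scores

# Stand-ins for shared-space embeddings: a "text" query vector and rows
# that could equally have come from images or audio.
rng = np.random.default_rng(0)
query = rng.normal(size=2048)
candidates = rng.normal(size=(4, 2048))
candidates[2] = query + 0.01 * rng.normal(size=2048)  # near-duplicate of the query

order, scores = cosine_rank(query, candidates)
print(order[0])  # the near-duplicate ranks first
```

The point of the single space is exactly this: one ranking function works regardless of which encoder branch produced each row.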

The attached chart in tomaarsen's launch post positions the model against open-data and closed-data baselines across MTEB Multilingual V2, MIEB, and MAEB. The point is not just that it covers three modalities, but that the same 2.5B model shows up on all three plots.

Benchmark claims

The benchmark pitch breaks into three concrete claims, one per modality: leading open-data results on MTEB Multilingual V2 for text, MIEB for images, and MAEB for audio.

Those claims line up with the public MTEB leaderboard linked via antoine_chaffin's repost. The open-data framing matters here, because the launch chart explicitly compares against both open-data and closed-data systems.

One encode call

The most developer-friendly reveal is that the model plugs straight into Sentence Transformers. In the code example, the same SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True) instance encodes:

  • strings
  • PIL images
  • audio dictionaries with array and sampling_rate

That removes a lot of the usual ceremony around multimodal embeddings. Instead of separate wrappers or endpoint-specific payloads, the example shows one API surface with modality-specific input types.
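That single-surface pattern can be sketched as a thin wrapper. The model name and the accepted input types (plain strings, PIL images, audio dicts with `array` and `sampling_rate` keys) come from the post; the wrapper function itself is illustrative, not part of the release:

```python
def encode_mixed(model, items):
    """Embed a {name: input} dict through one encode surface.

    `model` is assumed to be a SentenceTransformer loaded with
    trust_remote_code=True; each value may be a str, a PIL.Image.Image,
    or an audio dict with "array" and "sampling_rate" keys.
    """
    return {name: model.encode(obj) for name, obj in items.items()}

# Hypothetical usage, mirroring the snippet described in the post:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding",
#                               trust_remote_code=True)
#   vecs = encode_mixed(model, {"caption": "a dog barking",
#                               "photo": pil_image,
#                               "clip": {"array": samples, "sampling_rate": 16000}})
```

Whether dict-shaped audio inputs dispatch cleanly through `encode` depends on the model's custom code, so treat the audio path as an assumption until verified against the model card.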

Specs and lineup

The published specs in tomaarsen's post are compact:

  • 2.5B parameters
  • local GPU inference
  • 2048-dimensional shared embeddings
  • 119 supported languages, with 87 reinforced through contrastive training
  • 1k token maximum sequence length
  • Qwen3-based architecture
  • Apache 2.0 license

The Omni model also sits inside a broader BidirLM-Embedding family. tomaarsen's lineup post lists text-only variants at 270M, 0.6B, 1B, and 1.7B parameters, collected on the Hugging Face collection page.

Local deployment notes

There is one extra deployment detail tucked outside the main thread: tomaarsen's MPS note points to faster safetensors loading on Apple MPS devices. That is not a benchmark claim, but it is a concrete sign the release is being shaped for local use rather than just leaderboard screenshots.
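For local runs, device selection is the main knob. A small helper, an illustrative pattern rather than anything shipped with the release, that prefers Apple MPS when available and degrades gracefully when PyTorch is absent:

```python
def pick_device():
    """Choose the best local device: Apple MPS, then CUDA, then CPU."""
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"
        if torch.cuda.is_available():
            return "cuda"
    except (ImportError, AttributeError):
        pass
    return "cpu"

# Hypothetical usage with the released checkpoint:
#   model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding",
#                               trust_remote_code=True, device=pick_device())
```

Sentence Transformers accepts a `device` argument at construction, so the same script can target MPS on a Mac and CUDA on a Linux box without edits.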

For the canonical package pages, tomaarsen's follow-up links the full release writeup, while another follow-up points to the model card.
