
Google releases Gemini Embedding 2 preview with one vector space for text, image, video, audio, and PDFs

Google launched Gemini Embedding 2 in preview, unifying multiple modalities and 100+ languages in one embedding space with flexible output dimensions. Try it to simplify cross-modal RAG and search pipelines, but compare it with late-interaction systems before committing.


TL;DR

  • Google's launch summary says Gemini Embedding 2 is now in public preview as its first natively multimodal embedding model, putting text, images, video, audio, and documents into one semantic space.
  • According to Weaviate's overview, the model supports 100+ languages, up to 8,192 input tokens, and output dimensions from 128 to 3,072, with availability through the Gemini API and Vertex AI.
  • The practical change is simpler cross-modal retrieval: Weaviate's integration post says one model can power search and RAG across mixed media instead of separate embedding pipelines.
  • Practitioner reaction is already centering on enterprise context layers and agent memory, where a retrieval thread argues multimodal embeddings make non-text artifacts like meeting audio, images, and PDF pages retrievable.

What shipped

In its launch post, Google positioned Gemini Embedding 2 as a single embedding model for text, images, video, audio, and PDFs, available in preview through the Gemini API and Vertex AI. Weaviate's integration post adds the implementation details engineers will care about: support for 100+ languages, an 8,192-token maximum input, and configurable output sizes from 128 to 3,072 dimensions.
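
For teams sizing an index against that 128 to 3,072 range, the output dimension is a per-request knob. A minimal sketch with the google-genai Python SDK, assuming the preview model id named in the post (gemini-embedding-2-preview); the exact id and supported dimensions may change before general availability:

```python
# Sketch: requesting an embedding at a chosen output size via the google-genai SDK.
# The model id is the preview name mentioned in the post and is an assumption here.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

result = client.models.embed_content(
    model="gemini-embedding-2-preview",            # assumed preview model id
    contents=["quarterly revenue chart, Q3 2024"],
    config=types.EmbedContentConfig(output_dimensionality=768),  # pick within 128-3,072
)

vector = result.embeddings[0].values
print(len(vector))  # 768
```

Smaller outputs generally trade some retrieval quality for cheaper vector storage and faster search, which matters once the same index holds embeddings for video frames and PDF pages alongside text.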

The ecosystem angle matters because this is not just a model card drop. Weaviate said the model already works with its existing Google integration, and the attached configuration example shows multi2vec_google_gemini targeting gemini-embedding-2-preview for multimodal collections. That makes the launch immediately relevant to teams already running vector search infrastructure rather than treating it as a future-only capability.
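
A hedged reconstruction of what that collection configuration could look like with the Weaviate Python client; multi2vec_google_gemini and gemini-embedding-2-preview are the names used in the post, while the helper's exact parameters below are assumptions:

```python
# Sketch: a Weaviate collection wired to the Gemini embedding integration.
# The vectorizer helper name and model id come from the post; the parameter
# names (text_fields, image_fields) mirror Weaviate's other multi2vec modules
# and are assumptions here.
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

client.collections.create(
    name="MixedMedia",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="image", data_type=DataType.BLOB),
    ],
    vectorizer_config=Configure.Vectorizer.multi2vec_google_gemini(  # assumed helper
        model="gemini-embedding-2-preview",
        text_fields=["title"],
        image_fields=["image"],
    ),
)
client.close()
```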

What changes for retrieval stacks

The main engineering claim is pipeline reduction. Google's demo post frames the model as search "across all your media at once," which means cross-modal lookup without separate text, image, audio, and video embedding stages. In practice, that should simplify multimodal RAG and recommendation systems that need to retrieve a concept from one format and return matches from another, a use case also echoed in the OpenClaw note about semantically storing images, videos, audio, and docs for agents.
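
The pipeline reduction is easiest to see at lookup time. A toy sketch of cross-modal retrieval under the single-space assumption: random placeholder vectors stand in for embeddings that would all come from the same preview model, so one text query can rank audio, image, and PDF items together.

```python
# Sketch: cross-modal lookup in one vector space. Assumes items of any modality
# (audio clip, slide image, PDF page) were embedded with the same model, so a
# single text-query vector ranks them all by cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend index: ids of mixed-media items and their embeddings (placeholders here).
index = {
    "meeting_2024-06-12.mp3": np.random.rand(768),
    "roadmap_slide_04.png":   np.random.rand(768),
    "contract_page_17.pdf":   np.random.rand(768),
}

query_vec = np.random.rand(768)  # in practice: embed the text query with the same model

ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
for item_id, vec in ranked:
    print(item_id, round(cosine(query_vec, vec), 3))
```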

That does not eliminate the retrieval design tradeoff. In the practitioner thread, Jo Kristian Bergum argues there is still "no single silver bullet" for agent retrieval and says embeddings matter because much of the context engineers want to feed agents "isn't represented in text." His examples—meeting notes, audio, images, and PDF-page images—line up with the exact artifact mix Gemini Embedding 2 targets. The likely near-term use is not replacing every retrieval stack, but expanding what can enter the same retrievable context layer with fewer modality-specific workarounds.
