Google put Gemini Embedding 2 into public preview with one vector space for text, images, video, audio, and PDFs, plus 3072, 1536, and 768 output sizes. Use it to replace multi-model retrieval pipelines with one API for RAG and cross-media search.

Gemini Embedding 2 is a new public-preview embedding model that maps “5 modalities in a single unified embedding space,” according to Google's launch thread. The supported inputs are unusually broad for one endpoint: text, images, video, audio, and documents, with the API docs describing it as gemini-embedding-2-preview and positioning it alongside the older text-only gemini-embedding-001.
The implementation details matter for retrieval system design. Google's feature list says the model supports “up to 8,192 input tokens,” “up to 6 images,” “120s video,” and “audio natively, no transcription step needed,” plus PDFs up to 6 pages per request. That means cross-modal search no longer has to start with separate captioning, ASR, or document-only preprocessing for every asset type. A supporting summary from an engineer recap also highlights the output-size dial: 3,072, 1,536, or 768 dimensions via Matryoshka Representation Learning.
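Those per-request limits are worth encoding up front when batching mixed media. The sketch below is a hypothetical pre-flight check built only from the limits quoted above (8,192 tokens, 6 images, 120 s of video, 6 PDF pages); the function name and structure are illustrative and not part of any official SDK.

```python
# Hypothetical pre-flight validator for the per-request limits Google lists
# for Gemini Embedding 2. Limits come from the launch feature list; the
# helper itself is an assumption, not an official API.

LIMITS = {"text_tokens": 8192, "images": 6, "video_seconds": 120, "pdf_pages": 6}

def validate_request(text_tokens=0, images=0, video_seconds=0, pdf_pages=0):
    """Return a list of limit violations; an empty list means the request fits."""
    actual = {"text_tokens": text_tokens, "images": images,
              "video_seconds": video_seconds, "pdf_pages": pdf_pages}
    return [f"{key}={value} exceeds limit {LIMITS[key]}"
            for key, value in actual.items() if value > LIMITS[key]]
```

A batching layer can call this before each embed request and split oversized assets (e.g., a 10-page PDF into two 5-page chunks) instead of surfacing API errors.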
Google's model thread claims state-of-the-art performance across both unimodal and multimodal tasks, and the attached table is the most concrete evidence in the launch set. On text-text retrieval, Gemini Embedding 2 posts 69.9 on the multilingual MTEB mean task score versus 68.4 for Google's prior text model, and 84.0 on MTEB Code where the older model shows 76.0. On text-image and image-text retrieval, the table shows large jumps over Google's legacy multimodal model, including 89.6 vs. 74.0 on TextCaps text-image recall@1 and 97.4 vs. 88.1 on TextCaps image-text recall@1.
The table also shows stronger document, video, and speech retrieval coverage than Google's previous offerings. Gemini Embedding 2 reaches 64.9 on ViDoRe v2 text-document nDCG@10, 68.0 on MSR-VTT text-video nDCG@10, 52.5 on YouCook2, and 73.9 on MSEB speech-text mrr@10 in the published comparison. One caveat from that same chart: some competitor figures are marked unavailable or self-reported, and Voyage Multimodal 3.5 slightly edges Gemini on ViDoRe v2 at 65.5 versus 64.9.
The practical pitch from early adopters is pipeline simplification. In one practitioner writeup, the claim is “one API call now handles all your media,” replacing stacks that previously chained “audio to text, images to captions” before embedding. That is the clearest deployment implication here: a text query can retrieve non-text assets directly because the assets live in one shared vector space rather than separate modality-specific indexes.
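The single-index retrieval pattern described above can be sketched with a few lines of similarity search. The vectors here are tiny synthetic stand-ins; in a real pipeline each asset, regardless of modality, would be embedded once by the model and stored in the same index.

```python
# Sketch of cross-modal retrieval over one shared index. Because every
# modality maps into the same vector space, a single cosine-similarity
# search replaces per-modality indexes. Vectors are toy placeholders.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# One index holds every asset type: PDF, image, audio.
index = {
    "report.pdf":  [0.9, 0.1, 0.0],
    "diagram.png": [0.8, 0.2, 0.1],
    "meeting.mp3": [0.1, 0.9, 0.2],
}

def search(query_vec, k=2):
    """Rank all assets, of any modality, against one text-query embedding."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]
```

The point of the sketch is structural: there is one `index` dict and one `search` path, where the old stack needed captioning and ASR stages feeding separate text-only indexes.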
The first community experiments are already centered on search infrastructure rather than demos. A local prompting-tools test says integrating Gemini Embedding 2 to improve search was “as simple as asking OpenClaw,” while the LlamaParse post shows an “integrated solution” for parsing, embedding, and searching audio files, PDFs, PowerPoints, and videos in one knowledge base. Google's own documentation also ties the model to semantic search, classification, clustering, and RAG, which fits the launch narrative better than pure showcase content.
The storage-performance tradeoff is also more explicit than in many embedding launches. The launch thread and builder explanation both call out Matryoshka Representation Learning, where information is nested so developers can shrink vectors from 3,072 to 1,536 or 768 dimensions. That gives teams a concrete knob for index size, memory footprint, and retrieval quality without swapping to a different model family.
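Mechanically, the Matryoshka knob is simple: the leading coordinates of the full vector form a usable lower-dimensional embedding, so shrinking is a slice plus a re-normalization. The sketch below uses small toy dimensions in place of the real 3,072.

```python
# Matryoshka-style truncation sketch: keep the first `dim` coordinates of
# an MRL-trained embedding, then re-normalize so cosine similarity still
# behaves. Dimensions are shrunk here for illustration (real vectors would
# go e.g. 3072 -> 1536 or 768).
import math

def truncate(vec, dim):
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]   # stand-in for a 3,072-dim embedding
small = truncate(full, 4)               # stand-in for a 768-dim slice
```

Because truncation happens client-side, teams can store the full vectors once and experiment with index size offline, rather than re-embedding the corpus per dimension setting.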
Google now lets Gemini chain built-in tools like Search, Maps, File Search, and URL Context with custom functions inside a single API call. This removes orchestration glue for agent builders and brings Maps grounding into AI Studio for faster prototyping.
release: OpenClaw shipped version 2026.3.22 with ClawHub, OpenShell plus SSH sandboxes, side-question flows, and more search and model options, then followed with a 2026.3.23 patch. Teams get a broader plugin surface, but should patch quickly and review plugin trust boundaries as the ecosystem grows.
release: Cursor shipped Instant Grep, a local regex index built from n-grams, inverted indexes, and Bloom filters that drops large-repo searches from seconds to milliseconds. Faster candidate retrieval shortens the coding-agent loop, especially when ripgrep-style scans become the bottleneck.
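The core trick behind an Instant Grep-style index can be sketched with the inverted-index half alone: record which files contain each trigram, then intersect posting lists so only candidate files need a real regex scan. This is an illustrative sketch, not Cursor's implementation, and it omits the Bloom-filter layer the announcement mentions.

```python
# Trigram inverted index for candidate filtering: a query literal can only
# match files that contain every one of its trigrams, so intersecting the
# posting lists prunes the search space before any regex runs.
from collections import defaultdict

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(files):
    index = defaultdict(set)
    for name, text in files.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index

def candidates(index, literal):
    """Files containing every trigram of the query literal."""
    sets = [index.get(g, set()) for g in trigrams(literal)]
    return set.intersection(*sets) if sets else set()

files = {"a.py": "def handler(req):", "b.py": "class Parser:"}
idx = build_index(files)
```

On a large repo, the expensive per-file scan then runs only over `candidates(...)`, which is where the seconds-to-milliseconds gap comes from.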
breaking: ChatGPT now saves uploaded and generated files into an account-level Library that can be reused across conversations from the web sidebar or recent-files picker. It removes repetitive re-uploading and makes past PDFs, spreadsheets, and images part of a persistent working context.
breaking: Epoch AI says GPT-5.4 Pro elicited a publishable solution to one 2019 conjecture in its FrontierMath Open Problems set, with a formal writeup planned. Treat it as an early milestone worth reproducing, not blanket evidence that frontier models can already automate math research.
Say hello to Gemini Embedding 2, our new SOTA multimodal model that lets you bring text, images, video, audio, and docs into the same embedding space! 👀
Parse, embed, and search your audio files - or PDFs, or PowerPoints, or videos - in one integrated solution. @GoogleDeepMind released Gemini Embedding 2, an all-in-one model that unifies the embedding space between text/images/audio/video. We built a tutorial that shows you how.
🚀 The team at @GoogleDeepMind just released Gemini Embedding 2, a frontier embedding model with 3072 dimensions and state-of-the-art semantic quality. 👩‍💻 We built a demo showing how to integrate it across the LlamaIndex ecosystem, from LlamaParse to LlamaAgents:
Incorporating Gemini Embedding 2 to improve search on my local prompting tools, as simple as asking OpenClaw (with Gemini 3.1 Pro):
Gemini Embedding 2 is out. It’s a natively multimodal embedding model that maps text, images, video, audio and documents into a single embedding space: - text, up to 8192 input tokens - images, up to 6 images per request - videos, up to 120 seconds - audio, natively ingests and
What if one embedding model could understand text, images, video, audio, and PDFs all at once? Excited to share Gemini Embedding 2, our first fully multimodal embedding model. 🖼️ 5 modalities in a single unified embedding space 🌍 Supports up to 8,192 input tokens, 100+ languages