Google Unveils Multimodal AI for Deeper Understanding
11 Mar
Summary
- New AI model integrates text, images, video, and audio.
- Reduces latency by up to 70% for some enterprise tasks.
- Features Matryoshka Representation Learning for flexible data processing.

Google has introduced Gemini Embedding 2, a public-preview model designed to change how machines represent information. This advanced AI natively encodes text, images, video, and audio into a single shared vector representation (an embedding), a significant leap beyond previous text-centric embedding models. Early adopters report latency reductions of up to 70% on some enterprise tasks.
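As a rough illustration of what requesting such an embedding might look like, here is a minimal sketch using Google's google-genai Python SDK. The model name "gemini-embedding-2" is an assumption based on this announcement, and whether the preview accepts it through this endpoint is not confirmed; the embed_content call itself is the SDK's existing text-embedding interface.

```python
from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# Request an embedding for a text query. The model id below is an
# assumption based on the announcement; swap in the actual preview id.
result = client.models.embed_content(
    model="gemini-embedding-2",
    contents="Find the moment the presenter demos the new dashboard",
)

vector = result.embeddings[0].values  # a single list of floats
print(len(vector))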
The model's architecture supports cross-modal retrieval, meaning a search expressed in one medium can match content in another: a text query, for instance, can now find specific moments in a video. A notable feature, Matryoshka Representation Learning, trains the embedding so that it can be truncated to smaller dimensions, letting developers trade retrieval precision for storage economy.
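Matryoshka-style embeddings pack the coarsest information into the leading dimensions, so a prefix of the full vector is itself a usable, lower-resolution embedding. Below is a minimal sketch of the truncate-and-renormalize step; the dimension sizes (3072 full, 256 truncated) and the random stand-in vectors are chosen purely for illustration.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` coordinates and re-normalize to unit length."""
    prefix = vec[:dims]
    return prefix / np.linalg.norm(prefix)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for real model output: a document embedding and a query
# embedding that is a slightly noised near-duplicate of it.
rng = np.random.default_rng(0)
doc = rng.normal(size=3072)
query = doc + 0.1 * rng.normal(size=3072)

full = cosine(doc, query)
small = cosine(truncate_embedding(doc, 256), truncate_embedding(query, 256))
print(f"similarity at 3072 dims: {full:.3f}, at 256 dims: {small:.3f}")
```

The truncated vectors occupy a fraction of the storage while preserving most of the similarity signal, which is the trade-off the feature is designed around.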
This multimodal capability addresses a common enterprise pain point: audio, visual, and textual data scattered across silos can be unified into a single embedding space, forming a cohesive knowledge base. This in turn enables more advanced AI applications, such as improved retrieval-augmented generation (RAG) systems. The preview is available through Google's Gemini API and Vertex AI platform, with tiered pricing for developers and enterprises, including a separate rate for native audio processing.
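In a RAG pipeline, one unified index can then serve assets from every modality. The sketch below shows the retrieval step over such an index; the asset ids, the 768-dimension size, and the assumption that each item's embedding was precomputed by the model are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 768  # an assumed embedding size

# Hypothetical unified index: ids span a document page, a video moment,
# and an audio timestamp, all embedded into the same vector space.
index: dict[str, np.ndarray] = {
    "report.pdf#page3": rng.normal(size=DIM),
    "allhands.mp4#t=512s": rng.normal(size=DIM),
    "earnings-call.wav#t=1804s": rng.normal(size=DIM),
}

def top_k(query: np.ndarray, index: dict[str, np.ndarray], k: int = 2) -> list[str]:
    """Rank stored assets by cosine similarity to the query embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(index, key=lambda key: cosine(query, index[key]), reverse=True)[:k]

# A text query embedded into the same space (random stand-in here).
query_vec = rng.normal(size=DIM)
print(top_k(query_vec, index))
```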
