How to Build a Multimodal AI Knowledge Base With Gemini Embedding 2

Build a multimodal RAG with Gemini Embedding 2: search text, images, PDFs, video, and audio in one shared vector space. The open-source AI explained.

Last updated: Jul 8, 2026 · 11 mins read

Search visual data with Gemini Embedding 2

What if uploading one image could instantly surface every related report, spec sheet, and past version across your whole company? That is the thing I built, and in this post I will show you how. It is a multimodal RAG with Gemini Embedding 2 at its core: a knowledge base that stores text, images, PDFs, video, and audio in one place and lets you search across all of it. You can type a question and get back the right image. You can upload an image and get back the matching document. You can chat with your files and actually see the pages they came from.

The whole thing runs locally with a modern React interface and ships as a single Docker image. The code is open source. Let me walk you through why this works now when it did not before, and how the pieces fit together.

Library page of the knowledge base with the embedded files

Why Text-Only Knowledge Bases Are Blind

Most knowledge bases only understand text. That is the problem.

The standard setup is called RAG, short for retrieval-augmented generation. In plain terms: you store your documents as numbers (embeddings), and when someone asks a question, the system finds the closest matching chunks and feeds them to an AI model to write an answer. It works great for text. It falls apart the moment your data is visual.

Think about what that means in practice. You have a PDF with a chart that holds the actual number someone needs. Text-only RAG reads the prose around the chart and misses the chart itself. You have designs in Figma, specs in Notion, and reports in Drive. You cannot put them in one searchable place because half of them are pictures. Diagrams, screenshots, scanned contracts, product photos: all invisible.

This is not a small gap. Industry estimates put 80 to 90 percent of enterprise data in unstructured, multimodal formats, while roughly 80 percent of RAG systems still only handle text [1]. So most companies are searching a thin slice of what they actually own.

The Old Workaround: Two Models, Two Headaches

Before, the way around this was to glue two models together. You used something like CLIP to handle images and a separate text embedding model for your documents. CLIP pairs a vision encoder with a text encoder and aligns them through contrastive training, so matching images and captions land close together.

The catch is that you end up with two different vector spaces. Image vectors live in one space, document text vectors in another. To search across both, you write a pile of custom logic to combine and reconcile the results. It is fragile, and image retrieval often comes back worse than text retrieval anyway. More on why that happens later.

What Is Multimodal RAG?

Multimodal RAG is retrieval-augmented generation that searches over text, images, PDFs, video, and audio together, instead of text alone. One model turns every file type into numbers in the same shared space, so a single query can pull the most relevant result no matter what format it lives in.

The difference from regular RAG comes down to three things:

Data types. Regular RAG indexes extracted text. Multimodal RAG indexes the visual and audio content directly, so charts, diagrams, and screenshots stay searchable.
Embedding space. Regular RAG has one text space. The old multimodal hacks had two spaces and merge code. Modern multimodal RAG uses one shared space for everything.
What the answer is grounded in. Regular RAG cites paragraphs. Multimodal RAG can cite the exact image or PDF page, and show it to you.

What a Shared Embedding Space Actually Means

An embedding space is just a map where similar things sit close together. A shared embedding space means text and images and video all get placed on the same map.

That is the whole trick. When a photo of a chip and the words "photo with a chip" land near each other on the same map, you can search one with the other. No translation layer. No merge logic. They already speak the same language from the start.

What Is Gemini Embedding 2?

Gemini Embedding 2 is Google's first natively multimodal embedding model. One model maps text, images, video, audio, and documents into a single shared space across more than 100 languages [2]. It launched in public preview on March 10, 2026, and later reached general availability through the Gemini API and Vertex AI.

"Natively multimodal" is the key phrase. It is built on the Gemini foundation model, not bolted together from a vision encoder and a text encoder like CLIP. The cross-modal understanding is baked in end to end. That is why it skips the two-space problem entirely.

One thing to keep straight: do not confuse it with the older gemini-embedding-001, which is text only, or with EmbeddingGemma, which is a small open text-only model for on-device use. Despite the similar names, only Gemini Embedding 2 handles images, PDFs, video, and audio. If you pick the wrong one, none of this works.

It is also not the only player. Cohere Embed 4 and Voyage multimodal-3 are native multimodal embedders too, and there is a separate school of thought (ColPali, ColQwen2) that treats whole pages as images. Gemini Embedding 2 is a strong option, not the only one. I picked it because the quality is excellent and the API is simple.

Modalities, Limits, and Cost

Here are the per-request limits you actually need to know when building [2]:

Text: up to 8,192 tokens
Images: up to 6 images
Video: up to 120 seconds
Audio: up to 180 seconds
PDFs: up to 6 pages
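
To make those limits concrete, here is a minimal sketch that works out how many embedding requests a file would need. The numbers are copied from the list above; the `LIMITS` table and the `requests_needed` helper are hypothetical names for illustration, not part of any Google SDK.

```python
# Per-request limits as quoted in the list above. Check the current
# Gemini API documentation before relying on them.
LIMITS = {
    "text_tokens": 8192,
    "images": 6,
    "video_seconds": 120,
    "audio_seconds": 180,
    "pdf_pages": 6,
}


def requests_needed(kind: str, amount: int) -> int:
    """Return how many embedding requests `amount` units of `kind` require."""
    if kind not in LIMITS:
        raise ValueError(f"unknown kind: {kind}")
    if amount <= 0:
        return 0
    # Ceiling division without floats.
    return -(-amount // LIMITS[kind])


print(requests_needed("pdf_pages", 200))      # 34 calls at 6 pages each
print(requests_needed("video_seconds", 300))  # 3 calls for a 5-minute clip
```

This is the lower bound when you pack each request to the cap. Splitting finer, such as one PDF page per request, costs more calls but gives more precise hits.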

Pricing is metered per modality, per million tokens: about $0.20 for text, $0.45 for images, $6.50 for audio, and $12.00 for video, with a 50 percent discount on the Batch API [3]. Treat those as point-in-time numbers and check Google's pricing page before you scale. Video and audio are the expensive ones, which matters for how you chunk them.
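
The arithmetic is easy to script. The sketch below uses the prices quoted above as constants; the token counts in `workload` are made up for illustration, and `estimate_cost` is a hypothetical helper, not an official calculator.

```python
# USD per million tokens, copied from the paragraph above (point-in-time).
PRICE_PER_MILLION = {"text": 0.20, "image": 0.45, "audio": 6.50, "video": 12.00}
BATCH_DISCOUNT = 0.5  # 50 percent off on the Batch API, per the same source


def estimate_cost(tokens_by_modality: dict[str, int], batch: bool = False) -> float:
    """Estimate embedding cost in USD for a token count per modality."""
    total = 0.0
    for modality, tokens in tokens_by_modality.items():
        total += tokens / 1_000_000 * PRICE_PER_MILLION[modality]
    if batch:
        total *= 1 - BATCH_DISCOUNT
    return round(total, 4)


# Hypothetical workload: 5M text tokens, 2M image tokens, 1M video tokens.
workload = {"text": 5_000_000, "image": 2_000_000, "video": 1_000_000}
print(estimate_cost(workload))
print(estimate_cost(workload, batch=True))
```

In this made-up workload the single million video tokens account for most of the bill, which is why the chunking choices for video deserve the most thought.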

Matryoshka: Why You Can Shrink the Vector and Keep the Quality

By default the model outputs a 3,072-dimensional vector. That is a long list of numbers per item, and storing millions of them adds up.

Here is the clever part. Google trained it with Matryoshka Representation Learning, or MRL. The name comes from the nesting dolls. The most important information is packed into the front of the vector. So you can chop the vector down, say to 768 dimensions, and still get excellent retrieval quality [4]. Google recommends 3,072, 1,536, and 768 as the solid tiers.

I use 768 in this project. You get strong semantic understanding while keeping the vectors small, which means faster searches and lower storage costs in your vector database. The original MRL research reported up to 14 times smaller embeddings at the same accuracy on ImageNet-1K classification, and up to 14 times faster large-scale retrieval [4]. That is a real win for almost no downside.
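
As a rough illustration of what shrinking the vector involves, the sketch below truncates a vector and rescales it to unit length, then compares raw storage at two sizes. It assumes you truncate client-side and store float32 values; the API may also be able to return the shorter vector directly, so treat `truncate_embedding` and `storage_bytes` as hypothetical helpers.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage, ignoring index and metadata overhead."""
    return num_vectors * dims * bytes_per_value


full = [0.5, 0.5, 0.5, 0.5]  # toy stand-in for a 3,072-dim vector
short = truncate_embedding(full, 2)
print(short)

# One million vectors: 3,072 dims versus 768 dims.
print(storage_bytes(1_000_000, 3072) / storage_bytes(1_000_000, 768))  # 4.0
```

Rescaling matters if your database compares vectors with a plain dot product, because a truncated vector is no longer unit length.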

What Is Cross-Modal Retrieval?

Cross-modal retrieval means searching one type of content with a different type. You search with text and get back images. You search with an image and get back text. Because everything sits in the same shared space, the system just looks for the nearest neighbors regardless of format.

How to Search Images With Text

The flow is simple:

Embed the query text with Gemini Embedding 2.
Compare that vector against the image vectors already in your database.
Return the closest matches, ranked by similarity.
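
The compare-and-rank steps above can be sketched with plain cosine similarity. The three-dimensional vectors here are invented stand-ins; in a real system both the query and the images would be embedded by the model, and the vector database would do this search for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], items: dict[str, list[float]], k: int = 2):
    """Return the k item IDs closest to the query, best match first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dim image vectors; real ones would have 768 or more dims.
image_vectors = {
    "chip.jpg": [0.9, 0.1, 0.0],
    "motherboard.jpg": [0.6, 0.7, 0.1],
    "team_photo.jpg": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stand-in for the embedded text "photo with a chip"

for item_id, score in top_k(query_vector, image_vectors):
    print(item_id, round(score, 3))
```

Searching by image uses exactly the same function. Only the source of `query_vector` changes.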

In my demo I typed "photo with a chip" and the top result was exactly that, a photo of a chip. Then "photo with a motherboard" pulled up the motherboard shot. No tags, no captions, no manual metadata. The model understood the picture when it was ingested.

Search results after searching for "photo with a chip"

How to Search by Uploading an Image

Same idea, reversed. You embed the uploaded image instead of text, then find the nearest neighbors. I took a screenshot of a graphic and the system pulled up the exact pages in a 200-plus-page PDF where that graphic appears. It found the graphic on several pages, because the content matched on all of them.

That is the moment the whole approach clicks. One image in, every related document out.

Searching by image

Storing Everything in One Vector Space With ChromaDB

For the vector database I used ChromaDB. It is open source, easy to start with, and you can install it with one command. It keeps things in a single directory so there is no heavy infrastructure to stand up while you are building [5].

The pattern is straightforward: write a custom embedding function that calls Gemini Embedding 2, then store the resulting vectors in a Chroma collection along with metadata like the file name and, for PDFs, the page number.

import chromadb

# Persist the collection to a local directory so it survives restarts.
client = chromadb.PersistentClient(path="./kb")
collection = client.get_or_create_collection("knowledge_base")

# embed_with_gemini() is the project's own helper (not shown here): it calls
# Gemini Embedding 2 and truncates to 768 dims. page_image is the page to embed.
collection.add(
    ids=["offers_book_p42"],
    embeddings=[embed_with_gemini(page_image, dimensions=768)],
    metadatas=[{"source": "100m_offers.pdf", "page": 42, "type": "pdf_page"}],
)

Does ChromaDB Store the Actual Files? No, and That Matters

ChromaDB stores the embeddings and metadata, not your original images and PDFs. The raw files live wherever you keep them, like object storage or a local folder, and you point to them with a path or URL in the metadata. So when a search returns a hit, you use that metadata to fetch and show the real file. Plan your storage with that split in mind.
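
A minimal sketch of that lookup step, assuming the originals sit in a local folder and the metadata carries the file name under `source` as in the snippet above. `FILES_ROOT` and `resolve_hit` are hypothetical names; with object storage you would build a URL instead of a path.

```python
from pathlib import Path

FILES_ROOT = Path("./files")  # hypothetical folder holding the original uploads


def resolve_hit(metadata: dict) -> dict:
    """Turn a search hit's metadata into what the UI needs to show the source."""
    path = FILES_ROOT / metadata["source"]
    result = {"path": path.as_posix(), "type": metadata.get("type", "unknown")}
    if "page" in metadata:
        # PDF hits carry the page number so the viewer can jump straight to it.
        result["page"] = metadata["page"]
    return result


hit = {"source": "100m_offers.pdf", "page": 42, "type": "pdf_page"}
print(resolve_hit(hit))
```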

How to Chunk PDFs, Video, and Audio

The per-request limits decide your chunking strategy.

For PDFs, the cap is 6 pages at once, so larger files have to be split. I split them page by page. It costs a few more embedding calls, but the payoff is big: I can retrieve the exact page that matches a query and I always know which page it came from. That beats retrieving a giant blob and making the AI hunt through it.
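
To show what page-by-page splitting means for the index, the sketch below builds one ID and one metadata record per page. It only plans the records; rendering and embedding each page is left out, and the ID scheme is an assumption made for illustration.

```python
def page_records(source: str, num_pages: int) -> list[dict]:
    """Build one ID and metadata record per PDF page (pages are 1-indexed)."""
    stem = source.rsplit(".", 1)[0]
    return [
        {
            "id": f"{stem}_p{page}",
            "metadata": {"source": source, "page": page, "type": "pdf_page"},
        }
        for page in range(1, num_pages + 1)
    ]


records = page_records("100m_offers.pdf", 200)
print(len(records), records[41]["id"])  # 200 100m_offers_p42
```

Because every vector carries its own page number, a search hit maps straight back to a single page.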

For video, the cap is 120 seconds. You have two good options. If a clip is under 2 minutes, embed the whole thing. For longer or more precise needs, extract frames at regular intervals, embed those frames as images, and store the timestamp with each one. Then a query can jump to the exact moment in the video. Audio works the same way under its 180-second limit.
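
Here is a small sketch of the timing math behind both options. `frame_timestamps` lists the moments to grab frames at a fixed interval. `segments` is an added assumption: it splits a long recording into consecutive windows that each fit under the cap, which is one way to handle long audio. Neither function touches a media file.

```python
def frame_timestamps(duration_s: int, interval_s: int) -> list[int]:
    """Seconds at which to grab a frame, starting at 0."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return list(range(0, duration_s, interval_s))


def segments(duration_s: int, max_len_s: int = 120) -> list[tuple[int, int]]:
    """Split [0, duration_s) into consecutive windows of at most max_len_s seconds."""
    if max_len_s <= 0:
        raise ValueError("max_len_s must be positive")
    return [
        (start, min(start + max_len_s, duration_s))
        for start in range(0, duration_s, max_len_s)
    ]


print(frame_timestamps(30, 10))        # [0, 10, 20]
print(segments(300))                   # [(0, 120), (120, 240), (240, 300)]
print(segments(400, max_len_s=180))    # audio cap: [(0, 180), (180, 360), (360, 400)]
```

Store the timestamp or the window start in the metadata so a hit can seek directly to that point in the player.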

Do You Need a Multimodal LLM, or Just Multimodal Embeddings?

For finding the right content, you only need multimodal embeddings. The embedding model handles retrieval. The multimodal LLM matters at the next step, when you want the AI to read the retrieved images and pages and write an answer about them.

So in this build, Gemini Embedding 2 does the searching, and a vision-capable model does the chatting. In the demo I asked for "a graphic where Hormozi explains the 100 million offer model" and got a text summary plus the relevant pages. I asked for "the video where I show AI benchmarks" and it pulled up the right screen recording. That is chat with vision: the answer and the source images, together.

Why Image Retrieval Is Often Harder Than Text

Worth saying plainly, because it trips people up. With the old CLIP-style setups, image and text vectors tend to cluster apart even in a shared space. People call this the modality gap. CLIP also tends to over-score generic-looking images, so you get visually bland results that do not actually match the question.

Native multimodal models like Gemini Embedding 2 narrow that gap a lot, which is the main reason I moved off the two-model approach. If you do hit weak image results, the common fixes are reranking the top candidates and adding keyword (hybrid) search alongside the vector search. One honest caveat: most rerankers today are still text-only, so reranking across images and text is not yet plug-and-play.
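
One common way to combine keyword and vector results is reciprocal rank fusion (RRF). The sketch below is a generic version of that technique, not code from this project, and the document IDs are invented. It only needs the rank order from each search, so the two score scales never have to be comparable.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; items ranked high in many lists win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # k dampens the advantage of the very top ranks; 60 is the usual default.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["chart_p12", "photo_3", "spec_p4"]   # from the embedding search
keyword_hits = ["spec_p4", "chart_p12", "memo_1"]   # from a keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['chart_p12', 'spec_p4', 'photo_3', 'memo_1']
```

Items that appear in both lists rise to the top, which is the behavior you want when either search alone is noisy.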

The Architecture: Four Thin Layers

The stack is deliberately simple. Four layers:

A React frontend with Vite, TypeScript, shadcn/ui, and TanStack Query. Upload, search, and a chat window.
A FastAPI Python backend exposing the endpoints.
Two scripts, one for ingestion (turning files into vectors) and one for retrieval (running queries).
ChromaDB as the vector store, with Gemini Embedding 2 as the single model behind everything.

The backend exposes the three endpoints any RAG app needs: one to ingest files, one to search, and one to generate a chat answer. The chat answer streams back token by token using Server-Sent Events, so the reply appears as it is written instead of after a long wait. The frontend gives you the upload box, the search bar that also accepts an image, and the chat window.
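
For readers who have not used Server-Sent Events, this is roughly what the stream looks like on the wire. Each frame is a `data:` line followed by a blank line. The JSON payload shape and the closing `done` event are assumptions for illustration, not the project's actual protocol.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame, then signal completion."""
    for token in tokens:
        # JSON-encode so newlines inside a token cannot break the frame.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Hypothetical end marker so the client knows when to stop listening.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Hello", ",", " world"]))
print("".join(frames))
```

In FastAPI, a generator like this can be passed to `StreamingResponse` with `media_type="text/event-stream"`, and the browser reads it with `EventSource` or a streaming `fetch`.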

How I Actually Built It

I always start with a task list. I have my coding agent write a detailed, step-by-step plan first, and I tell it to write that plan for an AI agent, not for a human. The tone is imperative: do this, then do this. It is shorter, it saves tokens, and it leaves no room for the agent to guess.

In the first prompt, I roughly described the architecture I wanted, after doing a bit of research up front. The agent turned that into a task document with the key features, the components like Server-Sent Events, the exact stack for the backend and frontend, the API endpoints, the schemas, and the layout.

Then I work through it in phases instead of asking for everything at once. Phase one is the backend skeleton. Then the backend routes. Then the frontend. Then the frontend hooks and API layer. Then the components. Then I wire both sides together, and optionally dockerize it. Going phase by phase keeps the agent's context clean, and it stays far more accurate that way.

Two parts of that task document earn their keep. A verification checklist, so the agent can confirm each phase works. And an out-of-scope list, so it does not wander off and add authentication or features I never asked for. That list is the difference between a focused build and a hallucinated mess.

When Multimodal RAG Is Worth It

If all your data is plain text, you do not need any of this. Text-only RAG is simpler and cheaper, so use it.

You need multimodal RAG when your real content is visual or spoken: scanned documents, screenshots, charts and tables, product photos, design files, recordings. That is most companies, which is the whole point. The information you most want to search is the information text-only systems cannot see.

What's Next

This is a starter, not a finished product. The obvious next steps are richer timestamps for video frames, an OCR pipeline with stronger models for handwriting and complex tables, and agentic workflows so your AI agents can query this knowledge base on their own.

Everything here is open source except the Gemini model itself. I turned the project into a template you can clone and build on. Grab it, point it at your own files, and you have a knowledge base that finally understands all of your content, not just the words.

Starter Template: GitHub

References

[1] "The Multimodal Retrieval Gap: Why Text-Only RAG Fails When 90% of Your Data Isn't Text". RAG About It. 2026.
[2] "Gemini Embedding 2: Our First Natively Multimodal Embedding Model". Google. March 10, 2026.
[3] "Building With Gemini Embedding 2: Agentic Multimodal RAG and Beyond". Google Developers Blog. March 2026.
[4] "Matryoshka Representation Learning". Kusupati et al., NeurIPS 2022.
[5] "Chroma: The Open-Source Vector Database". Chroma. 2026.

About the author

Tobias Wupperfeld

Tobias is an independent AI engineer and operator who has shipped AI systems inside startups and scale-ups across fintech, procurement, and developer tooling. He runs Made By Agents and consults for JAN3, where he leads AI integration across the AQUA product line.
