Archie: Hybrid Multimodal RAG

A deep dive into building multimodal retrieval-augmented generation (RAG) from scratch, covering data ingestion, embedding comparisons (CLIP vs. others), and a hybrid retrieval architecture for technical diagrams.

Overview

While text-based RAG is well-trodden ground, building a truly multimodal retrieval pipeline remains the Wild West. At Archie, we needed our AI to understand not just code repositories but also the visual context of architecture diagrams and screenshots.

In this talk, I will share the engineering journey of building a multimodal RAG system from scratch when no tutorials existed. I will cover:

Data Ingestion: How to process and chunk distinct modalities (images vs. text) effectively.

Embedding Strategies: Comparing CLIP with newer multimodal embedding models, and what actually worked for technical diagrams.

Retrieval Architecture: How we structured our vector search to perform hybrid retrieval (text + image) so that the multimodal LLM (MLLM) is grounded in the correct context.

The “Gotchas”: Specific failures we encountered when trying to scale vision-based retrieval.
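
To make the hybrid retrieval idea concrete, here is a minimal sketch. It is not Archie's actual implementation. It assumes each indexed record carries two embeddings (one for its text, one for its image), that the query has been embedded into both spaces, and that the two rankings are merged with reciprocal rank fusion (RRF). The names Record and hybrid_retrieve, the toy two-dimensional vectors, and the choice of RRF are all illustrative assumptions; the talk does not say which fusion method was used.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class Record:
    """Hypothetical index entry: one diagram plus its surrounding text."""
    doc_id: str
    text_vec: Sequence[float]   # embedding of caption / nearby text
    image_vec: Sequence[float]  # embedding of the diagram itself


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_text_vec, query_image_vec, records, top_k=3, rrf_k=60):
    """Rank records by text and by image similarity, then fuse with RRF."""
    # One ranking per modality, best match first.
    by_text = sorted(records, key=lambda r: cosine(query_text_vec, r.text_vec), reverse=True)
    by_image = sorted(records, key=lambda r: cosine(query_image_vec, r.image_vec), reverse=True)

    # RRF: each ranking contributes 1 / (rrf_k + position) to a record's score.
    fused = {}
    for ranking in (by_text, by_image):
        for position, record in enumerate(ranking, start=1):
            fused[record.doc_id] = fused.get(record.doc_id, 0.0) + 1.0 / (rrf_k + position)

    return sorted(fused, key=fused.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real model embeddings.
    records = [
        Record("auth-flow", (1.0, 0.0), (0.9, 0.1)),
        Record("db-schema", (0.7, 0.7), (0.0, 1.0)),
        Record("deploy", (0.0, 1.0), (0.6, 0.8)),
    ]
    print(hybrid_retrieve((1.0, 0.0), (1.0, 0.0), records))
```

Fusing on rank rather than raw similarity sidesteps the fact that scores from a text encoder and an image encoder are not directly comparable. A production system would replace the linear scans with a vector index per modality.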

Links

Tech stack