CityPulse: Multi-Modal Video Understanding

Building CityPulse: integrating LLaVA, Whisper, and Llama models with pgvector for hyperlocal video understanding and semantic search on local AI infrastructure.

Overview

I built CityPulse, a hyperlocal video discovery platform for NYC that uses multi-modal AI to understand what’s actually happening in videos, not just what users say in titles or hashtags.

The Core Technical Innovation: Traditional video platforms rely on user-provided metadata. We extract semantic meaning from the content itself by combining three AI modalities:
Vision (LLaVA 13b) - Frame-by-frame analysis detecting venues, atmospheres, activities, and reading signs/text
Audio (Whisper) - Transcribing what people say, detecting music vs speech, extracting venue names from conversation
Text (Llama 3.2) - Generating context-aware titles and summaries from combined visual + audio understanding
The outputs of all three are combined into 384-dimensional embeddings stored in PostgreSQL with pgvector, enabling semantic searches such as “Show me comedy clubs in Brooklyn”. That query returns videos whose frames show microphones, stages, and audience setups, even when “comedy” appears nowhere in the title.

Technical Deep Dive:
Multi-modal RAG pipeline - How we combine vision, audio, and text into searchable embeddings
Progressive enhancement architecture - Videos visible in 5 seconds, fully AI-enhanced in 90 seconds through 3-tier incremental processing
Semantic search with pgvector - Cosine similarity search returning results based on understanding, not keyword matching
Local AI stack - Running Whisper, LLaVA, and Llama models for privacy and customization (no external APIs)
Parallel processing patterns - SQLAlchemy async session management for concurrent vision + audio analysis
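
A minimal sketch of the concurrent vision + audio step from the list above, using only Python's standard library. The function names (analyze_frames, transcribe_audio, enhance) and the returned fields are hypothetical stand-ins, not the project's actual code, and the model calls are simulated with short sleeps. Database access is left out: SQLAlchemy's documentation advises against sharing one AsyncSession across concurrent tasks, so a real version would likely open a separate session inside each task.

```python
import asyncio


# Hypothetical stand-in for the LLaVA frame analysis call.
async def analyze_frames(video_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulates model latency
    return {"scene": "stage with microphone and seated audience"}


# Hypothetical stand-in for the Whisper transcription call.
async def transcribe_audio(video_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulates model latency
    return {"transcript": "thanks for coming out tonight"}


async def enhance(video_id: str) -> dict:
    # Vision and audio do not depend on each other, so they run concurrently.
    vision, audio = await asyncio.gather(
        analyze_frames(video_id),
        transcribe_audio(video_id),
    )
    # The text step needs both results, so it runs only after both finish.
    embedding_text = f"{vision['scene']}. Heard: {audio['transcript']}"
    return {"video_id": video_id, "embedding_text": embedding_text}


result = asyncio.run(enhance("demo-video"))
print(result["embedding_text"])
```

The combined string is what would then be passed to the embedding model; which fields actually go into that string is an assumption here.
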
The “Wow” Moment: Search “street art in Brooklyn” and get videos titled “Morning Walk”, because our vision model detected graffiti and murals in the frames. Ask “What’s happening in Williamsburg tonight?” and get AI-generated summaries from videos uploaded in the last 24 hours, understanding context across multiple sources.

This is a code walkthrough of building multi-modal understanding for real-world video content, handling the messy reality of street-level footage where the interesting context isn’t in metadata: it’s in what the AI sees and hears.
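
The “street art” search above can be illustrated with a small, self-contained sketch of cosine-similarity ranking. The vectors are made-up 3-dimensional toys standing in for the 384-dimensional embeddings, and the titles other than “Morning Walk” are invented for the example. In pgvector, the `<=>` operator returns cosine distance (1 minus cosine similarity); the table and column names in the SQL string are assumptions, not the project's real schema.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values). The first video's frames were
# described as murals and graffiti, so its vector sits near the query.
videos = {
    "Morning Walk": [0.9, 0.1, 0.2],
    "Brunch Spot": [0.1, 0.9, 0.1],
    "Subway Ride": [0.2, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of "street art in Brooklyn"

# Rank titles by similarity to the query, best match first.
ranked = sorted(
    videos,
    key=lambda title: cosine_similarity(query, videos[title]),
    reverse=True,
)
print(ranked[0])  # Morning Walk

# Roughly equivalent pgvector query (assumed table and column names).
# Ordering by cosine distance ascending matches similarity descending.
SEARCH_SQL = """
SELECT id, title, 1 - (embedding <=> :query_embedding) AS similarity
FROM videos
ORDER BY embedding <=> :query_embedding
LIMIT 10
"""
```

The title “Morning Walk” never mentions street art; it ranks first only because its embedding, built from what the models saw and heard, is closest to the query.
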

Links

https://www.pulse-nyc.com
Multimodal AI uses computer vision for semantic, hyperlocal NYC video search.

Tech stack