Gemini vs YOLO Object Detection

A comparison of Gemini 2.5 Flash-Lite and YOLO for real-world object detection in a Telegram app, covering costs, failure handling, and limitations.

Overview

In this demo, I will show how I use an LLM for object detection in a Telegram Mini App and compare this approach with classical vision models such as YOLO.

In the app, users need to take a photo of a physical gift before putting it into a public box. The photo is sent to the backend, where an LLM (Gemini 2.5 Flash-Lite) looks at the image and returns a simple result: object category, confidence, and a short description.
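To make the shape of that result concrete, here is a minimal sketch of how a backend could parse and validate the model's reply before anything else touches it. It assumes the model is asked to answer with a JSON object; the field names `category`, `confidence`, and `description`, the 0 to 1 confidence range, and the `DetectionResult` and `parse_detection` names are illustrative assumptions, not the app's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionResult:
    # Hypothetical field names mirroring "category, confidence, description".
    category: str
    confidence: float
    description: str


def parse_detection(raw: str) -> DetectionResult:
    """Parse the model's JSON reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("model reply must be a JSON object")

    category = data.get("category")
    confidence = data.get("confidence")
    description = data.get("description", "")

    if not isinstance(category, str) or not category:
        raise ValueError("missing or empty 'category'")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("'confidence' must be a number")
    # Assumption: confidence is reported on a 0..1 scale.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("'confidence' must be between 0 and 1")
    if not isinstance(description, str):
        raise ValueError("'description' must be a string")

    return DetectionResult(category, float(confidence), description)
```

Rejecting a malformed reply with a single exception type keeps the calling code simple: anything that does not fit the expected shape can be treated as a failed model response.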

The LLM does not make final decisions. Its output is checked by simple rules in the backend, which decide if the user can continue, if the gift should be blocked, or if an admin needs to review it.
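A rule layer of this kind can be very small. The sketch below is only an illustration of the idea: the blocked category names, the 0.8 threshold, and the `decide` function are hypothetical values chosen for the example, not the rules used in the app.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"    # user can continue
    BLOCK = "block"    # gift is rejected
    REVIEW = "review"  # an admin needs to look at it


# Hypothetical values for illustration only.
BLOCKED_CATEGORIES = frozenset({"weapon", "alcohol"})
CONFIDENCE_THRESHOLD = 0.8


def decide(category: str, confidence: float) -> Decision:
    """Map the model's output to an outcome using fixed backend rules."""
    # Low-confidence answers are never trusted automatically.
    if confidence < CONFIDENCE_THRESHOLD:
        return Decision.REVIEW
    # A confident match on a disallowed category blocks the gift.
    if category in BLOCKED_CATEGORIES:
        return Decision.BLOCK
    return Decision.ALLOW


if __name__ == "__main__":
    print(decide("toy", 0.93))     # Decision.ALLOW
    print(decide("weapon", 0.91))  # Decision.BLOCK
    print(decide("toy", 0.42))     # Decision.REVIEW
```

Keeping the decision in plain code means the thresholds and category lists can be changed and tested without touching the prompt or the model.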

I will explain why I chose an LLM instead of a classical CV pipeline, how much it costs per request, how I handle failed model responses, and in which cases this approach performs worse than models like YOLO.

Tech stack