Altura: LLM Evaluations

Learn about a practical setup for LLM evaluation in production, with hard-earned lessons on guiding prompt and code changes in an API used across five countries.

Overview

LLM evaluation is still a new field. In production applications, you can’t use unit tests to test your code because LLM output can change at any time, yet it remains incredibly important to set up metrics that show whether a change has a positive (and, maybe more importantly, not a negative) impact on your AI application. Because the field is so new, we had to jump through a number of hoops at Altura to make evaluation work for us. In this talk, I’d like to present the setup we use to guide prompt and code changes in our API, which helps bid managers with their proposals in five countries.

Tech stack