Penelope: Agentic LLM Testing Orchestrator

Explore Penelope, an agentic orchestrator coordinating multi-step LLM tests. Learn how it manages workflows, model calls via LiteLLM, and structured evaluation routines.

Video

Overview

This talk covers the architecture and implementation of Penelope, the agentic orchestrator used in the Rhesis framework for testing LLM applications. Penelope acts as a control agent that coordinates multi-step test executions, model calls, and evaluation routines. The session will explain how Penelope manages test definitions, executes adaptive workflows, and interacts with model endpoints via LiteLLM. I will discuss how evaluation tasks are modeled as agent goals, how results are captured in structured form, and how the system supports reproducible multi-turn tests. We will also look at the interface between the orchestration layer and the evaluation layer, including how LLMs are used to generate test cases, expected behaviors, and automatic scoring prompts.

Links

https://github.com/rhesis-ai/rhesis/
Rhesis is an open-source platform generating comprehensive, automated Gen AI test scenarios using LLMs.
https://docs.rhesis.ai/penelope
Penelope autonomously executes multi-turn, LLM-driven tests against conversational AI endpoints.

Tech stack