Ensemble LLM Judge Bias Reduction

Demonstrates ensemble LLM judging on ELI5 explanations of arXiv abstracts, using direct and pairwise evaluations across 1.6 million runs to cut bias, and shows GPT-OSS to be both the best-performing model family and the strictest judge.

Overview

In a recent example guide for Sutro, we showed how you can remove the biases present in single LLM judges by using an ensemble approach. Using ELI5 explanations of arXiv abstracts, we ran both direct evaluations and pairwise comparisons at scale, 1.6 million evals in total. We found that GPT-OSS is both the best model family at the task and the harshest judge of the others.
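
As a rough illustration of the ensemble idea, the sketch below pools verdicts from several judges for both evaluation styles: direct scores are averaged per candidate, and pairwise comparisons are settled by majority vote. It is not the guide's implementation and does not use the Sutro API. The judge names, model names, scores, and the helper functions `ensemble_direct` and `ensemble_pairwise` are all hypothetical, and mean/majority aggregation is an assumed choice.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical direct scores: (judge, candidate) -> score on a 1-10 scale.
direct_scores = {
    ("judge_a", "model_x"): 8, ("judge_b", "model_x"): 6, ("judge_c", "model_x"): 7,
    ("judge_a", "model_y"): 5, ("judge_b", "model_y"): 6, ("judge_c", "model_y"): 4,
}

# Hypothetical pairwise verdicts: (judge, left, right, winner).
pairwise_votes = [
    ("judge_a", "model_x", "model_y", "model_x"),
    ("judge_b", "model_x", "model_y", "model_y"),
    ("judge_c", "model_x", "model_y", "model_x"),
]


def ensemble_direct(scores):
    """Average each candidate's score across all judges."""
    per_candidate = defaultdict(list)
    for (_judge, candidate), score in scores.items():
        per_candidate[candidate].append(score)
    return {cand: mean(vals) for cand, vals in per_candidate.items()}


def ensemble_pairwise(votes):
    """Pick the majority winner for each (left, right) pair."""
    tallies = defaultdict(Counter)
    for _judge, left, right, winner in votes:
        tallies[(left, right)][winner] += 1
    # With an even number of judges a tie is possible; most_common then
    # returns whichever tied winner was counted first.
    return {pair: c.most_common(1)[0][0] for pair, c in tallies.items()}


direct_result = ensemble_direct(direct_scores)
pairwise_result = ensemble_pairwise(pairwise_votes)
print(direct_result)
print(pairwise_result)
```

Averaging dilutes any one judge's systematic leniency or harshness, and using an odd number of judges avoids ties in the pairwise vote.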

Links

https://docs.sutro.sh/examples/llm-as-a-judge
Sutro uses ensemble LLM-as-a-Judge for scalable, offline, relative model benchmarking.
https://sutro.sh/

Tech stack