You must be an AI Tinkerers active member to view these talks and demos.
Ensemble LLM Judge Bias Reduction
Demonstrates ensemble LLM judging on ELI5 explanations of arXiv abstracts, using direct and pairwise evaluations across 1.6 million runs to reduce bias, and finds GPT-OSS to be both the top-performing model family and the strictest judge.
In a recent example guide for Sutro, we showed how an ensemble approach can remove the biases present in single LLM judges. Using ELI5 explanations of arXiv abstracts, we apply both direct evaluation and pairwise comparison at scale, running 1.6 million evals in total. We show that GPT-OSS is both the best model family at the task and the harshest judge of the others.
Sutro uses ensemble LLM-as-a-Judge for scalable, offline, relative model benchmarking.
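To make the ensemble idea concrete, here is a minimal sketch of how scores from several judges might be combined. It is not Sutro's implementation or API: the judge and model names, the 1-10 scale, and the sample data are all hypothetical, and simple averaging and majority voting are assumed as the aggregation rules.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical direct-evaluation rows: (judge, candidate, score on an assumed 1-10 scale).
scores = [
    ("judge_a", "model_x", 8), ("judge_a", "model_y", 6),
    ("judge_b", "model_x", 7), ("judge_b", "model_y", 7),
    ("judge_c", "model_x", 5), ("judge_c", "model_y", 3),
]

# Hypothetical pairwise rows: (judge, first candidate, second candidate, winner).
votes = [
    ("judge_a", "model_x", "model_y", "model_x"),
    ("judge_b", "model_x", "model_y", "model_y"),
    ("judge_c", "model_x", "model_y", "model_x"),
]


def ensemble_scores(rows):
    """Average each candidate's score over every judge in the ensemble."""
    by_candidate = defaultdict(list)
    for _judge, candidate, score in rows:
        by_candidate[candidate].append(score)
    return {c: mean(v) for c, v in by_candidate.items()}


def judge_strictness(rows):
    """Mean score each judge hands out; a lower mean suggests a stricter judge."""
    by_judge = defaultdict(list)
    for judge, _candidate, score in rows:
        by_judge[judge].append(score)
    return {j: mean(v) for j, v in by_judge.items()}


def pairwise_wins(rows):
    """Count, per candidate, how many judges picked it as the winner."""
    return Counter(winner for _judge, _first, _second, winner in rows)


candidate_means = ensemble_scores(scores)
strictness = judge_strictness(scores)
wins = pairwise_wins(votes)

best_candidate = max(candidate_means, key=candidate_means.get)
strictest_judge = min(strictness, key=strictness.get)
majority_winner = wins.most_common(1)[0][0]
```

Averaging over judges dampens the effect of any single judge's leniency or harshness on the final ranking, while the per-judge mean exposes which judge scores lowest overall. A real pipeline would likely also control for self-preference and for answer ordering in pairwise prompts.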