CompileBench: LLMs Compiling Chrome

The talk introduces CompileBench, an evaluation that requires LLMs to compile real open-source projects, and covers its methodology, challenges, results, and differences between models.

Overview

Most LLM coding benchmarks are becoming saturated and are skewed toward cute algorithmic puzzles. They ignore the messy realities of software work: dependency hell, weird build systems, toolchain quirks, and walls of logs.

I’ll introduce CompileBench, a new eval that challenges LLMs to compile real open-source projects from scratch. Tasks range from building simple Linux utilities to building a large open-source project with dozens of dependencies. We also stress-test models in unfamiliar environments: can they run the build with a toolchain from 2003? The hardest tasks take over 30 minutes and dozens of terminal commands, and they require solving messy (yet realistic) problems and working with obscure build systems and programs.

I’ll share the results, including differences between models, surprising behaviors (such as LLMs attempting to cheat), and the internals of the eval.

Links

https://www.compilebench.com
CompileBench benchmarks LLMs on compiling complex open-source projects in the face of dependency hell.
https://quesma.com/blog/introducing-compilebench/
CompileBench assesses LLM agents' cross-compilation and legacy code build skills via function calling.

Tech stack