CompileBench: LLMs Compiling Chrome
The talk introduces CompileBench, an evaluation that requires LLMs to compile real open-source projects, detailing methodology, challenges, results, and model differences.
Most LLM coding benchmarks are becoming saturated and are skewed toward cute algorithmic puzzles. They ignore the messy realities of software work: dependency hell, weird build systems, toolchain quirks, and walls of logs.
I’ll introduce CompileBench: a new eval that challenges LLMs to compile real open-source projects from scratch. Tasks range from building simple Linux utilities to compiling a large open-source project with dozens of dependencies. We stress-test models in unfamiliar environments: can they run the build with a toolchain from 2003? The hardest tasks take over 30 minutes and dozens of terminal commands, and require solving messy (yet realistic) problems and working with obscure build systems and programs.
I’ll share results, revealing model differences, surprising behaviors (LLMs attempting to cheat) and internals of the eval.
CompileBench benchmarks how well LLMs compile complex open-source projects in the face of dependency hell.
CompileBench assesses LLM agents' cross-compilation and legacy code build skills via function calling.
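To make "via function calling" concrete, here is a minimal sketch of the kind of harness-side tool an agent could call to run build commands. It is an illustration only, not CompileBench's actual implementation: the tool name `run_terminal_cmd`, the output cap, and the timeout are all assumptions.

```python
import json
import subprocess

# Hypothetical tool schema in the JSON-schema style used by function-calling APIs.
RUN_COMMAND_TOOL = {
    "name": "run_terminal_cmd",  # assumed name, not taken from CompileBench
    "description": "Run a shell command in the build environment.",
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

MAX_OUTPUT_CHARS = 4000  # assumed cap so long build logs do not flood the context


def run_terminal_cmd(command: str, timeout_s: int = 600) -> str:
    """Run a shell command; return its exit code plus combined, truncated output."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return f"error: command timed out after {timeout_s}s"
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        # Keep the tail of the log, where build errors usually appear.
        output = "[...truncated...]\n" + output[-MAX_OUTPUT_CHARS:]
    return f"exit_code={proc.returncode}\n{output}"


def dispatch_tool_call(tool_call_json: str) -> str:
    """Route one model-emitted tool call (as a JSON string) to its implementation."""
    call = json.loads(tool_call_json)
    if call.get("name") != RUN_COMMAND_TOOL["name"]:
        return f"error: unknown tool {call.get('name')!r}"
    return run_terminal_cmd(call["arguments"]["command"])
```

In a full agent loop, the string returned by `dispatch_tool_call` would be sent back to the model as the tool result, and the loop would repeat until the model stops issuing commands. An isolated environment such as a Docker container would be the natural place to run these commands.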
- LLMs: Large Language Models (LLMs) are Transformer-architecture deep learning systems (e.g., GPT-4, Llama 3) trained on massive text corpora to generate, summarize, and reason over human language at scale. They are trained on vast, multi-trillion-token datasets and use billions of parameters to learn complex linguistic patterns (syntax, semantics). This scale enables emergent capabilities: few-shot learning, code generation, and complex reasoning. Key examples include OpenAI's GPT-4, Google's Gemini, and Meta's Llama 3. LLMs power applications from conversational AI (ChatGPT) to automated content creation, fundamentally shifting how machines handle unstructured language.
- OpenRouter: A unified API gateway to hundreds of LLMs from over 60 providers, including OpenAI, Anthropic, and Google, offering single-endpoint access, automatic fallbacks, and cost-optimized routing. It removes integration complexity: one API key, one endpoint, and no code rewrites when switching between models like GPT-5 or Claude Sonnet 4.5. The platform handles dynamic routing for cost optimization, pools provider uptime for reliability, and consolidates all usage into a single billing dashboard. According to OpenRouter, it operates at the edge, adds approximately 15ms of latency, and maintains full compatibility with the OpenAI SDK.
- Docker: The open-source, industry-standard containerization platform that packages applications and dependencies into standardized, portable containers for consistent execution across any environment. It uses the Docker Engine (the core runtime) to create lightweight, isolated environments called containers: these units bundle an application’s code, libraries, and configuration. This self-contained approach guarantees consistency, eliminating the 'it works on my machine' problem across development, testing, and production environments (local workstations, cloud, or on-premises). Docker debuted in 2013 and now serves over 20 million developers monthly, simplifying complex workflows like CI/CD and microservices architecture with tools like Docker Hub for image sharing and Docker Compose for multi-container applications.
- Python: The high-level, general-purpose language built for readability, powering everything from web backends to advanced machine learning models. It prioritizes clear, readable syntax (via significant indentation), supporting rapid development for any team. Its ecosystem is massive: use it for robust web development with frameworks like Django and Flask, or for data science with libraries such as Pandas and NumPy. The Python Package Index (PyPI) provides thousands of community-contributed modules, offering immediate solutions for tasks from network programming to GUI creation. The language is actively maintained by the Python Software Foundation (PSF), with the stable release currently at Python 3.14.0 (as of November 2025).
- Cursor: The AI-native code editor designed for high-velocity development through deep LLM integration. Cursor is a fork of VS Code that embeds AI directly into the development workflow while maintaining full extension compatibility. It leverages models like Claude 3.5 Sonnet and GPT-4o to power features such as Cmd+K for inline edits and Cmd+L for codebase-wide chat. By indexing local files, Cursor provides precise context for its predictive 'Tab' completions and multi-file 'Composer' mode. This setup allows engineers to move from high-level intent to functional code without leaving the editor or losing context.
Related projects
Genaicode - programming on steroids
Poland
Live demo of Genaicode, an AI code generator, modifying a personal game in real time and covering latency,…
Compilers of Intent
Chicago
The talk examines how AI can transform any language into verified, executable code, exploring future programming practices and…
Claude Don't Code
Poland
Demonstration of using Claude Code as a terminal assistant for tasks like certificate renewal, CSV analysis, Docker debugging,…
Parsera - scraping websites without writing any scrapers
Poland
Learn how to use the Parsera library to extract structured data from any website by providing a URL…
LLM drives a web browser
New York City
This talk demonstrates an open-source interface that enables large language models to interact with web pages through a…
aius.co—the long-term memory agentic framework
Poland
The session explains aius.co’s modular agentic architecture, detailing long‑term memory persistence, compute allocation, and autonomous evolution of agents…