Benchmarking Small Language Models Where It Actually Matters

Overview

Most SLM benchmarks answer the wrong question.
They tell you how a model scores, not whether the code it generates actually works.

This platform is designed for teams who care about real execution, not paper metrics.

It lets you benchmark Small Language Models (SLMs) on Python and Polars code generation, under strictly controlled hardware conditions, with full visibility into performance, cost, and failure modes.

Participants connect to the platform through a web interface and run benchmarks on large datasets and realistic workloads.
The backend runs inside a Docker environment and can execute jobs directly on GPUs, whether locally or on dedicated infrastructure.

Each run is configurable: quantization, decoding parameters, and runtime settings are part of the experiment, not hidden defaults.
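As an illustration of what "part of the experiment" can mean in practice, here is a minimal sketch of a run configuration recorded alongside each result. The class name, field names, and default values are assumptions made for this example, not the platform's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one benchmark run (illustrative fields only)."""

    model_name: str
    quantization: str = "int4"          # assumed label, e.g. "fp16", "int8", "int4"
    temperature: float = 0.0            # decoding parameter
    top_p: float = 1.0                  # decoding parameter
    max_new_tokens: int = 512           # decoding parameter
    gpu_memory_limit_gb: float = 16.0   # runtime setting
    seed: int = 0                       # runtime setting

    def validate(self) -> None:
        # Reject obviously invalid decoding settings before a job is queued.
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be in [0, 2]")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_new_tokens <= 0:
            raise ValueError("max_new_tokens must be positive")

    def to_json(self) -> str:
        # Sorted keys so that identical configs serialise to identical records.
        return json.dumps(asdict(self), sort_keys=True)


config = RunConfig(model_name="example-slm", quantization="int8", temperature=0.2)
config.validate()
print(config.to_json())
```

Storing the full configuration with every result is one way to keep two runs comparable: a score difference can then be traced to an explicit setting rather than to a hidden default.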

The focus on Polars is intentional.
Most language models are very good at generating code for older, widely used libraries like pandas or NumPy. These libraries have been present in training data for years and appear in millions of examples online.

Polars is different.

It is a newer, high-performance data processing library designed for large-scale workloads, built around vectorized execution, query planning, and expression-based transformations. While it offers major performance advantages for large datasets, its programming model differs significantly from traditional Python data tools: transformations are composed as expressions, often on lazy queries that are optimized before they run, rather than applied step by step as in pandas.

As a result, many models struggle with it.

They may generate code that looks correct but fails to run, produces incorrect results, or uses inefficient patterns that defeat the performance benefits of the library. This makes Polars an ideal stress test for evaluating whether a model truly understands modern data-processing workflows.

Beyond model metrics (tokens/sec, VRAM usage, GPU utilization), the platform evaluates the generated code itself:

Does it run?

Does it produce the correct result?

Is it efficient, or just “technically correct but slow”?
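The three questions above can be sketched as a small evaluation step. This is a simplified illustration under stated assumptions: the helper names (`evaluate_snippet`, `classify`), the outcome labels, and the time budget are hypothetical, and `exec()` provides no isolation, so a real harness would run generated code in a sandboxed process or container.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EvalResult:
    ran: bool
    correct: bool
    seconds: Optional[float]
    error: Optional[str] = None


def evaluate_snippet(source: str, entry_point: str, args: tuple, expected: Any) -> EvalResult:
    """Execute generated code, call its entry point, and compare the result.

    Hypothetical helper: exec() is used only to keep the sketch short.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)               # does the code even load?
        func = namespace[entry_point]
        start = time.perf_counter()
        result = func(*args)                  # does it run to completion?
        elapsed = time.perf_counter() - start
    except Exception as exc:
        return EvalResult(ran=False, correct=False, seconds=None, error=repr(exc))
    return EvalResult(ran=True, correct=(result == expected), seconds=elapsed)


def classify(result: EvalResult, time_budget_s: float) -> str:
    """Map a result onto the three questions: runs, correct, efficient."""
    if not result.ran:
        return "failed_to_run"
    if not result.correct:
        return "wrong_result"
    if result.seconds > time_budget_s:
        return "correct_but_slow"
    return "correct_and_efficient"


# Stand-in for model output; a real task would target Polars code.
generated = "def total(values):\n    return sum(values)\n"
outcome = evaluate_snippet(generated, "total", ([1, 2, 3],), expected=6)
print(classify(outcome, time_budget_s=1.0))
```

Keeping the three outcomes separate matters for analysis: a model that fails to run and a model that runs slowly need different fixes, and a single pass/fail flag would hide that.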

To make progress measurable and engaging, the platform also supports a hackathon-style workflow:

Real-time leaderboards

Full attempt history per team and per benchmark

Analytics dashboards highlighting where models systematically fail
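To make the workflow above concrete, here is a rough sketch of how attempts might be stored and aggregated into a leaderboard, a per-team history, and a failure summary. The `Attempt` record, the scoring convention (higher is better, best attempt per benchmark counts), and the failure labels are all assumptions for illustration; the platform's real data model may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    team: str
    benchmark: str
    score: float        # assumed convention: higher is better
    failure: str = ""   # empty if the attempt passed, else a failure label


def leaderboard(attempts):
    """Rank teams by the sum of their best score on each benchmark."""
    best = defaultdict(dict)
    for a in attempts:
        previous = best[a.team].get(a.benchmark, float("-inf"))
        best[a.team][a.benchmark] = max(previous, a.score)
    totals = {team: sum(scores.values()) for team, scores in best.items()}
    # Highest total first; ties broken alphabetically by team name.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def history(attempts, team, benchmark):
    """Return every attempt a team made on one benchmark, in submission order."""
    return [a for a in attempts if a.team == team and a.benchmark == benchmark]


def failure_counts(attempts):
    """Count failure labels across all attempts to surface systematic problems."""
    counts = defaultdict(int)
    for a in attempts:
        if a.failure:
            counts[a.failure] += 1
    return dict(counts)


attempts = [
    Attempt("alpha", "aggregation", 1.0, failure="wrong_result"),
    Attempt("alpha", "aggregation", 3.0),
    Attempt("beta", "aggregation", 2.0),
    Attempt("beta", "joins", 2.0, failure="failed_to_run"),
]
print(leaderboard(attempts))
print(failure_counts(attempts))
```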

The goal is simple:
give teams a shared vocabulary for model quality. Not “good vs bad,” but correct, efficient, reliable, and production-ready.

If you don’t measure this, you might end up shipping models that look good in demos… and quietly break in real workloads.

Links

Tech stack