JetBrains AI Codebase Benchmarks

This project explores the JetBrains Long Code Arena benchmarks. We demonstrate project‑wide code completion and library‑based code generation, discuss context strategies, and evaluate new syntax and n‑gram metrics.

Overview

We are contributing to Long Code Arena (LCA), an open-source project by JetBrains Research. LCA consists of 6 benchmarks that evaluate how well AI models handle tasks that span a developer’s entire project. We have been working on two of them: project-level code completion and library-based code generation. Project-level code completion uses the full project as context to generate the next line of code in a file. Library-based code generation tests the model’s ability to generate appropriate code that relies on library methods. We evaluated several models and measured their performance using key benchmark-specific metrics. We also applied various techniques to improve model performance, including changes to how we structure the prompts and what additional context we provide. In addition, we contributed more metrics, such as syntax matching and n-gram matching, to assess the quality of model output more effectively. This work matters because it lets us experiment with various context collection techniques on the source datasets provided by JetBrains.
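To make the two added metrics concrete, here is a minimal Python sketch of what n-gram matching and syntax matching can look like. It is an illustration under stated assumptions, not the project's actual implementation: the function names `ngram_precision` and `syntax_match` are hypothetical, tokens are split on whitespace, and syntax matching is simplified to an overlap of AST node types using Python's standard `ast` module.

```python
import ast
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also occur in the reference.

    Assumption: whitespace tokenization; counts are clipped so a repeated
    n-gram is only credited as often as it appears in the reference.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def syntax_match(candidate, reference):
    """Fraction of the reference's AST node types recovered by the candidate.

    Assumption: a simplified stand-in for syntax matching that compares
    multisets of node type names; returns 0.0 if either input fails to parse.
    """
    try:
        cand_nodes = Counter(type(node).__name__ for node in ast.walk(ast.parse(candidate)))
        ref_nodes = Counter(type(node).__name__ for node in ast.walk(ast.parse(reference)))
    except SyntaxError:
        return 0.0
    total = sum(ref_nodes.values())
    overlap = sum(min(count, cand_nodes[name]) for name, count in ref_nodes.items())
    return overlap / total


if __name__ == "__main__":
    generated = "x = foo ( y )"
    expected = "x = bar ( y )"
    print(ngram_precision(generated, expected, n=2))
    print(syntax_match("x = foo(y)", "x = bar(y)"))
```

The n-gram score rewards surface overlap with the reference line, while the syntax score ignores identifier names and only looks at structure, which is why the two can disagree on the same pair of lines.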

Links

https://github.com/CSC392-CSC492-Building-AI-ML-systems/Autumn2025-...
A Python benchmark suite for large-context code generation and repair tasks.

Tech stack