Evals
Evals is OpenAI's open-source framework for systematically benchmarking Large Language Models (LLMs) and LLM-powered systems for performance, accuracy, and stability.
The framework provides a structured, reproducible way to test models such as GPT-4 against specific criteria (accuracy, reasoning, instruction following). It includes a public registry of benchmarks and lets developers create custom evals from proprietary data, so tests can match application needs without exposing that data publicly. Evals supports continuous quality assurance (QA) by catching regressions and verifying stability before any production deployment.
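As a concrete illustration of the custom-eval workflow, the sketch below prepares a JSONL samples file of the kind the framework's built-in Match template consumes, following the build-your-own-eval pattern documented in the evals repository. The eval name `arithmetic-qa` and the file path are illustrative assumptions, not part of the registry.

```python
import json

# Illustrative samples for a custom eval: each record pairs a chat-format
# "input" with the "ideal" answer the model's completion is matched against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "What is 17 + 25?"},
        ],
        "ideal": "42",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "What is 9 * 6?"},
        ],
        "ideal": "54",
    },
]

# Write the samples in JSONL form; the path is an assumption for this sketch.
with open("arithmetic-qa.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# A matching registry entry (YAML) would point the built-in Match template
# at this file, roughly:
#
#   arithmetic-qa:
#     id: arithmetic-qa.dev.v0
#     metrics: [accuracy]
#   arithmetic-qa.dev.v0:
#     class: evals.elsuite.basic.match:Match
#     args:
#       samples_jsonl: arithmetic-qa.jsonl
#
# The eval would then be run from the command line, e.g.:
#   oaieval gpt-4 arithmetic-qa
```

Because the samples stay in a local JSONL file, proprietary test data never needs to leave the developer's environment, which is the point of the custom-eval path.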