Code Quality: Hiding LLM Non-determinism

Learn how to hide LLM non-determinism and quantitatively optimize accuracy in AI code review systems, a crucial lesson for anyone building LLM-based applications.

Python, git, GPT-4, Code Quality, LLM providers, GPT-3, Llama-2, PaLM 2, BLOOM, BERT, RoBERTa

Overview

“Code Quality” is a system that posts feedback comments on your pull requests. We’ll take a deep dive into how we hide non-determinism caused by LLMs and how we optimize the LLM’s accuracy.
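The overview does not spell out the mechanism, so the following is only a minimal sketch of one common way to mask run-to-run variation: sample the model several times, keep only the findings that a majority of samples agree on, and cache the result by a hash of the diff so that re-running on an unchanged pull request returns identical comments. The `review_once` callable, the `min_votes` threshold, and the in-memory cache are hypothetical names chosen for illustration; this is not presented as how Code Quality itself works.

```python
import hashlib
import itertools
from collections import Counter
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical in-memory cache: diff hash -> stable set of findings.
_cache: Dict[str, FrozenSet[str]] = {}


def stable_review(
    diff: str,
    review_once: Callable[[str], Iterable[str]],  # assumed: one LLM call returning finding IDs
    samples: int = 5,
    min_votes: int = 3,
) -> FrozenSet[str]:
    """Return the findings reported by at least `min_votes` of `samples` runs."""
    key = hashlib.sha256(diff.encode("utf-8")).hexdigest()
    if key in _cache:
        # Same diff as before: reuse the earlier answer instead of re-sampling.
        return _cache[key]

    votes: Counter = Counter()
    for _ in range(samples):
        # Deduplicate within a run so each run casts at most one vote per finding.
        votes.update(set(review_once(diff)))

    result = frozenset(f for f, n in votes.items() if n >= min_votes)
    _cache[key] = result
    return result


if __name__ == "__main__":
    # Stand-in for a flaky model: the second finding only shows up in some runs.
    fake_outputs = itertools.cycle([
        ["unchecked-null"],
        ["unchecked-null", "slow-loop"],
        ["unchecked-null"],
    ])
    print(sorted(stable_review("diff --git a/x b/x", lambda d: next(fake_outputs))))
```

The trade-off in a design like this is cost versus stability: more samples make the vote steadier but multiply the number of model calls, which is why the cache sits in front of the sampling loop.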

Links

https://www.aikido.dev/code-quality
Aikido AI analyzes pull requests for logic, stability, and performance flaws.

Tech stack

Python

Python: The high-level, general-purpose language built for readability, powering everything from web backends to advanced machine learning models.

Python is a high-level, general-purpose language that prioritizes clear, readable syntax (via significant indentation) and supports rapid development for any team. Its ecosystem is massive: use it for web development with frameworks like Django and Flask, or for data science with libraries such as Pandas and NumPy. The Python Package Index (PyPI) provides thousands of community-contributed modules, offering ready-made solutions for tasks from network programming to GUI creation. The language is actively maintained by the Python Software Foundation (PSF), with the stable release at Python 3.14.0 as of November 2025.

https://python.org

git

Git is the distributed version control system (DVCS) that tracks source code changes, ensuring data integrity and enabling non-linear development workflows.

Git is the free, open-source distributed version control system (DVCS) created by Linus Torvalds in 2005 to manage the Linux kernel. Engineered for speed and efficiency, it handles projects from small to extremely large, storing the Linux project's entire history of 1.4 million commits in only 5.5 GB. Its core design supports non-linear development (branching/merging) and guarantees data integrity via cryptographic hashing. According to a 2022 Stack Overflow survey, 96% of professional developers use Git, making it the industry standard for collaborative software development.
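To make the data-integrity point concrete, here is a small sketch of how Git derives the ID of a file's contents (a "blob"): it hashes a short header followed by the raw bytes, so any change to the content changes the ID. The function name `git_blob_id` is made up for this example, and the sketch assumes a repository using Git's default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))                # ID of the empty blob
print(git_blob_id(b"hello world\n"))   # a one-byte change would give a different ID
```

The result can be compared against `git hash-object <file>` for the same bytes.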

https://git-scm.com

GPT-4

GPT-4 is OpenAI’s large multimodal model: it processes both text and image inputs, delivering human-level performance on complex professional and academic benchmarks.

At its release in March 2023, GPT-4 was OpenAI’s latest milestone in scaling deep learning: a large multimodal model accepting both text and image inputs. It demonstrates a significant capability leap over its predecessor, scoring in the top 10% on a simulated bar exam (GPT-3.5 scored in the bottom 10%). The model handles nuanced instructions and long-form content, supporting context windows up to 32,768 tokens (32K model). This capacity allows processing up to 25,000 words in a single, complex prompt. GPT-4 is engineered for enhanced reliability, steerability, and advanced reasoning across diverse tasks.

https://platform.openai.com/docs/models/gpt-4

Code Quality
LLM providers

LLM providers are the core entities (OpenAI, Google, Anthropic) that develop, host, and offer large language models (GPT-4, Gemini, Claude) via API for enterprise integration.

LLM providers deliver the foundational models that power generative AI applications: they handle the massive training and inference infrastructure. Major players include OpenAI (GPT-4o), Google (Gemini 2.5 Pro), and Anthropic (Claude Opus), alongside cloud platforms like Amazon Bedrock and Microsoft Azure, which offer managed access to a model catalog. Businesses access these models primarily through secure APIs, enabling use cases like advanced customer service, automated code generation, and complex data summarization. The market is highly competitive, focusing on key metrics: lower cost-per-token, ultra-low latency, and expanding context windows (up to 1 million tokens in some models).

https://cloud.google.com/vertex-ai

GPT-3

A 175-billion parameter autoregressive language model that masters complex tasks through few-shot learning.

OpenAI debuted GPT-3 in 2020 as a transformer-based model trained on 570GB of filtered text. It uses 175 billion parameters to perform diverse tasks (including Python scripting and logical reasoning) from natural language prompts alone. This architecture removed the requirement for task-specific fine-tuning, establishing the foundation for modern tools like GitHub Copilot and the initial ChatGPT release.

https://openai.com/index/gpt-3-powers-next-generation-of-apps/

Llama-2

Llama 2 is Meta AI's powerful, openly accessible family of large language models (LLMs), featuring models from 7B to 70B parameters for research and commercial applications.

Llama 2 is Meta AI's next-generation LLM family, released for free research and commercial use. The collection includes both pre-trained foundation models and instruction-tuned 'Chat' variants, scaling from 7 billion (7B) up to 70 billion (70B) parameters. Key technical upgrades over Llama 1 involve training on 2 trillion tokens (40% more data) and doubling the context length to 4096 tokens. The Llama-2-chat models were rigorously aligned using Reinforcement Learning from Human Feedback (RLHF), positioning them as a top-tier, openly available option for developers building advanced generative AI solutions.

https://ai.meta.com/llama/

PaLM 2

Google's versatile large language model optimized for advanced reasoning, multilingual translation, and coding across four distinct scales.

At its 2023 launch, PaLM 2 powered 25+ Google products (including Bard and Workspace) using a Transformer-based architecture trained on a massive corpus of 100+ languages. It excels in specialized tasks: solving complex math problems, generating high-quality code, and passing professional-level exams. Developers deploy the model via the PaLM API in four sizes: Gecko, Otter, Bison, and Unicorn. Gecko is lightweight enough to run locally on mobile devices (offline), while Unicorn handles the most complex, data-heavy reasoning tasks at scale.

https://ai.google/discover/palm2/

BLOOM

A 176-billion parameter open-access multilingual language model built by the BigScience research collective.

BLOOM is the result of a year-long collaboration involving 1,000+ researchers from 70+ countries. It supports 46 natural languages and 13 programming languages, providing a high-performance alternative to proprietary models. The model was trained on the Jean Zay supercomputer in France using the 1.6-terabyte ROOTS dataset (a massive collection of diverse text sources). By providing full access to its weights and training process, BLOOM enables global developers to build and audit AI tools without the restrictions of closed-door APIs.

https://huggingface.co/bigscience/bloom

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a foundational, pre-trained NLP model that uses a Transformer encoder to process text bidirectionally, capturing full word context for superior language understanding.

BERT is a revolutionary language representation model introduced by Google AI Language in 2018. It is built on the Transformer architecture and distinguishes itself by being deeply bidirectional: it processes the entire sequence of words (left and right context) simultaneously, unlike previous unidirectional models. This capability is achieved through a Masked Language Model (MLM) pre-training objective. The model, released in sizes like BERT-Base (110 million parameters) and BERT-Large (340 million parameters), dramatically improved the state of the art across 11 Natural Language Processing tasks, including question answering (SQuAD) and sentiment analysis, establishing a new baseline for the field.
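The Masked Language Model objective mentioned above can be illustrated with a simplified sketch of the corruption step described in the BERT paper: about 15% of positions are selected for prediction, and of those, 80% are replaced with a mask token, 10% with a random token, and 10% are left unchanged. The function name `mlm_mask` and the whitespace tokenization are assumptions made for the example; the real model works on WordPiece tokens and skips special tokens.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mlm_mask(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    select_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); a label is None where no prediction is required."""
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


sentence = "the model reads context from both directions at once".split()
corrupted, labels = mlm_mask(sentence, vocab=sentence, rng=random.Random(0))
print(corrupted)
print(labels)
```

Because the model sees the words on both sides of each selected position, predicting the original token forces it to use left and right context together.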

https://arxiv.org/abs/1810.04805

RoBERTa

RoBERTa (Robustly Optimized BERT Pretraining Approach) is a high-performance language model from Facebook AI that significantly outperforms BERT by optimizing the pretraining strategy, not the core architecture.

RoBERTa is a robustly optimized version of the BERT model, developed by researchers at Facebook AI in 2019. The team conducted a replication study, finding that BERT was significantly undertrained and could achieve state-of-the-art results with a refined recipe: they removed the Next Sentence Prediction (NSP) objective, implemented dynamic masking, and scaled up training dramatically. Specifically, RoBERTa trained for 500K steps (up from 100K) on a massive 160GB of text data (ten times BERT’s data) using much larger batch sizes (up to 8K). This optimized approach yielded superior performance on major benchmarks like GLUE, RACE, and SQuAD, establishing RoBERTa as a strong baseline for subsequent language model development.
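The difference between static and dynamic masking can be shown with a toy sketch. With static masking the positions to mask are chosen once during preprocessing and reused, while with dynamic masking a fresh set is drawn every time a sequence is fed to the model. The function names and the 15% rate below are illustrative assumptions, not code from the RoBERTa release.

```python
import random
from typing import List


def mask_positions(n_tokens: int, rng: random.Random, prob: float = 0.15) -> List[int]:
    """Pick the positions to mask for one pass over a sequence."""
    return [i for i in range(n_tokens) if rng.random() < prob]


def static_masks(n_tokens: int, epochs: int, seed: int) -> List[List[int]]:
    # Static: the mask is chosen once and the same positions are reused every epoch.
    fixed = mask_positions(n_tokens, random.Random(seed))
    return [fixed for _ in range(epochs)]


def dynamic_masks(n_tokens: int, epochs: int, seed: int) -> List[List[int]]:
    # Dynamic: a new mask is drawn each time the sequence is seen.
    rng = random.Random(seed)
    return [mask_positions(n_tokens, rng) for _ in range(epochs)]


print(static_masks(20, epochs=3, seed=0))
print(dynamic_masks(20, epochs=3, seed=0))
```

Over many epochs, dynamic masking exposes the model to more distinct prediction targets per sequence, which matters more as the number of training steps grows.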

https://arxiv.org/abs/1907.11692
