InstructLab LLM Evaluation Playbook

A reproducible workflow for generating synthetic CS data, fine-tuning LLMs, and evaluating models with perplexity, token-level precision/recall/F1, exact match, SBERT similarity, and length diagnostics.

Overview

We contribute to InstructLab, an open-source project from IBM and Red Hat. Our main lesson is that fine-tuning is easy to run and hard to trust. We’ll share a reproducible, taxonomy-driven workflow for generating synthetic CS datasets (Assembly/RISC-V, DSA, Theory of Computation), fine-tuning models, and benchmarking multiple models on the same synthetic data.

We’ll demo our standalone evaluation framework (perplexity, token-level precision/recall/F1, exact match, SBERT semantic similarity, plus length diagnostics) and show results for base vs. tuned models and for cross-model comparisons on a constant dataset. You’ll see where tuning helps (and where it just makes outputs longer), how teacher model choice (InstructLab simple/full vs. GPT/Claude) affects downstream students, and what goes wrong (overfitting, NaN loss, exact-match brittleness), with concrete fixes.
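As a rough illustration of two of the metrics named above, here is a minimal sketch of token-level precision/recall/F1 and exact match. It assumes whitespace tokenization after lowercasing and punctuation stripping; the function names (`normalize`, `exact_match`, `token_prf`) are hypothetical and the project's own framework may normalize or tokenize differently.

```python
from collections import Counter
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_prf(prediction: str, reference: str) -> tuple[float, float, float]:
    """Bag-of-tokens precision, recall, and F1 between prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a perfect match; exactly one empty scores zero.
        score = float(pred == ref)
        return score, score, score
    # Multiset intersection: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f1 = token_prf("A stack is LIFO", "A stack is a LIFO structure.")
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
    print(exact_match("A stack is LIFO", "A stack is a LIFO structure."))
```

In this sketch a short answer that is entirely correct still fails exact match, which is one way the brittleness mentioned above shows up, while padding an answer with extra tokens lowers precision, which is why length diagnostics are worth reporting alongside F1.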

Attendees will leave knowing how to scale synthetic data generation, add new knowledge taxonomies, run longer LoRA schedules responsibly, and report paired, apples-to-apples comparisons.
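One way to make a comparison "paired" is to score both models on the same examples and analyze the per-example differences. The sketch below uses a percentile bootstrap over those differences. This is an assumed, illustrative technique rather than the method the talk necessarily uses, and `paired_bootstrap` and its parameters are hypothetical names.

```python
import random
from statistics import mean


def paired_bootstrap(base_scores, tuned_scores, n_resamples=2000, seed=0):
    """Return (mean difference, 2.5th percentile, 97.5th percentile).

    Scores must be aligned: index i in both lists refers to the same example.
    """
    if len(base_scores) != len(tuned_scores):
        raise ValueError("score lists must be the same length (one pair per example)")
    if not base_scores:
        raise ValueError("need at least one paired example")
    # Per-example improvement of the tuned model over the base model.
    diffs = [t - b for b, t in zip(base_scores, tuned_scores)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample examples with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[min(n_resamples - 1, int(0.975 * n_resamples))]
    return mean(diffs), lower, upper


if __name__ == "__main__":
    # Made-up per-example F1 scores, purely for illustration.
    base = [0.50, 0.62, 0.40, 0.71, 0.55]
    tuned = [0.58, 0.60, 0.52, 0.75, 0.61]
    delta, lower, upper = paired_bootstrap(base, tuned)
    print(f"mean delta={delta:.3f}, 95% interval=[{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the improvement is less likely to be an artifact of which examples happened to be in the evaluation set. Holding the dataset constant across models is what makes the pairing valid.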

Repo: https://github.com/CSC392-CSC492-Building-AI-ML-systems/Autumn2025InstructLab

Tech stack