LLM Probing for Immediate Inference

This talk covers attaching novel heads to an LLM to measure its internal state, which accelerates inference by bypassing token generation. It is a serious technique for improving performance.

Overview

‘Training’ is an ongoing challenge, but ‘Inference’ will be the dominant performance challenge of AI going forward, as signalled by Nvidia’s acquisition of Groq. Autoregressive generation is slow and expensive, and it’s now the dominant ‘bottleneck’. ‘Probing’, or adding novel architectures (heads) onto LLMs, can accelerate inference by measuring the ‘state’ of an LLM, side-stepping the requirement to generate tokens.
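
To make the probing idea concrete, here is a minimal sketch of a linear probe: a tiny logistic-regression head that reads a yes/no property straight from a hidden-state vector, with no token generation. It is an illustration under stated assumptions, not the implementation shown in the talk. The hidden states are synthetic random vectors standing in for real LLM activations, and every name (`make_fake_hidden_states`, `train_probe`, `DIM`, and so on) is hypothetical.

```python
import math
import random

DIM = 16  # assumed hidden size; real LLM hidden states are far larger


def make_fake_hidden_states(n, direction, rng):
    """Synthetic stand-in for LLM activations (an assumption, not real data).

    Each vector is Gaussian noise shifted along `direction`, with the sign
    of the shift set by the binary label the probe should recover.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label else -1.0
        vec = [rng.gauss(0, 1) + sign * d for d in direction]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_predict(weights, bias, vec):
    """The 'head': one dot product and a sigmoid over a hidden state."""
    return sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)


def train_probe(data, dim, epochs=20, lr=0.05):
    """Fit the probe with plain stochastic gradient descent on log loss."""
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, label in data:
            err = probe_predict(weights, bias, vec) - label
            for i in range(dim):
                weights[i] -= lr * err * vec[i]
            bias -= lr * err
    return weights, bias


def accuracy(weights, bias, data):
    hits = sum((probe_predict(weights, bias, v) >= 0.5) == bool(y) for v, y in data)
    return hits / len(data)


rng = random.Random(0)
direction = [rng.gauss(0, 1) for _ in range(DIM)]
train_set = make_fake_hidden_states(200, direction, rng)
test_set = make_fake_hidden_states(100, direction, rng)

weights, bias = train_probe(train_set, DIM)
print(f"probe accuracy on held-out vectors: {accuracy(weights, bias, test_set):.2f}")
```

With a real model, the vectors would instead be activations captured from an intermediate layer during a single forward pass over the prompt. The probe then answers in one dot product what would otherwise require decoding a sequence of tokens.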

I don’t have a super fancy presentation or a clean GitHub repo yet; it’s just lab notes and a demo.

It actually works: this is serious, not just a toy.

Tech stack