Transformer Lab: Local to Distributed ML

See a demo of scaling ML training from a local notebook to a GPU cluster, covering checkpoint recovery, hyperparameter sweeps, and unified experiment tracking.

Overview

Our CEO and co-founder, Ali Asaria, will present. He is our main technical subject-matter expert on this topic and is best positioned to answer questions. This will not be a pitch: the audience will get technical insights, how-tos, and best practices we’ve learned working with top research labs around the world.

We’ll demo the tool we built to scale from a local Jupyter notebook to a distributed training run across a cluster of GPUs. We’ll cover how we handle the “boring but critical” parts of the training workflow: automatic checkpoint recovery for spot instances, one-line hyperparameter sweeps, and unified experiment tracking that works across AMD, NVIDIA, and Apple Silicon.
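To give a flavor of what checkpoint recovery on spot instances involves, here is a minimal, generic sketch in plain Python. It is not Transformer Lab's API or implementation; the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON checkpoint format, and the `stop_at` parameter that simulates a preemption are all assumptions made for illustration. The idea it shows is simply: write checkpoints atomically, and always start a run by loading the latest one.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, save_every=10, stop_at=None):
    """Toy loop that resumes from the last checkpoint.

    stop_at is a hypothetical knob that simulates a spot-instance preemption.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state  # simulated interruption: unsaved progress is lost
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real training step
        if state["step"] % save_every == 0 or state["step"] == total_steps:
            save_checkpoint(path, state)
    return state
```

Calling `train(path, 50, stop_at=25)` and then `train(path, 50)` with the same path resumes the second run from the last saved step rather than from zero. A real training job would save model and optimizer state instead of a small dictionary, but the write-then-rename pattern stays the same.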

Links

Tech stack