SLM Fine-tuning on 16GB CPU
This talk details a practical workflow for supervised fine-tuning of a small language model on a standard CPU-only laptop with 16GB of RAM, covering dataset formatting, training, and inference.
I will present a practical workflow for performing supervised fine-tuning of a small language model (such as SmolLM2-360M-Instruct) on an ordinary laptop (CPU-only, 16GB RAM). The talk covers instruction dataset formatting, supervised fine-tuning, and inference. A brief before-and-after comparison will show the effect of the fine-tuning. In particular:
(1) instruction dataset formatting - The dataset is a JSONL file (one JSON object per line), and each line maps a requirement to a Python function call (that is, the function name and its arguments). Each line is in the following format:
===========
{"instruction": …, "response": …}
===========
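As a minimal sketch of the format described in (1), the Python snippet below writes and re-reads a two-record JSONL dataset. The key names `instruction` and `response` come from the talk; the record contents and the helper names `to_jsonl` and `from_jsonl` are hypothetical and only for illustration.

```python
import json

# Hypothetical records: the requirement texts and function-call strings are
# invented for illustration; only the two key names come from the talk.
records = [
    {"instruction": "Turn on the living room lights",
     "response": "set_light(room='living_room', state='on')"},
    {"instruction": "Set a timer for ten minutes",
     "response": "start_timer(minutes=10)"},
]


def to_jsonl(rows):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"


def from_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        missing = {"instruction", "response"} - row.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing keys: {sorted(missing)}")
        rows.append(row)
    return rows


jsonl_text = to_jsonl(records)
parsed = from_jsonl(jsonl_text)
print(len(parsed), "records parsed")
```

Validating the keys at load time is a design choice of this sketch: a malformed line fails early instead of surfacing later as a tokenization error.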
(2) supervised fine-tuning - the training pipeline contains the following steps:
- Handle command-line arguments – argparse.ArgumentParser()
- Load tokenizer – tokenizer = AutoTokenizer.from_pretrained(…)
- Load model – model = AutoModelForCausalLM.from_pretrained(…, device_map="cpu", low_cpu_mem_usage=True)
- Prepare model for PEFT/LoRA – get_peft_model(model, peft.LoraConfig(…))
- Load dataset (JSONL with 'instruction' and 'response' keys) – dataset = load_dataset(…)
- Tokenize dataset and create labels with prompt masking – tokenized = dataset.map(…) with custom functions
- Convert to torch tensors (because the trainer expects tensors) – transformers.DataCollatorForLanguageModeling(…)
- Set up TrainingArguments – training_args = TrainingArguments(…)
- Run training (~20 minutes per epoch for a small instruction dataset of 100 samples on an average CPU-only laptop) – transformers.Trainer(…).train()
- Save the LoRA-based fine-tuned model (weights and configuration) – model.save_pretrained(…)
- Save the tokenizer and tokenization configuration (everything needed to turn text into model input) – tokenizer.save_pretrained(…)
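The prompt-masking step can be illustrated without any libraries. This sketch assumes the common convention of setting prompt and padding positions in the labels to -100, which PyTorch's cross-entropy loss ignores by default, so only response tokens contribute to the loss. The token ids, the function `build_example`, and its parameters are hypothetical; the talk's own tokenization functions are not shown in the abstract.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, response_ids, max_length, pad_id):
    """Concatenate prompt and response ids, mask the prompt in the labels, then pad."""
    # Model input: prompt followed by response, truncated to max_length.
    input_ids = (prompt_ids + response_ids)[:max_length]
    # Labels: prompt positions are ignored, response positions keep their ids.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad all three sequences to the same fixed length.
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


# Toy ids standing in for real tokenizer output.
example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22],
                        max_length=8, pad_id=0)
print(example["labels"])
```

A function like this could be called from `dataset.map(…)` after tokenizing the prompt and the response separately.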
===========
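The training steps might be wired together roughly as follows. This is a sketch under stated assumptions, not the speaker's script: every hyperparameter value, the `target_modules` names, and the helper names `lora_kwargs`, `training_kwargs`, and `train` are illustrative guesses. It also swaps in `DataCollatorForSeq2Seq` for the collator named in the talk, chosen here because it pads a precomputed `labels` column with -100. Heavy imports live inside `train` so the file can be imported without Transformers and PEFT installed.

```python
def lora_kwargs():
    """Illustrative LoRA settings (assumed values, not taken from the talk)."""
    return dict(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumes Llama-style layer names
        task_type="CAUSAL_LM",
    )


def training_kwargs(output_dir):
    """Illustrative TrainingArguments for a CPU-only run (assumed values)."""
    return dict(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="no",
        report_to="none",
        use_cpu=True,  # keep training on the CPU
    )


def train(model_id, tokenized_dataset, output_dir):
    """Fine-tune with LoRA; tokenized_dataset needs input_ids, attention_mask, labels."""
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # padding needs a pad token

    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="cpu", low_cpu_mem_usage=True)
    model = get_peft_model(model, LoraConfig(**lora_kwargs()))
    model.print_trainable_parameters()

    # Pads input_ids and attention_mask, and pads labels with -100.
    collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(**training_kwargs(output_dir)),
        train_dataset=tokenized_dataset,
        data_collator=collator,
    )
    trainer.train()

    model.save_pretrained(output_dir)      # LoRA adapter weights and config
    tokenizer.save_pretrained(output_dir)  # tokenizer files


# Example call (downloads the model, so it is not run here):
# train("HuggingFaceTB/SmolLM2-360M-Instruct", tokenized, "out/lora-adapter")
```

The model id in the commented call is the Hugging Face Hub name assumed for the SmolLM2-360M-Instruct model mentioned in the abstract; the output path is hypothetical.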
(3) inference and evaluation - the inference pipeline contains the following steps:
- Load tokenizer – AutoTokenizer.from_pretrained(…)
- Load base model – AutoModelForCausalLM.from_pretrained(…, torch_dtype=torch.float16, device_map="auto")
- Attach LoRA adapter – model = PeftModel.from_pretrained(base_model, args.output_dir)
- Generate the formatted prompt – prompt = "< begin >\n" + … + "\n"
- Tokenize the input – inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- Perform inference – outputs = model.generate(…)
- Generate the text response – response = tokenizer.decode(outputs[0], skip_special_tokens=True)
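The inference steps above could be sketched as below. This is an illustration under assumptions rather than the speaker's code: the function names, the `max_new_tokens` default, and greedy decoding are choices made for this example, and the tokenizer is assumed to have been saved in the adapter directory. It omits the `torch_dtype` and `device_map` arguments from the list for brevity. It also shows one optional refinement: slicing off the prompt tokens so that only the newly generated text is decoded.

```python
def completion_only(output_ids, prompt_length):
    """Drop the echoed prompt tokens; works on lists and on 1-D tensors."""
    return output_ids[prompt_length:]


def generate_response(base_model_id, adapter_dir, prompt, max_new_tokens=64):
    """Load base model plus LoRA adapter and return the generated completion."""
    # Heavy imports are kept local so this file imports without the libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)

    # generate() returns prompt + completion for decoder-only models.
    prompt_length = inputs["input_ids"].shape[1]
    new_ids = completion_only(outputs[0], prompt_length)
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# Toy check of the slicing helper with plain lists.
print(completion_only([101, 102, 103, 7, 8], prompt_length=3))
```

For a before-and-after comparison, the same prompt can be run once through the base model alone and once with the adapter attached.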
===========