SLM Fine-tuning on 16GB CPU

This talk details a practical workflow for supervised fine-tuning of a small language model on a standard CPU-only laptop with 16GB of RAM, covering dataset formatting, training, and inference.

Overview

I will present a practical workflow for performing supervised fine-tuning of a small language model (such as SmolLM2-360M-Instruct) on an ordinary laptop (CPU-only, 16GB RAM). The talk covers instruction dataset formatting, supervised fine-tuning, and inference. A brief before-and-after comparison will show the effect of the fine-tuning. In particular:
(1) instruction dataset formatting - The dataset is a JSONL file (one JSON object per line). Each line maps a requirement to a Python function call (that is, the function name and its arguments) and has the following format:
===========
{"instruction": , "input": , "output": "<|begin|>\n### Requirement:\n\n### Parsed:\n<function name, actual argument name and value>\n### Python Code:\n<function_name(argument_name=argument_value, ...)>\n<|end|>"}
===========

(2) supervised fine-tuning - the training pipeline contains the following steps:
===========

Handle commandline arguments – argparse.ArgumentParser()
Load tokenizer – tokenizer = AutoTokenizer.from_pretrained(…)
Load model – model = AutoModelForCausalLM.from_pretrained(…, device_map="cpu", low_cpu_mem_usage=True)
Prepare model for PEFT/LoRA using get_peft_model(model, peft.LoraConfig(…))
Load dataset (JSONL with 'instruction', 'input', and 'output' keys) – dataset = load_dataset(…)
Tokenize dataset and create labels with prompt masking – tokenized = dataset.map(…) and customized functions
Convert to torch tensors (because the trainer expects tensors) – transformers.DataCollatorForLanguageModeling(…)
Setup TrainingArguments using training_args = TrainingArguments(…)
Run training (about 20 minutes per epoch for a small instruction dataset of 100 samples on an average CPU-only laptop) – transformers.Trainer.train()
Save the LoRA-based fine-tuned model (weights and configuration) – model.save_pretrained(…)
Save the tokenizer and tokenization configuration (things needed to turn text into model input) – tokenizer.save_pretrained(…)

===========
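
The "create labels with prompt masking" step above can be illustrated without any library. The following is a minimal sketch, not the talk's actual code: it assumes the prompt and the response have already been tokenized into lists of ids, and `build_masked_example`, its parameters, and the token ids are hypothetical names and values chosen for illustration.

```python
# Minimal sketch of prompt masking for causal-LM fine-tuning.
# Assumption: prompt and response are already tokenized into lists of ids;
# build_masked_example and its parameters are hypothetical names.

IGNORE_INDEX = -100  # label value ignored by default in PyTorch's CrossEntropyLoss


def build_masked_example(prompt_ids, response_ids, eos_id, max_length=512):
    """Concatenate prompt and response ids; supervise only the response."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    # Truncate both lists identically so inputs and labels stay aligned.
    input_ids = input_ids[:max_length]
    labels = labels[:max_length]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Made-up token ids, for illustration only.
example = build_masked_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2)
print(example["labels"])  # [-100, -100, -100, 21, 22, 2]
```

Masking the prompt means the model is trained only to produce the response, not to reproduce the requirement text. Whether these labels survive batching depends on the data collator in use, so it is worth inspecting one batch before a long CPU run.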

(3) inference and evaluation - the inference pipeline contains the following steps:

Load tokenizer – AutoTokenizer.from_pretrained(…)
Load base model – AutoModelForCausalLM.from_pretrained(…, torch_dtype=torch.float16, device_map="auto")
Attach LoRA adapter – model = PeftModel.from_pretrained(base_model, args.output_dir)
Generate the formatted prompt – prompt = "<|begin|>\n" + … + "\n"
Tokenize the input – inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
Perform inference – outputs = model.generate(…)
Generate the text response – response = tokenizer.decode(outputs[0], skip_special_tokens=True)

===========
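
For the before-and-after comparison, it can help to reduce each decoded response to just the generated function call. The sketch below assumes the response follows the "### Python Code:" template from the dataset format; `extract_python_code` is a hypothetical helper, and the sample text is made up for illustration rather than taken from a real model run.

```python
# Sketch: pull the function call out of a decoded response that follows the
# "### Python Code:" / "<|end|>" template. extract_python_code is a
# hypothetical helper, not part of any library.

CODE_HEADER = "### Python Code:"
END_MARKER = "<|end|>"


def extract_python_code(response):
    """Return the text after CODE_HEADER, up to END_MARKER if present, else None."""
    _, found, tail = response.partition(CODE_HEADER)
    if not found:
        return None  # the model did not follow the template
    # If END_MARKER is absent, partition leaves the whole tail in `code`.
    code, _, _ = tail.partition(END_MARKER)
    return code.strip()


# Made-up response text, for illustration only (not real model output).
sample = (
    "<|begin|>\n### Requirement:\nadd two and three\n"
    "### Parsed:\nadd, a=2, b=3\n"
    "### Python Code:\nadd(a=2, b=3)\n<|end|>"
)
print(extract_python_code(sample))  # add(a=2, b=3)
```

Running the same requirements through the base model and the fine-tuned model, and counting how often the extracted call matches the expected one, gives a simple numeric version of the comparison. A `None` result is itself informative, since it indicates the model did not follow the template.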

Tech stack