Technology
LoRA/GRPO (fine-tuning from execution traces)
LoRA/GRPO optimizes large language models by applying Group Relative Policy Optimization to low-rank adapters using execution traces as verifiable reward signals.
This stack combines Low-Rank Adaptation (LoRA) with Group Relative Policy Optimization (GRPO) to refine model reasoning without the heavy compute of full parameter updates. By scoring outputs against execution traces (step-by-step logs of code or logic runs), the system rewards objective, verifiable success rather than subjective preference. GRPO, popularized by DeepSeek's reasoning models, drops PPO's separate critic network and instead normalizes rewards within a group of sampled completions; combined with LoRA's small adapter updates, this slashes VRAM requirements while boosting performance on benchmarks like GSM8K and HumanEval. It is a high-efficiency play for teams that need specialized reasoning capabilities on consumer-grade hardware (24 GB VRAM) using precise, trace-based feedback loops.
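The core of GRPO's critic-free design can be shown in a few lines: each sampled completion gets a verifiable reward from its execution trace, and advantages are computed relative to the group's own statistics rather than a learned value model. Below is a minimal sketch; `trace_reward` is a hypothetical stand-in for a real trace-verification harness, not part of any specific library.

```python
import statistics

def trace_reward(candidate_output, expected):
    # Hypothetical verifiable reward: 1.0 if the executed trace
    # produced the expected result, else 0.0 (binary pass/fail).
    return 1.0 if candidate_output == expected else 0.0

def grpo_advantages(rewards):
    # Group-relative advantage: normalize each reward against the
    # group's mean and standard deviation, replacing PPO's critic.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# A group of 4 sampled completions scored by trace success.
rewards = [trace_reward(o, 42) for o in [42, 41, 42, 7]]
advs = grpo_advantages(rewards)  # [1.0, -1.0, 1.0, -1.0]
```

These advantages then weight the policy-gradient update applied only to the LoRA adapter parameters, which is what keeps the memory footprint small.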