Local Models, Inference, and More | Manizales

January 22, 2025 · Manizales

Ollama Groq Local Inference

Explore deploying and optimizing LLMs locally using Ollama and Groq. Learn quantization, memory optimization, and batching for efficient local inference with real benchmarks.
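
To make the quantization and memory point concrete, here is a rough back-of-the-envelope sketch of how much memory the model weights alone need at different bit widths. It is not taken from the talk. The function name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores the KV cache, activations, and the per-block scale metadata that real quantization formats add.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights only, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    extra overhead for quantization scales, KV cache, or activations.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts for the smallest and largest Llama 2 sizes named below.
for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} @ {bits}-bit: ~{gb:.1f} GB")
```

Under these assumptions a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is the difference between needing a workstation GPU and fitting on a typical laptop.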

Overview
Tech stack
  • Llama-2
    Llama 2 is Meta AI's powerful, openly accessible family of large language models (LLMs), featuring models from 7B to 70B parameters for research and commercial applications.
    Llama 2 is Meta AI's second-generation LLM family, released free of charge for research and commercial use. The collection includes both pre-trained foundation models and instruction-tuned 'Chat' variants, scaling from 7 billion (7B) up to 70 billion (70B) parameters. Key technical upgrades over Llama 1 are training on 2 trillion tokens (40% more data) and a doubled context length of 4096 tokens. The Llama-2-chat models were aligned using Reinforcement Learning from Human Feedback (RLHF), making them a widely used, openly available option for developers building generative AI applications.
  • Mistral
    Frontier AI models (LLMs) from Paris: delivering top-tier performance and efficiency through open-source innovation and optimized architecture.
    Mistral AI is a Paris-based AI startup, founded in April 2023 by former Google DeepMind and Meta researchers (Arthur Mensch, Guillaume Lample, Timothée Lacroix). Its stated mission is to democratize advanced models, with a focus on open-source releases, efficiency, and performance. Its technology includes the 123B-parameter Mistral Large 2 and models built on a sparse Mixture of Experts (MoE) architecture, which the company positions as delivering competitive results at lower cost. It also offers enterprise products (Mistral AI Studio, Le Chat) for custom deployment, fine-tuning, and full data control. The company has been reported at a valuation of about $14 billion.
  • Transformers
    The deep learning architecture that revolutionized sequence modeling (NLP, vision) by replacing recurrent units with a parallelizable multi-head self-attention mechanism.
    The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need." It removed the sequential processing bottleneck of earlier Recurrent Neural Networks (RNNs) by relying solely on attention, which allows training to be parallelized across a sequence and makes it substantially faster on modern hardware. This efficiency enabled large-scale pre-trained models such as BERT (encoder-only) and the generative GPT series (decoder-only). The architecture now underlies nearly all modern Large Language Models (LLMs).
  • Ollama
    Deploy and run open-source Large Language Models (LLMs) like Llama 3 and Mistral locally on your machine: achieve private, cost-effective AI via a simple command-line interface.
    Ollama is a tool for running LLMs locally, often described as a Docker for AI models. It packages models and their dependencies into a single, easy-to-use application for macOS, Linux, and Windows. Models such as Gemma 2 and DeepSeek-R1 are available through a straightforward CLI or REST API. Because inference runs on your own machine, prompts and data stay local, and there are no cloud dependencies or per-request API costs. Ollama also relies on techniques such as quantization so that models run efficiently on consumer hardware, including standard desktops.
  • Docker
    Docker is the open-source platform that packages applications and dependencies into standardized, portable containers for consistent execution across any environment.
    Docker is the industry-standard containerization platform, enabling developers to build, ship, and run applications efficiently. It uses the Docker Engine (the core runtime) to create lightweight, isolated environments called containers: these units bundle an application’s code, libraries, and configuration. This self-contained approach guarantees consistency, eliminating the 'it works on my machine' problem across development, testing, and production environments (local workstations, cloud, or on-premises). Docker debuted in 2013 and now serves over 20 million developers monthly, simplifying complex workflows like CI/CD and microservices architecture by leveraging tools like Docker Hub for image sharing and Docker Compose for multi-container applications.
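
The Ollama entry above mentions a REST API. As a minimal sketch, and not code from the talk, the snippet below builds and sends a request to Ollama's `/api/generate` endpoint using only the Python standard library. It assumes an Ollama server is listening on its default address `http://localhost:11434` and that a model has already been pulled. The model name `llama2` and the helper names `build_generate_request` and `generate` are illustrative choices.

```python
import json
import urllib.request

# Assumption: default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text.

    Requires a running Ollama server with `model` available locally.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Example call (needs a running server, so it is left commented out):
# print(generate("llama2", "Explain quantization in one sentence."))
```

Setting `"stream": False` asks for a single JSON object rather than a stream of partial responses, which keeps the parsing to one `json.loads` call.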