Gamified Reality | San Francisco

April 30, 2024 · San Francisco

Gamified Reality

See a real-time gamified reality app that uses fast, small multimodal models to recognize real-world activities and award points, showcasing efficient computer vision.
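
The listing does not say how the app turns model output into points, so the following is only a minimal sketch of one way such a step could work. It assumes a model emits an activity label with a confidence score for each frame. The `POINTS` table, the `Scoreboard` class, the confidence threshold, and the cooldown are all hypothetical names and values chosen for illustration, not details from the talk.

```python
from dataclasses import dataclass, field

# Hypothetical point values per recognized activity (not taken from the talk).
POINTS = {"recycling": 10, "cycling": 5, "walking": 2}


@dataclass
class Scoreboard:
    """Awards points for recognized activities, with a per-activity cooldown."""

    min_confidence: float = 0.6   # assumed threshold for trusting a prediction
    cooldown_s: float = 30.0      # assumed minimum gap between repeat awards
    total: int = 0
    _last_award: dict = field(default_factory=dict, repr=False)

    def observe(self, label: str, confidence: float, timestamp: float) -> int:
        """Process one prediction and return the points awarded (0 if none)."""
        # Ignore low-confidence predictions and activities with no point value.
        if confidence < self.min_confidence or label not in POINTS:
            return 0
        # Skip repeat awards for the same activity inside the cooldown window,
        # so a single activity seen across many frames is only scored once.
        last = self._last_award.get(label)
        if last is not None and timestamp - last < self.cooldown_s:
            return 0
        self._last_award[label] = timestamp
        self.total += POINTS[label]
        return POINTS[label]
```

A caller would create one `Scoreboard` per player and call `observe(label, confidence, timestamp)` for every prediction. The cooldown is one simple design choice for a real-time stream, where the same activity is typically detected in many consecutive frames.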

Tech stack
  • GPT-V
    OpenAI's fifth-generation multimodal LLM: GPT-5 unifies fast processing and deep reasoning, scoring 94.6% on the AIME 2025 mathematics benchmark.
    GPT-5 (Generative Pre-trained Transformer 5) is OpenAI's flagship multimodal foundation model, launched August 7, 2025. Its unified architecture automatically routes each query to 'Fast Mode' for routine tasks or 'Deep Reasoning' for complex challenges, so the mode does not have to be chosen by hand. Reported results include 74.9% accuracy on real-world coding tasks (SWE-bench Verified) and a state-of-the-art score on the GDPval knowledge work benchmark, where it performs at or above human expert level in 70.9% of comparisons. The model is available through the OpenAI API or through ChatGPT Plus (using the gpt-5-thinking-pro variant).
  • NVIDIA Tesla T4
    The T4 is NVIDIA's universal, low-profile, 70-watt GPU accelerator, purpose-built for high-efficiency AI inference, video processing, and scale-out cloud workloads.
    The NVIDIA T4 is an energy-efficient accelerator for mainstream data center servers. Built on the Turing architecture, it has 2,560 CUDA cores and 320 multi-precision Turing Tensor Cores, delivering up to 130 INT8 TOPS for AI inference. Its single-slot, low-profile PCIe form factor and 70 W power envelope allow high-density deployment in most servers. The T4 suits real-time AI tasks (such as conversational AI), high-throughput video workloads (decoding up to 38 full-HD streams), and virtual desktop infrastructure (VDI), which makes it a versatile option for hybrid cloud deployments.
  • Multimodal Models
    AI systems that process and integrate multiple data modalities, such as text, image, and audio, to achieve context-aware understanding.
    Multimodal models map different data types (text, video, audio) into a shared representation, which lets them reason and generate across modalities. Long-context models such as Google's Gemini 1.5 Pro accept contexts of up to 2 million tokens, enough for an entire codebase or roughly two hours of video at once. This capability drives real-world applications: a GPT-4o-powered agent can analyze a customer's voice tone and a screenshot together, and a vision-language model can produce a detailed description of an image when given only a simple text prompt. The technology moves AI beyond single-input limitations toward more versatile systems.
  • Computer Vision
    Computer Vision (CV) uses deep learning and neural networks (e.g., CNNs) to enable machines to interpret visual data: identifying objects, people, and patterns with high accuracy.
    CV is a core AI subfield that approximates aspects of human sight: deep learning models process raw images and video to extract meaningful, actionable data. It is used across major industries, including object detection in autonomous vehicles, defect detection in manufacturing quality control, and diagnostic support in medical imaging. These uses improve operational efficiency and safety, and the overall market is expected to reach $58.29 billion by 2030 (Grand View Research). CV automates tasks that traditionally required human visual inspection.
  • Object Detection
    Object Detection is the computer vision task that locates and classifies specific objects (e.g., cars, pedestrians) in images or video, using models like YOLO or Faster R-CNN to draw precise bounding boxes.
    Object Detection employs deep learning algorithms, primarily Convolutional Neural Networks (CNNs), to perform two tasks at once: object localization and classification. Unlike simple image classification, it draws a bounding box around each detected instance and labels every object in the frame that belongs to a class the model was trained on (e.g., the 80 object categories in the COCO dataset). Single-stage architectures like YOLO (You Only Look Once) prioritize speed for real-time applications (e.g., autonomous driving), while two-stage models like Faster R-CNN often deliver higher mean Average Precision (mAP). This technology is critical for real-world systems: autonomous vehicles use it to track pedestrians and stop signs, and industrial quality control uses it to identify defects on assembly lines.
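
Detection metrics such as mAP depend on deciding whether a predicted bounding box matches a ground-truth box, and that decision is usually made with Intersection over Union (IoU). The sketch below shows the IoU calculation under the assumption that boxes are given as `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates. The function name `iou` and the box layout are illustrative choices, not something specified in the listing.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is assumed to be (x_min, y_min, x_max, y_max).
    Returns a float in [0, 1]; 0.0 when the boxes do not overlap.
    """
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A prediction is commonly counted as a true positive when its IoU with a ground-truth box of the same class reaches a chosen threshold; 0.5 is a frequent choice, and COCO-style evaluation averages over several thresholds.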
