AI and music - entertaining people's ears with AI | Toronto

December 03, 2025 · Toronto

Xing Xing: Efficient AI Music

Presents efficient AI music generation via vocal-conditioned diffusion and showcases a scalable karaoke app built from various off-the-shelf AI tools.

Tech stack
  • PyTorch
    PyTorch is an open-source machine learning framework: it provides a Python-first tensor library with strong GPU acceleration and a dynamic computation graph for building deep neural networks.
    PyTorch, originally developed by Meta AI, is an open-source deep learning framework widely used in both research and production. Its core is a tensor library (similar to NumPy) that can run on GPUs, which often yields large speedups over CPU execution for heavy numerical workloads; the actual gain depends on the model and hardware. The key differentiator is its 'Pythonic' design and dynamic computation graph (eager execution), which allows for rapid prototyping and simpler debugging than static-graph frameworks. Its Autograd system provides automatic differentiation, so practitioners can build and train models for computer vision and NLP; major companies such as Tesla (Autopilot) and Microsoft use PyTorch for critical AI applications.
  • Demucs
    Demucs: An AI model for music source separation that uses a Hybrid Transformer architecture to isolate individual audio stems.
    Demucs (Deep Extractor for Music Sources) is an open-source model developed by Meta AI (Facebook Research) for high-fidelity audio source separation. Early versions operated directly on the raw waveform rather than on spectrograms, with the aim of minimizing artifacts. The latest version, Hybrid Transformer Demucs (HTDemucs), works in both domains: it combines a dual-domain (waveform and spectrogram) U-Net with a cross-domain transformer. Its authors report up to 9.20 dB SDR on the MUSDB HQ test set, a benchmark for separating music into four constituent stems: vocals, drums, bass, and other; that top figure was obtained with sparse attention and per-source fine-tuning. This makes it a common choice for musicians and researchers who need clean stems for remixing or analysis.
  • FastAPI
    FastAPI is a modern, high-performance Python web framework for building APIs with automatic OpenAPI documentation.
    FastAPI is a Python web framework built on Starlette (for async capabilities) and Pydantic (for data validation and serialization). Using standard Python type hints, the framework automatically generates interactive API documentation (Swagger UI/ReDoc) and enforces data validation. The project's own documentation estimates that this reduces developer-induced errors by about 40% and speeds up feature development by roughly 200% to 300%; these are self-reported estimates, not independent benchmarks. The project also describes its performance as on par with Node.js and Go. It is production-ready and builds on the OpenAPI and JSON Schema standards for its API specifications.
  • Next.js
    Next.js is a full-stack React framework: it delivers high-performance web applications via hybrid rendering and Rust-based tooling.
    Next.js is a React framework for production that lets you build full-stack web applications with little or no initial configuration. It supports a hybrid rendering approach (Server-Side Rendering, Static Site Generation, and Incremental Static Regeneration) aimed at good load times and SEO. Key features include React Server Components, Server Actions for running server code directly, and the App Router for advanced routing and nested layouts. Developed by Vercel, it relies on Rust-based tools such as Turbopack and SWC (Speedy Web Compiler) for fast builds.
  • OpenAI Whisper
    Whisper is OpenAI's open-source Automatic Speech Recognition (ASR) system, trained on 680,000 hours of diverse audio.
    Whisper is a general-purpose ASR model from OpenAI. It was trained on 680,000 hours of multilingual, multitask data, which makes it robust to accents, background noise, and technical language. The model is an encoder-decoder Transformer (sequence-to-sequence) architecture that handles multiple tasks: multilingual transcription, speech-to-English translation, and language identification. The open-source release includes several model sizes (tiny, base, small, medium, large), so developers can trade transcription speed against accuracy; OpenAI describes the model as approaching human-level accuracy on English speech recognition.
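
The Demucs entry above quotes a score in dB SDR (signal-to-distortion ratio). As a rough illustration of what that unit measures, the sketch below computes a simplified SDR, 10 * log10(||s||^2 / ||s - s_hat||^2), in plain Python. The function name sdr_db and the toy sine-wave signals are assumptions made for this example; published benchmark figures are usually computed with a dedicated evaluation toolkit and a stricter protocol, so this does not reproduce the 9.20 dB result.

```python
import math


def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB.

    `reference` is the ground-truth stem and `estimate` is the separated
    stem; both are equal-length sequences of float samples.
    (Hypothetical helper for illustration, not part of Demucs.)
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Energy of the true signal and of the separation error.
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))

    if signal_energy == 0.0:
        raise ValueError("reference is silent; SDR is undefined")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction

    return 10.0 * math.log10(signal_energy / error_energy)


# Toy data (assumed): one second of a 440 Hz tone at 8 kHz as the "true"
# vocal stem, and an estimate that is the same tone at 90% amplitude.
sample_rate = 8000
reference = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
estimate = [0.9 * r for r in reference]

score = sdr_db(reference, estimate)
print(f"SDR: {score:.2f} dB")
```

Here the estimate is the reference scaled by 0.9, so the error carries 1% of the signal energy and the score comes out at about 20 dB. Higher values mean the separated stem is closer to the original.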
