MuLan
MuLan is a joint audio-text embedding model that links music recordings to natural language descriptions using a dual-encoder architecture.
Developed by Google Research, MuLan maps unaligned audio and text into a shared 128-dimensional embedding space. The model uses two distinct towers (a ResNet-50 for audio and a BERT-base for text) trained with a contrastive objective on roughly 44 million music clips totaling 370,000 hours of audio. Because the towers are trained contrastively on weakly paired data, MuLan supports zero-shot music tagging and cross-modal retrieval without task-specific labeled data. This technology powers advanced music understanding tasks, allowing systems to identify fine-grained genres or moods (e.g., 'lo-fi hip hop for studying') directly from raw waveforms.
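To illustrate how such a shared embedding space supports zero-shot tagging, the sketch below scores candidate tag descriptions against an audio clip by cosine similarity and picks the best match. It is a minimal sketch, not MuLan's implementation: embed_audio and embed_text are hypothetical placeholders standing in for the real ResNet-50 and BERT towers, and the 128-dimensional unit-norm embeddings mirror the shared space described above.

```python
# Minimal sketch of MuLan-style zero-shot tagging in a shared embedding
# space. The encoder functions are hypothetical placeholders; the real
# model uses a ResNet-50 audio tower and a BERT-base text tower.
import numpy as np

EMBED_DIM = 128  # dimensionality of MuLan's shared embedding space


def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder audio tower: returns a deterministic unit-norm vector."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % 2**32)
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)


def embed_text(description: str) -> np.ndarray:
    """Placeholder text tower: returns a deterministic unit-norm vector."""
    rng = np.random.default_rng(abs(hash(description)) % 2**32)
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)


def zero_shot_tag(waveform: np.ndarray, candidate_tags: list[str]) -> str:
    """Return the tag whose text embedding is closest to the audio embedding."""
    audio_vec = embed_audio(waveform)
    tag_vecs = np.stack([embed_text(t) for t in candidate_tags])
    # Dot product equals cosine similarity because all vectors are unit-norm.
    scores = tag_vecs @ audio_vec
    return candidate_tags[int(np.argmax(scores))]


clip = np.zeros(16000, dtype=np.float32)  # stand-in for a 1-second waveform
tags = ["lo-fi hip hop for studying", "uplifting orchestral score"]
print(zero_shot_tag(clip, tags))
```

Because any natural language string can serve as a candidate tag, the same similarity search also supports free-form cross-modal retrieval: embed a text query once, then rank a library of precomputed audio embeddings against it.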