The Sound of Pixels

We introduce PixelPlayer, a system that, by watching large amounts of unlabeled video, learns to locate the image regions that produce sounds and to separate the input audio into a set of components, each representing the sound from one pixel. Our approach exploits the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision.

The system is trained on a large number of videos of people playing instruments in different combinations, including solos and duets. No supervision is provided on which instruments are present in each video, where they are located, or how they sound. At test time, the input to the system is a video showing people playing different instruments, together with its mono audio track. Our system performs audio-visual source separation and localization, splitting the input sound signal into N sound channels, each corresponding to a different instrument category. In addition, the system can localize the sounds and assign a different audio waveform to each pixel in the input video.
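To make the idea of "a different audio waveform for each pixel" concrete, here is a minimal sketch of one way per-pixel separation could work. It is an illustration under stated assumptions, not the released implementation: this page does not describe the model internals, so the sketch assumes a mask-based formulation in which the audio is decomposed into K component maps over a spectrogram and each pixel carries a K-dimensional visual feature that weights those components. All names (`pixel_mask`, `separate_pixel`, the toy data) are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a real number into (0, 1) so it can act as a soft mask value."""
    return 1.0 / (1.0 + math.exp(-x))


def pixel_mask(pixel_feature, audio_components):
    """Build a soft spectrogram mask for one pixel.

    pixel_feature:    list of K weights (assumed visual feature of the pixel).
    audio_components: K maps, each a freq x time nested list (assumed audio features).
    """
    k = len(pixel_feature)
    n_freq = len(audio_components[0])
    n_time = len(audio_components[0][0])
    return [
        [
            sigmoid(sum(pixel_feature[i] * audio_components[i][f][t] for i in range(k)))
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]


def separate_pixel(mixture, pixel_feature, audio_components):
    """Apply the pixel's mask to the mixture magnitude spectrogram."""
    mask = pixel_mask(pixel_feature, audio_components)
    return [
        [m * x for m, x in zip(mask_row, mix_row)]
        for mask_row, mix_row in zip(mask, mixture)
    ]


# Toy data: a 2 x 2 (freq x time) mixture and K = 2 audio components.
# Component 0 is active in the low-frequency row, component 1 in the high one.
mixture = [[1.0, 2.0], [3.0, 4.0]]
components = [
    [[5.0, 5.0], [-5.0, -5.0]],
    [[-5.0, -5.0], [5.0, 5.0]],
]

# Two hypothetical pixels, each responding to a different component.
out_a = separate_pixel(mixture, [1.0, 0.0], components)
out_b = separate_pixel(mixture, [0.0, 1.0], components)
```

In this toy setup, the first pixel keeps almost all of the low-frequency row and almost none of the high-frequency row, and the second pixel does the opposite. A real system would learn both the pixel features and the audio components from video, and would convert the masked spectrogram back to a waveform.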

New! Follow-up Projects

Check out our recent follow-up projects:

Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba. Self-supervised Moving Vehicle Tracking with Stereo Sound (ICCV 2019) arXiv:1910.11760
Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba. Self-Supervised Audio-Visual Co-Segmentation (ICASSP 2019) arXiv:1904.09013
Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba. The Sound of Motions (ICCV 2019) arXiv:1904.05979

Interactive Demo

In this interactive demo, the input video is shown on the left. Click on any location in the video on the right to hear the sound component associated with that location. (The demo is not yet well supported on mobile devices.)

Input video to PixelPlayer:

Click on a pixel to hear its sound:

Video clips are credited to the original YouTube videos: [1] [2] [3]

Paper

Dataset

MUSIC dataset of instrument recordings: version 1.0 is available now on GitHub.

Code

Code is released on GitHub.


    @InProceedings{Zhao_2018_ECCV,
      author = {Zhao, Hang and Gan, Chuang and Rouditchenko, Andrew and Vondrick, Carl and McDermott, Josh and Torralba, Antonio},
      title = {The Sound of Pixels},
      booktitle = {The European Conference on Computer Vision (ECCV)},
      month = {September},
      year = {2018}
    }
