This page is an online demo of our recent research results on monaural musical sound source separation. The full presentation of the method and results is in our paper entitled:
"MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation"
and presented at the IEEE International Joint Conference on Neural Networks / World Congress on Computational Intelligence (IJCNN/WCCI) 2018. Get the BibTeX record here.
Our work is about separating a single musical source from a musical mixture. That is, given a single-channel (i.e. monaural) musical mixture (i.e. a song), our method extracts a single source from that mixture, as closely as possible to the original track of that source.
For evaluating our method, we focus on singing voice separation. That is, given a musical mixture, we separate the singing voice from it, as closely as possible to the original track of the singing voice.
Not interested in the details and just want the demo? Click here to go to the demonstration!
Our method is based on our previously proposed Masker-Denoiser architecture, augmented with the recently proposed Twin Network. Hence the name: "MaD" comes from "Masker-Denoiser" and "TwinNet" from the Twin Network.
Below you can see an illustration of our method.
Our method accepts the audio mixture as input. After the initial pre-processing (e.g. the time-frequency transformation; more details in the paper), the Masker accepts the magnitude spectrogram of the mixture as input.
Then, the Masker predicts and applies a time-frequency mask to its input and outputs a first estimate of the magnitude spectrogram of the singing voice. This first estimate is then given as an input to the Denoiser.
The Denoiser predicts and applies a time-frequency denoising filter. This filter aims at removing interferences, artifacts, and (in general) any other noise introduced to the first estimate of the singing voice.
After the Denoiser applies the denoising filter, the now-cleaned estimate of the magnitude spectrogram of the singing voice is turned back into audio samples by the output processing.
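The pipeline above can be sketched in a few lines. This is a minimal, illustrative sketch only: the `masker` and `denoiser` functions below are hand-written placeholders standing in for the trained networks (they are not the actual model), and SciPy's STFT/ISTFT stands in for the input and output processing described in the paper.

```python
import numpy as np
from scipy.signal import stft, istft

def masker(mag):
    # Placeholder for the trained Masker network: a fixed soft
    # time-frequency mask in (0, 1), applied to the mixture magnitudes.
    mask = 1.0 / (1.0 + np.exp(-(mag - mag.mean())))
    return mask * mag

def denoiser(voice_est):
    # Placeholder for the trained Denoiser network: a denoising
    # filter in [0, 1], applied element-wise to the first estimate.
    filt = np.clip(voice_est / (voice_est.max() + 1e-8), 0.0, 1.0)
    return filt * voice_est

fs = 44100
mixture = np.random.randn(fs)  # stand-in for one second of a real song

# Input processing: time-frequency transform of the mixture.
_, _, mix_tf = stft(mixture, fs=fs, nperseg=2048)
mag, phase = np.abs(mix_tf), np.angle(mix_tf)

# Masker produces a first voice estimate; Denoiser cleans it up.
voice_mag = denoiser(masker(mag))

# Output processing: back to audio samples, reusing the mixture phase.
_, voice = istft(voice_mag * np.exp(1j * phase), fs=fs, nperseg=2048)
```

The key structural point is the two-stage design: the Masker works on the mixture, while the Denoiser only ever sees the Masker's output.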
The recently proposed TwinNet is an effective way to make a recurrent neural network (RNN) anticipate upcoming patterns, i.e. to let the RNN respond better to the future by learning global structures of the singing voice.
We adopted the TwinNet as a way to make the MaD learn to anticipate the strong temporal patterns and structures of music.
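The TwinNet idea can be illustrated with a toy example. In this sketch, two tiny vanilla RNNs (stand-ins for the actual networks, with made-up sizes) read the same sequence forwards and backwards, and a regularisation term pushes each forward hidden state towards an affine map of the corresponding backward state, which encodes the future of the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 10, 4, 8  # toy sequence length and layer sizes

x = rng.normal(size=(T, d_in))
W_f, U_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_b, U_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
M = rng.normal(size=(d_h, d_h))  # maps backward (twin) states to forward ones

def run_rnn(seq, W, U):
    # A plain tanh RNN, returning the hidden state at every time step.
    h, states = np.zeros(d_h), []
    for x_t in seq:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

h_fwd = run_rnn(x, W_f, U_f)               # forward pass over time
h_bwd = run_rnn(x[::-1], W_b, U_b)[::-1]   # twin runs backwards in time

# TwinNet regulariser: forward states should predict the mapped
# backward states, so the forward RNN learns to anticipate the future.
twin_loss = np.mean((h_fwd - h_bwd @ M.T) ** 2)
```

During training, `twin_loss` would be added to the main separation loss; at test time the backward twin is discarded entirely.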
Below you can actually listen to the performance of our method! We have a set of songs and, for each one, we offer for listening the original mixture (i.e. the song), the original voice, and the voice as separated by our method.
It must be mentioned that we did not apply any kind of extra post-processing to the files. You will hear the actual, unprocessed output of our method.
What Have You Done To Me | Back From The Start | Melodic Indie Rock
James Elder & Mark M Thompson | The English Actor | Atmospheric Indie Pop
In other words: what data did our method learn from, what data was it tested on, and how well did it perform from an objective perspective?
In order to train our method, we used the development subset of the Demixing Secret Dataset (DSD), which consists of 50 mixtures with their corresponding sources, plus music stems from MedleyDB.
For testing our method, we used the testing subset of the DSD, consisting of 50 mixtures and their corresponding sources.
We objectively evaluated our method using the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). The results can be seen in the table below.
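To give a feel for what these numbers mean, here is a minimal sketch of an SDR computation. Note it is a simplification for illustration: it only projects the estimate onto a scaled reference, whereas the full BSS Eval procedure used for the reported results also accounts for interference and artifact components separately.

```python
import numpy as np

def sdr(reference, estimate):
    # Signal-to-distortion ratio in dB: the part of the estimate
    # explained by a scaled reference, versus everything else.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    distortion = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))

rng = np.random.default_rng(1)
voice = rng.normal(size=44100)                    # stand-in ground truth
estimate = voice + 0.1 * rng.normal(size=44100)   # estimate with added noise

score = sdr(voice, estimate)
```

With noise at one tenth of the signal's amplitude (one hundredth of its power), the score lands around 20 dB; higher SDR means a cleaner separation.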
We would like to kindly acknowledge all those who supported and helped us with this work.