MaD TwinNet On-line Demo

This page is an on-line demo of our recent research results on monaural musical sound source separation. A full presentation of the method and the results is given in our paper, entitled:

"MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation"

and presented at the IEEE International Joint Conference on Neural Networks/World Congress on Computational Intelligence (IJCNN/WCCI) 2018. Get the BibTeX record here.

See our paper on arXiv!

Get the weights from Zenodo!

Get the results from Zenodo!

Code based on

Introduction

Our work is about separating a single musical source from a musical mixture. That is, given a single-channel (i.e. monaural) musical mixture (i.e. a song), our method extracts a single source from that mixture, as close as possible to the original track of that source.

To evaluate our method, we focus on singing voice separation. That is, given a musical mixture, we separate the singing voice from the mixture, as close as possible to the original track of the singing voice.

Don't care about the details and just want the demo? Click here to go to the demonstration!

MaD TwinNet

Our method is based on our previously proposed Masker-Denoiser architecture, augmented with the recently proposed Twin Network. Hence the name: "MaD" comes from "Masker-Denoiser" and "TwinNet" from "Twin Network".

Below you can see an illustration of our method.

Illustration of the MaD TwinNet

MaD in a nutshell

Our method takes the audio mixture as its input. After initial pre-processing (e.g. a time-frequency transformation; more information in the paper), the Masker receives the magnitude spectrogram of the mixture as its input.
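The pre-processing step can be illustrated with a minimal sketch of a short-time Fourier transform that yields a magnitude spectrogram. The window type, frame length, hop size, and all function names below are illustrative assumptions, not the settings used in the paper.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft(signal, frame_len=8, hop=4):
    """Naive STFT: a list of frames, each a list of complex frequency bins.

    frame_len and hop are toy values chosen for readability (assumption).
    """
    window = hamming(frame_len)
    n_bins = frame_len // 2 + 1  # keep the non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        bins = [
            sum(chunk[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                for t in range(frame_len))
            for k in range(n_bins)
        ]
        frames.append(bins)
    return frames


def magnitude_and_phase(spec):
    """Split a complex spectrogram into magnitude and phase."""
    mag = [[abs(c) for c in frame] for frame in spec]
    phase = [[cmath.phase(c) for c in frame] for frame in spec]
    return mag, phase


# Toy "mixture": a sinusoid with a period of 4 samples.
mixture = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft(mixture)
mag, phase = magnitude_and_phase(spec)
print(len(mag), "frames,", len(mag[0]), "bins per frame")
```

Only the magnitude is passed on to the separation stage; the phase is kept aside for turning the estimate back into audio.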

Then, the Masker predicts and applies a time-frequency mask to its input and outputs a first estimate of the magnitude spectrogram of the singing voice. This first estimate is then given as an input to the Denoiser.

The Denoiser predicts and applies a time-frequency denoising filter. This filter aims at removing interferences, artifacts, and (in general) any other noise introduced to the first estimate of the singing voice.

After the Denoiser applies the denoising filter, the cleaned estimate of the magnitude spectrogram of the singing voice is converted back to audio samples by the output processing.
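The chain described above (mask, first estimate, denoising filter, cleaned estimate) can be sketched as follows. The two predictor functions are hand-written stand-ins for the learned Masker and Denoiser networks, and reusing the mixture phase for the output is an assumption made here for illustration; see the paper for the actual output processing.

```python
import cmath


def apply_filter(mag, tf_filter):
    """Element-wise product of a magnitude spectrogram and a same-shaped filter."""
    return [[m * f for m, f in zip(mag_row, filt_row)]
            for mag_row, filt_row in zip(mag, tf_filter)]


def mad_forward(mix_mag, predict_mask, predict_denoising_filter):
    """Masker then Denoiser: each one predicts a filter and applies it to its own input."""
    mask = predict_mask(mix_mag)
    first_estimate = apply_filter(mix_mag, mask)
    denoising_filter = predict_denoising_filter(first_estimate)
    return apply_filter(first_estimate, denoising_filter)


def with_phase(mag, phase):
    """Combine a magnitude estimate with a phase (here: the mixture phase, an assumption)."""
    return [[m * cmath.exp(1j * p) for m, p in zip(mag_row, phase_row)]
            for mag_row, phase_row in zip(mag, phase)]


# Hypothetical stand-ins for the trained networks (not the real models).
def toy_mask(mag):
    return [[0.5 for _ in row] for row in mag]


def toy_denoising_filter(mag):
    return [[1.0 if v >= 1.0 else 0.0 for v in row] for row in mag]


mix_mag = [[4.0, 2.0, 1.0], [3.0, 5.0, 0.5]]
mix_phase = [[0.0, 0.5, -0.5], [1.0, -1.0, 0.0]]

voice_mag = mad_forward(mix_mag, toy_mask, toy_denoising_filter)
voice_spec = with_phase(voice_mag, mix_phase)
print(voice_mag)
```

An inverse time-frequency transform of `voice_spec` would then give the audio samples; it is omitted here for brevity.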

And what about TwinNet?

The recently proposed TwinNet is an effective way to make a recurrent neural network (RNN) anticipate upcoming patterns, i.e. to make the RNN respond better to the future, by learning global structures of the singing voice.
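As a rough illustration of the twin idea, the sketch below runs a toy RNN over a sequence in both directions and computes a penalty that pulls the forward hidden states towards the backward ones at the same time step. Everything here is a simplifying assumption: the scalar tanh RNN, the shared weights between the two directions, the scalar affine map, and the function names. The actual regularization used in MaD TwinNet is described in the paper.

```python
import math


def run_rnn(inputs, w_rec=0.5, w_in=1.0):
    """Toy scalar tanh RNN; returns the hidden state after each input."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_rec * h + w_in * x)
        states.append(h)
    return states


def twin_penalty(forward_states, backward_states, scale=1.0, shift=0.0):
    """Mean squared distance between affinely mapped forward states and backward states."""
    diffs = [(scale * f + shift - b) ** 2
             for f, b in zip(forward_states, backward_states)]
    return sum(diffs) / len(diffs)


sequence = [1.0, 0.0, 1.0]
forward = run_rnn(sequence)
# The backward RNN reads the sequence in reverse; its states are flipped back
# so that index t in both lists refers to the same time step.
backward = run_rnn(sequence[::-1])[::-1]

penalty = twin_penalty(forward, backward)
print(penalty)
```

During training, a term like `penalty` would be added to the main loss, so that the forward states are encouraged to carry information about what comes later in the sequence.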

We adopted the TwinNet as a way to make the MaD learn to anticipate the strong temporal patterns and structures of music.

Demo section

Original voice

Song information

Artist	Title	Genre
Leaf	Come around	Atmospheric Indie Pop

Data and objective results

In other words: what data did our method learn from, what data was it tested on, and how well did it perform from an objective perspective?

Dataset

To train our method, we used the development subset of the Demixing Secrets Dataset (DSD), which consists of 50 mixtures with their corresponding sources, plus music stems from MedleyDB.

For testing our method, we used the testing subset of the DSD, consisting of 50 mixtures and their corresponding sources.

Objective results

We objectively evaluated our method using the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). The results are shown in the table below.

The objective evaluation results of our method (in dB)

SDR	SIR	SAR
4.57	8.17	5.95
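To give a feeling for what an SDR value in dB means, here is a minimal sketch of a simplified signal-to-distortion ratio, computed as the energy of the reference over the energy of the error. This is an assumption-laden simplification: it is not the full evaluation procedure behind the table above, which also separates interference and artifacts to obtain SIR and SAR, so its numbers are not comparable to the reported ones.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]  # every sample is off by 10%
print(round(sdr_db(reference, estimate), 2))
```

A higher value means that the estimate is closer to the reference track.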

Acknowledgements

We would like to kindly acknowledge all those who supported and helped us with this work.

Part of the computations leading to these results was performed on a TITAN-X GPU donated by NVIDIA to K. Drossos
K. Drossos and T. Virtanen wish to acknowledge the CSC-IT Center for Science, Finland, for computational resources
D. Serdyuk would like to acknowledge the support of the following agencies for research funding and computing support:
S.-I. Mimilakis is supported by the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no 642685 MacSeNet
The authors would like to thank P. Magron and G. Naithani (TUT, Finland) for their valuable comments and feedback during the writing process