This page is an online demo of our recent research results on monaural harmonic/percussive source separation. The full presentation of the method and results is in our paper entitled
"Harmonic-Percussive Source Separation with Deep Neural Networks and Phase Recovery",
presented at the 16th IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). Get the BibTeX record here.
Code for our methods is available on our GitHub repositories. Don't hesitate to contact us!
Our work is about separating the harmonic from the percussive instruments/components in a music mixture. That is, given a single-channel (i.e. monaural) music mixture (i.e. a song), our method separates the percussive and the harmonic sounds in that mixture. For example, in a band setup with guitar, bass, drums, percussion (e.g. congas), and vocals, our method separates the drums and the percussion from all the rest. Hence the name harmonic/percussive source separation (HPSS).
For convenience, below you can find a brief introduction to the above-mentioned methods. It has to be noted that we offer code for both of our methods, and pre-trained weights where applicable, in order to help reproducibility.
So, feel free to use our methods, visit, star, and clone our GitHub repositories, and enjoy separating sources!
Below you can see an illustration of our proposed method for HPSS.
MaD TwinNet is based on the Masker-Denoiser architecture, augmented with the Twin Network; thus, "MaD" comes from "Masker-Denoiser" and "TwinNet" from "Twin Network". The role of MaD TwinNet in this work is to perform the separation of the percussive and the harmonic components. For a general presentation of MaD TwinNet, you can check the corresponding paper and demo.
The Masker is the first component of MaD TwinNet and accepts as an input the magnitude spectrogram of the mixture. Then, the Masker predicts and applies a time-frequency mask to its input and outputs a first estimate of the magnitude spectrogram of the percussive components. This estimate of the percussive components is then given as an input to the Denoiser.
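As a minimal sketch (not the authors' code), applying a time-frequency mask to a magnitude spectrogram is an element-wise product. The shapes and the random values here are purely illustrative; in MaD TwinNet the mask is predicted by a neural network.

```python
import numpy as np

# Illustrative sketch: a time-frequency mask is an array of values in
# [0, 1] with the same shape as the magnitude spectrogram it is applied to.
rng = np.random.default_rng(0)
mix_mag = rng.random((1025, 100))   # |STFT| of the mixture (freq x time), made up
mask = rng.random((1025, 100))      # hypothetical mask, as the Masker would predict

# Applying the mask is an element-wise product, yielding a first estimate
# of the percussive magnitude spectrogram.
perc_mag_est = mask * mix_mag
```

Since the mask values lie in [0, 1], the mask can only attenuate each time-frequency bin, never amplify it.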
The Denoiser predicts and applies a time-frequency denoising filter to the estimated percussive components. This filter aims at removing interferences, artifacts, and (in general) any other noise introduced into the Masker's output by the separation process.
After the application of the denoising filter by the Denoiser, the now-cleaned estimate of the magnitude spectrogram of the percussive components can be used to estimate the harmonic components. This results in having separated the percussive and harmonic components.
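One simple way to derive a harmonic estimate from the cleaned percussive estimate is subtraction in the magnitude domain. This sketch assumes the source magnitudes are approximately additive; it is an illustration, not necessarily the exact operation used in the paper.

```python
import numpy as np

# Made-up magnitude spectrograms (freq x time), for illustration only.
rng = np.random.default_rng(1)
mix_mag = rng.random((1025, 100)) + 1.0   # magnitude spectrogram of the mixture
perc_mag = rng.random((1025, 100))        # cleaned percussive estimate (Denoiser output)

# Subtract the percussive estimate from the mixture magnitude, clipping at
# zero because magnitudes cannot be negative.
harm_mag = np.clip(mix_mag - perc_mag, 0.0, None)
```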
The estimated harmonic components are given as an input to the phase unwrapping (PU) algorithm, which enhances them further by applying improved phase recovery techniques.
The most common approach when separating music signals using magnitude spectrograms is to use the phase of the mixture. This is equivalent to assuming that each time-frequency bin of the short-time Fourier transform (STFT) contains information from only one source. In a realistic scenario, such as the harmonic/percussive case, this assumption no longer holds, since the sources strongly overlap in time and frequency.
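The mixture-phase approach can be illustrated with a toy scipy-based example (the signals and STFT parameters are made up): combine one source's magnitude with the phase of the mixture, then invert the STFT.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
harmonic = np.sin(2 * np.pi * 440 * t)   # toy "harmonic" source: a sine tone
percussive = np.zeros_like(t)
percussive[::4000] = 1.0                 # toy "percussive" source: sparse clicks
mix = harmonic + percussive

_, _, MIX = stft(mix, fs=fs, nperseg=1024)
_, _, HARM = stft(harmonic, fs=fs, nperseg=1024)

# Mixture-phase reconstruction: the (here, oracle) harmonic magnitude is
# combined with the mixture's phase before inverting.
harm_est_tf = np.abs(HARM) * np.exp(1j * np.angle(MIX))
_, harm_est = istft(harm_est_tf, fs=fs, nperseg=1024)
```

In time-frequency bins where the two sources overlap, the mixture phase differs from the true source phase; that discrepancy is exactly the error this common approach incurs, and what phase recovery aims to reduce.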
The PU algorithm consists in predicting the phase of the harmonic source by using a sinusoidal model. Then, starting from this initial estimate, an iterative procedure is applied to minimize the mixing error and yield the final source estimates. We applied the PU algorithm to the predictions of the harmonic components, in order to reduce the interference from the percussive sources.
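The iterative part can be sketched generically as a magnitude-constrained redistribution of the mixing error: keep each source's magnitude fixed and update only the phases so that the estimates sum up closer to the mixture. This is a sketch in the spirit of the description above; the exact PU update rules are given in the paper.

```python
import numpy as np

def refine_phases(mix_tf, mags, phases_init, n_iter=50):
    """Iteratively update only the phases of the source estimates, keeping
    their magnitudes fixed, to reduce the mixing error.
    Generic sketch, not the exact PU algorithm from the paper."""
    est = [m * np.exp(1j * p) for m, p in zip(mags, phases_init)]
    n_src = len(est)
    for _ in range(n_iter):
        err = mix_tf - sum(est)                      # current mixing error
        # Distribute an equal share of the error to each source, then
        # project back onto the known magnitude (phase-only update).
        est = [m * np.exp(1j * np.angle(s + err / n_src))
               for m, s in zip(mags, est)]
    return est

# Toy usage: two complex "sources" whose true magnitudes are known.
rng = np.random.default_rng(2)
s1 = rng.random((8, 6)) * np.exp(1j * rng.uniform(-np.pi, np.pi, (8, 6)))
s2 = rng.random((8, 6)) * np.exp(1j * rng.uniform(-np.pi, np.pi, (8, 6)))
mix_tf = s1 + s2

zeros = np.zeros((8, 6))
err_before = np.linalg.norm(mix_tf - (np.abs(s1) + np.abs(s2)))  # zero-phase init
e1, e2 = refine_phases(mix_tf, [np.abs(s1), np.abs(s2)], [zeros, zeros])
err_after = np.linalg.norm(mix_tf - (e1 + e2))
```

With the true magnitudes available, the iterations drive the sum of the estimates toward the mixture, so `err_after` should be smaller than `err_before`.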
The iterative process is illustrated in the image below, and more details can be found at the corresponding website.
Below you can actually listen to the performance of our method! We have a set of songs and, for each one, we offer for listening the original mixture (i.e. the song), the original components, and the components as separated by our method.
We have resulting audio from two different settings, corresponding to different sets of hyper-parameters for MaD TwinNet.
It must be mentioned that we did not apply any kind of extra post-processing to the files. You will hear the actual, unprocessed output of our method.
| Artist | Song | Genre |
| --- | --- | --- |
| Signe Jakobsen | What Have You Done To Me | Rock Singer-Songwriter |
| Fergessen | Back From The Start | Melodic Indie Rock |
| James Elder & Mark M Thompson | The English Actor | Indie Pop |
| Leaf | Come around | Atmospheric Indie Pop |
In other words: from what data did our method learn, on what data was it tested, and how well did it perform from an objective perspective?
To benchmark our complete method (i.e. MaD TwinNet plus the PU algorithm), we compared the obtained results against a typical method (the kernel additive model, KAM) and against MaD TwinNet using the phase of the mixture.
You can see the information about the data at the "Dataset" section and information about the obtained objective results at the "Objective results" section.
In order to train our method, we used the development subset of the Demixing Secret Dataset (DSD), which consists of 50 mixtures with their corresponding sources, plus music stems from MedleyDB.
For testing our method, we used the testing subset of the DSD, consisting of 50 mixtures and their corresponding sources.
We objectively evaluated our method using the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). The results (in dB; higher is better) can be seen in the table below.
| Setting | Method | Percussive SDR / SIR / SAR | Harmonic SDR / SIR / SAR | Average SDR / SIR / SAR |
| --- | --- | --- | --- | --- |
| Setting 1 | MaD TwinNet & mix phase | 3.35 / 4.65 / 6.10 | 8.62 / 14.22 / 10.75 | 5.99 / 9.44 / 8.43 |
| Setting 1 | MaD TwinNet & PU | 3.35 / 4.66 / 6.08 | 8.58 / 14.45 / 10.59 | 5.97 / 9.55 / 8.34 |
| Setting 2 | MaD TwinNet & mix phase | 3.60 / 4.73 / 6.07 | 8.70 / 12.84 / 11.78 | 6.15 / 8.79 / 8.92 |
| Setting 2 | MaD TwinNet & PU | 3.59 / 4.76 / 6.00 | 8.69 / 13.11 / 11.57 | 6.14 / 8.94 / 8.78 |
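For intuition, a simplified SDR can be computed as below. This version lumps all error into "distortion"; the full BSS Eval methodology behind tables like the one above further decomposes the error into interference and artifact terms (giving SIR and SAR).

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: treats everything that
    differs from the reference signal as distortion."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

# Toy signals (made up): a reference and an estimate with a small error.
ref = np.sin(np.linspace(0, 100, 8000))
est = ref + 0.1 * np.cos(np.linspace(0, 50, 8000))
sdr = simple_sdr(ref, est)   # around 20 dB for this error level
```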
We would like to kindly acknowledge all those who supported and helped us with this work.