Audio research group - Tampere University - Source separation & signal enhancement

Source separation is the task of estimating the signal produced by an individual sound source from a mixture signal consisting of several sources. It is a fundamental problem in audio signal processing, since isolated sources can be analyzed and processed with much better accuracy than mixtures of sounds. Human listeners also benefit from source separation in many applications, for example in hearing aids.

Singing voice and music separation

Separating the instrumental tracks of a song is an important topic in music signal processing, since it serves as a preprocessing step for higher-level applications. Indeed, many music processing methods perform better when applied to individual tracks rather than directly to the mixture. For instance, extracting the singing voice from the musical accompaniment is helpful for lyrics transcription and karaoke. Separating pitched and percussive sounds (such as drums) is useful for rhythm analysis and time-stretching applications. The group has pioneered several lines of research on music source separation, notably through statistical and nonnegative matrix factorization (NMF) models, as well as modern deep learning architectures.
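As a rough illustration of the nonnegative matrix factorization idea mentioned above, the following sketch factorizes a magnitude spectrogram and splits it into components with soft masks. It is a generic textbook formulation (multiplicative updates for the generalized Kullback-Leibler divergence), not the group's own models; the function names `nmf_kl` and `soft_mask_components`, the toy data, and all parameter values are assumptions made for this example.

```python
import numpy as np

def nmf_kl(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix V (freq x time) as W @ H using
    multiplicative updates for the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps   # spectral templates
    H = rng.random((n_components, n_frames)) + eps  # time activations
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)
    return W, H

def soft_mask_components(V, W, H, eps=1e-10):
    """Split V into one spectrogram per NMF component using
    Wiener-like soft masks, which sum to (almost exactly) one."""
    WH = W @ H + eps
    return [V * (np.outer(W[:, k], H[k]) / WH) for k in range(W.shape[1])]

# Toy "spectrogram" built from two hypothetical sources (assumed data).
rng = np.random.default_rng(1)
W_true = np.abs(rng.normal(size=(64, 2)))
H_true = np.abs(rng.normal(size=(2, 100)))
V = W_true @ H_true

W, H = nmf_kl(V, n_components=2)
parts = soft_mask_components(V, W, H)
```

In a real system the masked spectrograms would be combined with the mixture phase and inverted back to the time domain, and components would have to be grouped into sources; both steps are omitted here.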

Bibliography

article

Online Spectrogram Inversion for Low-Latency Audio Source Separation
Paul Magron, Tuomas Virtanen, 2020

article

Deep Learning for Audio Signal Processing
Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-yiin Chang, Tara Sainath, 2019

article

Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Estefania Cano, Gerald Schuller, 2019

conference

Bayesian anisotropic Gaussian model for audio source separation
Paul Magron, Tuomas Virtanen, 2018

conference

Expectation-maximization algorithms for Itakura-Saito nonnegative matrix factorization
Paul Magron, Tuomas Virtanen, 2018

conference

Towards Complex Nonnegative Matrix Factorization with the Beta-Divergence
Paul Magron, Tuomas Virtanen, 2018

conference

Estimation of time-varying room impulse responses of multiple sound sources from observed mixture and isolated source signals
Joonas Nikunen, Tuomas Virtanen, 2018

conference

On modeling the STFT phase of audio signals with the von Mises distribution
Paul Magron, Tuomas Virtanen, 2018

article

Complex ISNMF: a phase-aware model for monaural audio source separation
Paul Magron, Tuomas Virtanen, 2018

article

Multichannel Blind Sound Source Separation using Spatial Covariance Model with Level and Time Differences and Non-Negative Matrix Factorization
Julio Jose Carabias Orti, Joonas Nikunen, Tuomas Virtanen, Pedro Vera-Candeas, 2018

conference

Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation
Paul Magron, Konstantinos Drossos, Stylianos Ioannis Mimilakis, Tuomas Virtanen, 2018

conference

A Recurrent Encoder-Decoder Approach With Skip-Filtering Connections for Monaural Singing Voice Separation
Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, Gerald Schuller, 2017

article

Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking
Joonas Nikunen, Aleksandr Diment, Tuomas Virtanen, 2017

conference

Lévy NMF for robust nonnegative source separation
Paul Magron, Roland Badeau, Antoine Liutkus, 2017

conference

Low Latency Sound Source Separation using Convolutional Recurrent Neural Networks
Gaurav Naithani, Tom Barker, Giambattista Parascandolo, Lars Bramsløw, Niels Henrik Pontoppidan, Tuomas Virtanen, 2017

conference

Low-Latency Sound Source Separation Using Deep Neural Networks
Gaurav Naithani, Giambattista Parascandolo, Tom Barker, Niels Henrik Pontoppidan, Tuomas Virtanen, 2016

article

Blind Separation of Audio Mixtures Through Nonnegative Tensor Factorization of Modulation Spectrograms
Tom Barker, Tuomas Virtanen, 2016

conference

Low-Latency Sound-Source-Separation using Non-Negative Matrix Factorisation with Coupled Analysis and Synthesis Dictionaries
Tom Barker, Tuomas Virtanen, Niels Henrik Pontoppidan, 2015

conference

Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation
Tom Barker, Tuomas Virtanen, 2014

article

Online Blind Speech Separation using Multiple Acoustic Speaker Tracking and Time-Frequency Masking
Pasi Pertilä, 2013

conference

Permutation Alignment of Frequency-Domain ICA by the Maximization of Intra-Source Envelope Correlations
Joonas Nikunen, Tuomas Virtanen, Pasi Pertilä, Miikka Vilermo, 2012

Spatial sound source separation

Human hearing can separate sounds based on their perceived direction. Spatial sound source separation applies the same principle to enhance and separate sounds coming from a certain direction or location. It uses directional features obtained by capturing the sound scene with multiple microphones (a microphone array). The separation methods can be divided into separation mask estimation and direct source signal estimation conditioned on a certain direction or spatial location. Spatial sound source separation often requires estimating the source direction of arrival and the source trajectory, either jointly with the separation or as a preprocessing step. For further details, please refer to the spatial audio research pages.
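To make the idea of enhancing sound from a given direction concrete, here is a minimal delay-and-sum beamformer sketch. It is a classical baseline chosen for illustration, not one of the group's methods, and it assumes a linear array, a far-field plane wave, a known direction of arrival, and circular (FFT-based) delays. The function name `delay_and_sum`, the array geometry, and the test signal are all hypothetical.

```python
import numpy as np

def delay_and_sum(x, mic_positions, angle_rad, fs, c=343.0):
    """Steer a linear array towards angle_rad (measured from the array axis).

    x: array of shape (n_mics, n_samples); mic_positions: metres along the axis.
    """
    n_samples = x.shape[1]
    # Arrival delay of a far-field plane wave at each microphone.
    delays = -mic_positions * np.cos(angle_rad) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    # Undo each channel's delay with a phase shift, then average the channels.
    aligned = X * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)

# Simulated example (assumed setup): 4 microphones, 5 cm spacing, source at 60 degrees.
fs = 16000
mics = np.array([0.0, 0.05, 0.10, 0.15])
theta = np.deg2rad(60.0)
rng = np.random.default_rng(0)
s = rng.standard_normal(1001)  # odd length avoids the special-case Nyquist bin

true_delays = -mics * np.cos(theta) / 343.0
f = np.fft.rfftfreq(s.size, d=1.0 / fs)
phase = np.exp(-2j * np.pi * f[None, :] * true_delays[:, None])
x = np.fft.irfft(np.fft.rfft(s)[None, :] * phase, n=s.size)

y_on = delay_and_sum(x, mics, theta, fs)               # steered at the source
y_off = delay_and_sum(x, mics, np.deg2rad(150.0), fs)  # steered elsewhere
```

Steering at the true direction recovers the source signal in this idealized simulation, while steering elsewhere attenuates it. Practical systems must also estimate the direction, which is the joint or preprocessing problem described above.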

Bibliography

article

Online Blind Speech Separation using Multiple Acoustic Speaker Tracking and Time-Frequency Masking
Pasi Pertilä, 2013

Object-based coding of spatial audio

Separating sound objects from spatial audio mixtures allows the source/object signal to be encoded separately from its spatial parameters (mixing), which are transmitted as side information. Representing a spatial audio mixture as objects with interpretable directions is efficient from an encoding perspective, and object-based representations allow the observed sound scene to be modified during spatial synthesis. The group has previously conducted pioneering research on object-based multichannel audio upmixing.

Bibliography

conference

Multichannel audio upmixing based on non-negative tensor factorization representation
Joonas Nikunen, Tuomas Virtanen, Miikka Vilermo, 2011

conference

Object-Based Audio Coding Using Non-Negative Matrix Factorization for the Spectrogram Representation
Joonas Nikunen, Tuomas Virtanen, 2010

conference

Noise-to-Mask Ratio Minimization by Weighted Non-negative Matrix Factorization
Joonas Nikunen, Tuomas Virtanen, 2010

Speech enhancement and separation

One specific task in source separation is enhancing a target speaker's voice in the presence of background noise or interfering voices. The applications of speech separation range from preprocessing for automatic speech recognition (ASR) all the way to easing everyday communication for hearing-impaired listeners. Machine learning and deep learning are used to train models that distinguish and separate the target speech from the interfering sounds, preserving only the signal components that correspond to the target speech. In general, the goal of speech enhancement is to improve the intelligibility of speech for either human listeners or machines. The success of speech enhancement is often measured and predicted using objective criteria, while our group's work also involves listening tests to determine the actual intelligibility of separated speech for human listeners.
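The text does not name the objective criteria used. As one example of what such a criterion can look like, the sketch below computes the scale-invariant signal-to-distortion ratio (SI-SDR), a widely used signal-level measure of separation quality. It is not an intelligibility predictor and is not claimed to be the measure the group uses; the function name `si_sdr` and the synthetic signals are assumptions for illustration.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(residual, residual) + eps))

# Synthetic stand-ins for speech and noise (assumed data, not real recordings).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = clean + 0.5 * noise      # unprocessed mixture
enhanced = clean + 0.1 * noise   # hypothetical output of an enhancement system

score_noisy = si_sdr(noisy, clean)
score_enhanced = si_sdr(enhanced, clean)
```

A higher score for the enhanced signal than for the noisy one indicates less residual distortion, but listening tests remain necessary to confirm actual intelligibility for human listeners.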