Recurrent neural networks for polyphonic sound event detection


Parascandolo, Giambattista

Abstract

The objective of this thesis is to investigate how a deep learning model called a recurrent neural network (RNN) performs in the task of detecting overlapping sound events in real-life environments. Examples of such sound events include dog barking, footsteps, and crowd applause. When several sound sources are active simultaneously, as is often the case in everyday contexts, identifying individual sound events from their polyphonic mixture is a challenging task. Other factors, such as noise and distortion, make it even more difficult to explicitly implement a computer program that solves the detection task. We present an approach to polyphonic sound event detection in real-life recordings based on an RNN architecture called bidirectional long short-term memory (BLSTM). A multilabel BLSTM RNN is trained to map the time-frequency representation of a mixture signal, consisting of sounds from multiple sources, to binary activity indicators for each event class. Our method is tested on two large databases of recordings, both containing sound events from more than 60 different classes, and in one case from 10 different everyday contexts. Furthermore, to reduce overfitting, we propose several data augmentation techniques: time stretching, sub-frame time shifting, and block mixing. The proposed approach outperforms the previous state-of-the-art method despite using half as many parameters, and the results improve substantially further with the block mixing data augmentation technique. Overall, for the first dataset our approach reports an average F1-score of 65.5% on 1-second blocks and 64.7% on single frames, a relative improvement over the previous state-of-the-art approach of 6.8% and 15.1% respectively. For the second dataset our system reports an average F1-score of 84.4% on 1-second blocks and 85.1% on single frames, a relative improvement over the baseline approach of 38.4% and 35.9% respectively.
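
To make the notions of binary activity indicators and frame-level F1-score concrete, the following minimal Python sketch thresholds per-frame, per-class scores and computes an F1-score over all (frame, class) pairs. It is an illustration only, not the thesis's evaluation code: the function names, the 0.5 threshold, the toy data, and the choice to pool all frame/class pairs into one score are assumptions, and the abstract does not specify how scores are averaged or how 1-second blocks are formed.

```python
def binarize(scores, threshold=0.5):
    """Turn per-frame, per-class scores into binary activity indicators.

    The 0.5 threshold is an assumption made for this sketch.
    """
    return [[1 if s >= threshold else 0 for s in frame] for frame in scores]


def frame_f1(reference, predicted):
    """F1-score pooled over all (frame, class) pairs of two binary matrices."""
    tp = fp = fn = 0
    for ref_frame, pred_frame in zip(reference, predicted):
        for ref, pred in zip(ref_frame, pred_frame):
            if ref and pred:
                tp += 1  # event active and detected
            elif pred:
                fp += 1  # event detected but not active
            elif ref:
                fn += 1  # event active but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 4 frames, 3 hypothetical event classes; several classes may be
# active in the same frame, which is what makes the task polyphonic.
reference = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
scores = [[0.9, 0.2, 0.1], [0.8, 0.4, 0.1], [0.3, 0.7, 0.6], [0.1, 0.2, 0.9]]

predicted = binarize(scores)
f1 = frame_f1(reference, predicted)
print(f"frame-level F1: {f1:.3f}")
```

In this toy example the system misses one active event and raises one false alarm, so precision and recall are both 4/5.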

Year:
2015