Deep Neural Networks for Sound Event Detection
The objective of this thesis is to develop novel classification and feature learning techniques for the task of sound event detection (SED) in real-world environments. Throughout their lives, humans continuously learn to assign meanings to sounds. Thanks to this, most people can easily recognize the sound of thunder, a dog barking, a doorbell, or a bird singing. In this work, we aim to develop systems that can automatically detect the sound events commonly present in our daily lives. Such systems can be utilized in, e.g., context-aware devices, acoustic surveillance, bio-acoustical and healthcare monitoring, and smart homes and cities.
In this thesis, we propose to apply modern machine learning methods known as deep learning to SED. The relationship between the time-frequency representations commonly used for SED (such as the mel spectrogram and the magnitude spectrogram) and the target sound event labels is highly complex. Deep learning methods such as deep neural networks (DNN) utilize a layered structure of units to extract features from the given sound representation, with increasing abstraction at each layer. This increases the network's capacity to efficiently learn the highly complex relationship between the sound representation and the target sound event labels. We found that the proposed DNN approach performs significantly better than established classifier techniques for SED such as Gaussian mixture models.
In a time-frequency representation of an audio recording, a sound event can often be recognized as a distinct pattern that may exhibit shifts in both dimensions. The intra-class variability of sound events may cause small shifts in the frequency-domain content, and the time-domain shift results from the fact that a sound event can occur at any time in a given audio recording. We found that convolutional neural networks (CNN) are useful for learning shift-invariant filters, which are essential for robust modeling of sound events.
In addition, we show that recurrent neural networks (RNN) are effective in modeling the long-term temporal characteristics of sound events. Finally, we combine convolutional and recurrent layers in a single classifier called a convolutional recurrent neural network (CRNN), which combines the benefits of both and provides state-of-the-art results on multiple SED benchmark datasets.
Aside from learning the mappings between time-frequency representations and sound event labels, we show that deep learning methods can also be utilized to learn a direct mapping between the target labels and a lower-level representation such as the magnitude spectrogram or even the raw audio signal. In this thesis, we propose to integrate the feature learning capabilities of deep learning methods with empirical knowledge of human auditory perception by initializing layer weights with filterbank coefficients. This results in an optimized, ad-hoc filterbank that is obtained through gradient-based optimization of the original coefficients to improve SED performance.
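As an illustration of initializing layer weights with filterbank coefficients, the following is a minimal sketch and not the implementation used in this thesis. It assumes triangular mel filters built with the HTK-style mel formula, and hypothetical sizes (40 mel bands, 513 magnitude-spectrum bins, 44.1 kHz sampling rate). The function names `mel_filterbank` and `first_layer` are introduced here for illustration only.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula (one common convention; assumed here).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_bins, sample_rate):
    """Return an n_mels x n_bins matrix of triangular mel filter weights."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    # Frequency in Hz of each magnitude-spectrum bin.
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    weights = []
    for m in range(n_mels):
        lo, center, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            # Triangle: zero outside [lo, hi], one at the center frequency.
            row.append(max(0.0, min(rising, falling)))
        weights.append(row)
    return weights


def first_layer(weights, magnitude_frame, eps=1e-8):
    """Forward pass of the first layer: log of weighted sums of the bins."""
    return [
        math.log(sum(w * x for w, x in zip(row, magnitude_frame)) + eps)
        for row in weights
    ]


# Hypothetical sizes, chosen only for this example.
N_MELS, N_BINS, SAMPLE_RATE = 40, 513, 44100

# Initial weights of the first layer. In training they would be treated as
# ordinary trainable parameters and updated by gradient descent.
W = mel_filterbank(N_MELS, N_BINS, SAMPLE_RATE)

# One magnitude-spectrum frame (all ones, as a placeholder input).
frame = [1.0] * N_BINS
features = first_layer(W, frame)
```

Before any training, such a layer simply computes log mel band energies from the magnitude spectrum. Gradient-based updates of `W` would then move the coefficients away from the mel filters wherever doing so reduces the SED loss.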