Computational Auditory Scene Recognition
Abstract
An acoustic environment surrounding a listener, an auditory scene, can provide contextual cues that enable the recognition of the scene. This thesis concerns the problem of computational auditory scene recognition, which is a subproblem of computational auditory scene analysis. Computational auditory scene analysis refers to the computational analysis of an acoustic environment and the recognition of distinct sound events in it. In this study, the focus is not on analyzing and recognizing discrete sound events (although they may be used in the recognition process), but on the classification of acoustic environments as a whole. This thesis covers all the phases of a study carried out at the Signal Processing Laboratory of Tampere University of Technology: a literature review on auditory scene recognition and related fields of research, acoustic measurements made in a number of everyday auditory environments, design and implementation of the audio database access software, a listening test examining human abilities in auditory scene recognition, audio signal classification theory, algorithm development, and simulations. The core of this thesis is the computational audio classification and signal processing algorithm development. Auditory scene recognition involves correct grouping of similar environments, feature selection and extraction, and the use of a suitable classification algorithm. A crucial step in solving the problem is to determine appropriate features that can discriminate between the acoustic data associated with pre-defined scene classes. The listening tests show that, on average, humans are able to recognize 25 different scenes with 70% accuracy. The scenes included everyday outdoor and indoor environments, such as streets, market places, restaurants, and family homes. The performance of the computational classification methods was investigated by conducting Matlab simulations. The best recognition rate obtained for 13 different scenes was 56%; the classified scenes were selected so that each had at least three recordings from different locations. We also experimented with recognizing more general classes (meta-classes), and for certain categorizations of the scenes we obtained relatively good classification results. For example, the meta-class car vs. other was classified correctly in 95% of the cases.
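The pipeline summarized above (feature extraction, classification, and regrouping of scenes into meta-classes) can be illustrated with a minimal Python sketch. It is not the thesis implementation, which used Matlab. The abstract does not name the features or the classifier, so the zero-crossing rate, log energy, nearest-neighbor rule, synthetic "recordings", and all function names below are assumptions chosen only for illustration.

```python
import math
import random


def frame_features(signal, frame_len=256):
    """Mean zero-crossing rate and mean log energy over non-overlapping frames.

    Assumed features for illustration; the abstract does not list the ones used.
    """
    zcrs, energies = [], []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        zcrs.append(crossings / (frame_len - 1))
        energy = sum(x * x for x in frame) / frame_len
        energies.append(math.log(energy + 1e-12))
    return (sum(zcrs) / len(zcrs), sum(energies) / len(energies))


def nearest_neighbor(train, query):
    """Return the label of the training vector closest to `query` (1-NN)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def to_meta_class(label):
    """Collapse scene labels into the 'car vs. other' meta-class."""
    return "car" if label == "car" else "other"


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def synth(freq, amp, rng, n=4096, rate=8000):
    """Hypothetical stand-in for a recording: a noisy sine wave."""
    return [amp * math.sin(2 * math.pi * freq * t / rate) + rng.gauss(0, 0.01)
            for t in range(n)]


rng = random.Random(0)
# Toy training set: a low-frequency, loud "car" and a higher-frequency "street".
train = [
    (frame_features(synth(60, 0.8, rng)), "car"),
    (frame_features(synth(1500, 0.2, rng)), "street"),
]
predicted = nearest_neighbor(train, frame_features(synth(70, 0.7, rng)))
print(predicted, to_meta_class(predicted))
```

In a realistic setting the features would need to be normalized to comparable scales before computing distances, and recognition rates would be estimated on recordings from locations not seen in training.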
Year: 2001