Object-based Modeling of Audio for Coding and Source Separation
Abstract
This thesis studies several data decomposition algorithms for obtaining an object-based representation of an audio signal. The estimation of the representation parameters is coupled with audio-specific criteria, such as the spectral redundancy, sparsity, perceptual relevance and spatial position of sounds. The objective is an audio signal representation composed of meaningful entities, called audio objects, that reflect the properties of real-world sound objects and events. The object-based model is estimated from the redundancy of the magnitude spectrogram using non-negative matrix factorization, with extensions to multichannel and complex-valued data. The benefits of working with object-based audio representations over conventional time-frequency bin-wise processing are studied. The two main applications of the object-based audio representations proposed in this thesis are spatial audio coding and sound source separation from multichannel microphone array recordings.
In the proposed spatial audio coding algorithm, the audio objects are estimated from the multichannel magnitude spectrogram. The audio objects are then used to recover the content of each original channel from a single downmixed signal by time-frequency filtering. Perceptual relevance is taken into account when estimating the parameters of the object-based model, and the sparsity of the model is exploited in encoding its parameters. Additionally, a quantization of the model parameters is proposed that reflects the perceptual relevance of each quantized element. The proposed object-based spatial audio coding algorithm is evaluated in listening tests that compare its overall perceptual quality with that of conventional time-frequency block-wise methods at the same bitrates. The proposed approach is found to produce comparable coding efficiency while providing additional functionality through the object-based coding domain representation, such as blind separation of the mixture of sound sources in the encoded channels.
For sound source separation from multichannel audio recorded by a microphone array, a method combining an object-based magnitude model with spatial covariance matrix estimation is considered. A direction-of-arrival-based model for the spatial covariance matrices of the sound sources is proposed. Unlike conventional approaches, the parameter estimation of the proposed spatial covariance matrix model ensures a spatially coherent solution for the spatial parameterization of the sound sources. The separation quality is measured with objective criteria, and the proposed method is shown to improve over state-of-the-art sound source separation methods on recordings made with a small microphone array.
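As a rough illustration of the basic building block named in the abstract, the following sketch factorizes a magnitude spectrogram into audio objects with non-negative matrix factorization and derives soft time-frequency masks from them. It is not the thesis implementation: the multichannel and complex-valued extensions, the perceptual weighting and the spatial covariance model are not shown. The choice of the generalized Kullback-Leibler divergence with standard multiplicative updates, the function names `nmf_kl` and `object_masks`, and the synthetic input are all assumptions made for this example.

```python
import numpy as np


def nmf_kl(V, n_objects, n_iter=200, eps=1e-12, seed=0):
    """Factorize a non-negative matrix V (freq x time) as W @ H.

    Assumption: generalized Kullback-Leibler divergence with the
    standard multiplicative updates; the thesis may use other costs.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_objects)) + eps    # spectral basis of each object
    H = rng.random((n_objects, n_frames)) + eps  # time-varying gain of each object
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)
    return W, H


def object_masks(W, H, eps=1e-12):
    """Soft time-frequency masks, one per object, shape (freq, objects, time)."""
    parts = W[:, :, None] * H[None, :, :]
    return parts / (parts.sum(axis=1, keepdims=True) + eps)


# Synthetic stand-in for a magnitude spectrogram (hypothetical data).
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 30))

W, H = nmf_kl(V, n_objects=2)
masks = object_masks(W, H)
object_mags = masks * V[:, None, :]  # per-object magnitude estimates

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In a real pipeline, `V` would be the magnitude of a short-time Fourier transform, and the masks would be applied to the complex spectrogram of the mixture or downmix before inverse transformation, which corresponds to the time-frequency filtering step mentioned above.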
- Year: 2015
- Note: Awarding institution: Tampereen teknillinen yliopisto - Tampere University of Technology. Made available in DSpace on 2015-01-12.