Learning vocal mode classifiers from heterogeneous data sources

Shuyang, Zhao; Heittola, Toni; Virtanen, Tuomas
Abstract

This paper targets a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary source. However, previous studies on sound classification are commonly based on cross-validation within a single dataset, without considering cases where training and testing data are recorded under mismatched conditions. Experiments on a new dataset, TUT-vocal-2016, revealed a large difference between the homogeneous and heterogeneous recognition scenarios. In the homogeneous recognition scenario, the classification accuracy using cross-validation on TUT-vocal-2016 was 95.5%. In the heterogeneous recognition scenario, where seven existing datasets were used as training material and TUT-vocal-2016 was used for testing, the accuracy dropped to 69.6%. Several feature normalization methods were tested to improve performance in the heterogeneous scenario; the best performance (96.8%) was obtained with the proposed subdataset-wise normalization.
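
The abstract does not give the exact formulation of subdataset-wise normalization; the sketch below is a minimal illustration, assuming it amounts to z-score normalization of acoustic features with the mean and standard deviation computed separately for each sub-dataset (recording source). The function name and array layout are illustrative, not taken from the paper.

    import numpy as np

    def subdataset_wise_normalize(features, subdataset_ids):
        """Zero-mean, unit-variance normalization per feature dimension,
        with statistics computed separately for every sub-dataset.

        features:       array of shape (n_frames, n_features)
        subdataset_ids: array of shape (n_frames,) identifying the source
                        sub-dataset of each frame
        """
        features = np.asarray(features, dtype=float)
        subdataset_ids = np.asarray(subdataset_ids)
        normalized = np.empty_like(features)
        for sid in np.unique(subdataset_ids):
            mask = subdataset_ids == sid
            mean = features[mask].mean(axis=0)
            std = features[mask].std(axis=0) + 1e-8  # guard against zero variance
            normalized[mask] = (features[mask] - mean) / std
        return normalized

Normalizing per sub-dataset rather than over the pooled training material removes per-corpus offsets in recording conditions, which is the plausible reason this scheme helps in the heterogeneous scenario.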

Keywords

sound classification; vocal mode; heterogeneous data sources; feature normalization

Year:
2017
Book title:
2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Pages:
16–20
Address:
United States
ISBN:
978-1-5386-1631-4
DOI:
10.1109/WASPAA.2017.8169986