Computational auditory scene analysis (CASA) aims to separate and recognize mixtures of sound sources present in an auditory scene in a similar manner to humans. Humans can easily segregate and recognize one sound source from an acoustic mixture, such as a certain voice from a busy background that includes other people talking and music. Machine listening systems try to replicate this perceptual ability using, e.g., sound source recognition techniques.

 

Acoustic scene classification

The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "home", or "office". This information would enable wearable devices to better serve users' needs, e.g., by adjusting their mode of operation based on the acoustic scene or location type. Listening tests conducted in our research group showed that humans are able to recognize everyday acoustic scenes in 70% of cases on average. Current state-of-the-art automatic scene recognition methods can reach this level of accuracy with a small and well-defined set of classes.
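As a rough illustration of the task setup (not of our actual systems), the sketch below classifies whole recordings into scene classes using averaged MFCC features and an off-the-shelf classifier; the file names, labels, and feature choices are illustrative assumptions.

```python
# Minimal acoustic scene classification sketch (illustrative only).
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def scene_features(wav_path, sr=22050, n_mfcc=20):
    """Summarize a whole recording with the mean and std of its MFCCs."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # shape: (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # fixed-length vector

# Hypothetical labelled training recordings (one per scene class here, for brevity).
train_items = [("park_01.wav", "park"), ("home_01.wav", "home"), ("office_01.wav", "office")]
X = np.stack([scene_features(path) for path, _ in train_items])
y = [label for _, label in train_items]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([scene_features("test_recording.wav")]))        # e.g. ['office']
```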

 

Sound event detection and recognition

Audio streams, such as broadcast news, meeting recordings, and personal videos, contain sounds from a wide variety of sound sources. These streams include audio events related to human presence, such as speech, laughter, or coughing, as well as sounds of animals, objects, nature, and everyday situations. Detecting these events is useful, e.g., for automatic tagging in audio indexing, automatic sound analysis for audio segmentation, or audio context classification.

An acoustic scene is characterized by the individual sound events present in it. In this respect, we may want to produce a multi-class description of our audio or video files by detecting the categories of sound events that occur in a file. For example, one may want to tag a holiday recording as being on the "beach", playing with the "children" and the "dog", right before the "storm" came. These annotations are at different levels: while the beach as a context could be inferred from acoustic events such as waves, wind, and water splashing, the audio events "dog barking" or "children" should be recognized explicitly, because such events may appear in other contexts, too.
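As a toy illustration of this two-level annotation (with hypothetical event names and rules, not an actual ARG method), a context tag such as "beach" can be inferred from a set of detected events, while tags like "dog barking" come directly from event recognition:

```python
# Hypothetical cue-to-context rules; a context tag is added when any cue event is detected.
CONTEXT_RULES = {
    "beach": {"waves", "wind", "water splashing"},
    "street": {"car passing", "traffic", "footsteps"},
}

def annotate(detected_events):
    """Return explicit event tags plus any context tags supported by the detected events."""
    events = set(detected_events)
    contexts = [ctx for ctx, cues in CONTEXT_RULES.items() if cues & events]
    return {"events": sorted(events), "contexts": contexts}

print(annotate(["waves", "wind", "dog barking", "children"]))
# {'events': ['children', 'dog barking', 'waves', 'wind'], 'contexts': ['beach']}
```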

Sound event detection and classification aim to process the acoustic signal and convert it into symbolic descriptions of the sound events present in it. In recent years, we have worked to extend the sound event detection task to a comprehensive set of event-annotated audio material from everyday environments. Most everyday auditory scenes are acoustically complex, with multiple overlapping sound events active at the same time, which presents a particular challenge for sound event detection.
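A common way to handle overlapping events is to treat detection as frame-wise multi-label classification, with one sigmoid output per event class so that several classes can be active in the same frame. The sketch below shows this idea; the architecture and dimensions are illustrative assumptions rather than our published models.

```python
import torch
import torch.nn as nn

class FrameEventDetector(nn.Module):
    """Frame-wise multi-label detector: one sigmoid activity per class and frame."""
    def __init__(self, n_mels=64, n_classes=10, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mels, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel_frames):            # (batch, time, n_mels)
        h, _ = self.rnn(mel_frames)           # temporal context from a recurrent layer
        return torch.sigmoid(self.out(h))     # (batch, time, n_classes) event activities

model = FrameEventDetector()
mel = torch.randn(1, 500, 64)                 # dummy log-mel input, ~10 s of frames
activity = model(mel)
events_active = activity > 0.5                # thresholding allows overlapping active events
print(events_active.shape)                    # torch.Size([1, 500, 10])
```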

 

Audio tagging and captioning

Audio tagging is a sound classification task that refers to identifying the classes a human would assign to a sound segment. That is, given a sound segment (e.g. a sound file) as input, an audio tagging system should output an indication of the classes assigned to that segment. The emphasis in this task is on identifying the classes regardless of their number (i.e. a large number of classes to be detected), their frequency of occurrence (i.e. classes with an imbalanced and arbitrary frequency of appearance), the length of the input audio segment (i.e. input audio of arbitrary length), and differences in acoustic conditions or acoustic channels (i.e. recordings from different environments and/or different equipment). In ARG we are working on audio tagging, trying to tackle all of the above challenges. At the same time, we are constantly working towards gaining more insight into the mechanisms involved in human and machine perception and improving the performance of our algorithms and methods for audio tagging.
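A simple way to meet the arbitrary-length requirement is to pool frame-wise predictions over time into a single clip-level multi-label output, as in the sketch below; layer sizes and the number of tags are illustrative assumptions, and the class-imbalance handling that would normally go into the training loss is omitted.

```python
import torch
import torch.nn as nn

class AudioTagger(nn.Module):
    """Clip-level multi-label tagger; time pooling makes the input length arbitrary."""
    def __init__(self, n_mels=64, n_tags=50):
        super().__init__()
        self.frame_net = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                       nn.Linear(128, n_tags))

    def forward(self, mel_frames):                  # (batch, time, n_mels)
        frame_logits = self.frame_net(mel_frames)   # per-frame tag evidence
        clip_logits = frame_logits.mean(dim=1)      # pooling over time handles any length
        return torch.sigmoid(clip_logits)           # independent probability per tag

tagger = AudioTagger()
short_clip = torch.randn(1, 100, 64)
long_clip = torch.randn(1, 3000, 64)
print(tagger(short_clip).shape, tagger(long_clip).shape)   # both torch.Size([1, 50])
```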

 

Audio captioning is a recently introduced research direction, and it is not to be confused with subtitling. Audio captioning is the task in which a system takes a sound segment as input and outputs a textual description of its content, for example "People talking in a crowded restaurant". It can be considered an inter-modality translation, where information represented in one modality (i.e. sound) is translated to another (i.e. text). It is a complicated task: the method employed for audio captioning must simultaneously perform audio event recognition (to recognize the sound events happening in the segment), recognition of spatio-temporal relations and associations (to identify movement and the relative location of sound sources), and acoustic scene recognition (to identify the acoustic scene, i.e. the audio background), and then express all of this in a sentence that actually makes sense. In ARG we are happy to have the very first publication on this task. Since the task is quite new, stay tuned for more exciting results!
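Viewed as inter-modality translation, audio captioning is often approached with an encoder-decoder model: an audio encoder summarizes the sound segment and a text decoder generates the caption word by word. The minimal sketch below illustrates this structure only; the vocabulary, dimensions, and token ids are hypothetical, and it is not the model from our publication.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Encoder-decoder sketch: audio summary conditions word-by-word text generation."""
    def __init__(self, n_mels=64, vocab_size=1000, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, mel_frames, tokens):
        _, audio_state = self.encoder(mel_frames)      # audio summary as initial decoder state
        dec_out, _ = self.decoder(self.embed(tokens), audio_state)
        return self.out(dec_out)                       # logits over the next word

model = CaptionModel()
mel = torch.randn(1, 800, 64)                          # dummy mel-spectrogram input
tokens = torch.tensor([[1, 42, 57]])                   # <start> + partial caption (dummy ids)
logits = model(mel, tokens)
print(logits.shape)                                    # torch.Size([1, 3, 1000])
```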

 

General audio classification

Audio classification can be applied to any problem where it is useful and meaningful to gain knowledge about the audio content for the particular problem at hand. In addition to audio context and sound event recognition, such audio classification problems are encountered, for example, in content-based database retrieval tasks.

Classifying audio into separate pre-determined classes provides a useful means for browsing large audio/video databases. As text-based indexing of audio files is laborious and time-consuming, content-based audio classification has been studied rather widely in the audio signal processing field. Supervised learning methods, both commonly used ones and modified variants, are actively being researched in the Audio Research Team in order to apply such classification and retrieval successfully to realistic-size audio databases. In the general case, the audio content to be classified need not be limited to any specific type of audio, such as speech or music, as long as the training dataset used for the classifier(s) is representative enough to cover all the specified class types.
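As a minimal sketch of classification-based retrieval (with hypothetical features, classes, and file names, not a specific ARG system), each database item can be classified into a pre-determined class and retrieval can then filter by the predicted label:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(path):
    """Fixed-length feature vector per clip (mean and std of MFCCs)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# A representative labelled training set is assumed to cover all specified class types.
train = [("speech_01.wav", "speech"), ("music_01.wav", "music"), ("noise_01.wav", "noise")]
clf = SVC().fit(np.stack([clip_features(p) for p, _ in train]), [c for _, c in train])

# Classify every item in a (hypothetical) database, then retrieve by predicted class.
database = ["clip_a.wav", "clip_b.wav", "clip_c.wav"]
labels = {p: clf.predict([clip_features(p)])[0] for p in database}
print([p for p, c in labels.items() if c == "music"])   # all clips classified as "music"
```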