The Audio Research Group conducts research in many areas related to audio, speech, and music signals. The work covers both basic research on core techniques, such as acoustic pattern classification, and applied research targeting specific applications.
Content analysis of general audio is concerned with the analysis of audio for automatic extraction of relevant information such as acoustic scene characteristics and sound sources present. Applications of audio content analysis range from simple classification tasks for enabling context awareness, to security and surveillance systems based on detected sounds of interest, to the organization, indexing, tagging, and querying of large audio databases.
- Acoustic scene classification
Classifying recordings into classes that characterize the environment in which they were recorded.
- Sound event detection and recognition
Recognizing individual sound events present in the auditory scene.
- Audio tagging
Automatically applying descriptive tags to audio signals
- Audio captioning
Automatically generating textual descriptions of audio signals
- Audio-based database query
Querying large audio databases based on audio content
- General audio classification
Classifying audio signals based on many different aspects of the sound.
Microphone arrays give computer software a link to the physical locations of sound objects and allow the sound field to be captured. Applications of microphone arrays include determining physical location, such as speaker localization and speaker position tracking, as well as signal enhancement and separation.
- Sound localization
Automatic determination of the physical location of a sound source
Automatic localization and synchronization of the device microphones
- Speaker position tracking
Tracking the location and detecting the appearance and disappearance of multiple speakers that are temporally overlapping
- Shooter localization
Shooter localization and estimation of bullet trajectory, caliber, and speed
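As a rough illustration of how a microphone pair can relate a sound to a physical direction, the sketch below estimates the time difference of arrival between two microphones by cross-correlation and converts it to a far-field arrival angle. It is a minimal sketch under stated assumptions, not the group's method: the function names, the synthetic pulse, the 16 kHz sample rate, the 0.2 m microphone spacing, and the 343 m/s speed of sound are all chosen here for illustration.

```python
import math


def cross_correlation_delay(ref, sig, max_lag):
    """Return the lag (in samples) at which `sig` best matches `ref`.

    A positive result means `sig` is a delayed copy of `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += x * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_degrees(delay_samples, sample_rate, mic_distance,
                          speed_of_sound=343.0):
    """Far-field direction of arrival for two microphones, measured from broadside."""
    delay_seconds = delay_samples / sample_rate
    # Clamp to the physically valid range before taking the arcsine.
    ratio = max(-1.0, min(1.0, delay_seconds * speed_of_sound / mic_distance))
    return math.degrees(math.asin(ratio))


# Synthetic example (assumed data): a short pulse reaches microphone B
# three samples after it reaches microphone A.
pulse = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0]
mic_a = pulse + [0.0] * 10
mic_b = [0.0] * 3 + pulse + [0.0] * 7

delay = cross_correlation_delay(mic_a, mic_b, max_lag=8)
angle = arrival_angle_degrees(delay, sample_rate=16000, mic_distance=0.2)
print(delay, round(angle, 1))
```

Plain cross-correlation is used here only because it is easy to follow. Real recordings with reverberation and noise usually call for more robust delay estimators and more than two microphones.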
Source separation is the task of estimating the signal produced by an individual sound source from a mixture signal consisting of several sources. The techniques can be used to process sounds for human listeners, for example to improve speech intelligibility, or as a preprocessing step for computational analysis methods, since isolated sources can be analyzed and processed much more accurately than mixtures of sounds.
- Singing voice and music separation
Separation of instrumental tracks from a song
- Spatial sound source separation
Separation of sound sources based on their perceived direction
- Object-based coding of spatial audio
Separation of sound objects from spatial audio mixtures
- Speech enhancement and separation
Enhancement of a target speaker's voice against background noise or interfering voices
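One common way to estimate an individual source from a mixture is to apply a soft time-frequency mask to the mixture spectrum. The sketch below shows the idea for a single frame, assuming that rough per-source magnitude estimates are already available. In practice those estimates would come from a separation model, which is not shown. All names and numbers are hypothetical and chosen only for illustration.

```python
def ratio_masks(source_magnitudes, eps=1e-12):
    """Soft masks: each source's share of the summed magnitude in every bin."""
    n_bins = len(source_magnitudes[0])
    totals = [sum(src[k] for src in source_magnitudes) + eps
              for k in range(n_bins)]
    return [[src[k] / totals[k] for k in range(n_bins)]
            for src in source_magnitudes]


def apply_mask(mixture, mask):
    """Scale each mixture bin by the corresponding mask gain."""
    return [m * g for m, g in zip(mixture, mask)]


# Assumed magnitude spectra for one frame with four frequency bins.
speech_estimate = [4.0, 3.0, 0.5, 0.0]
noise_estimate = [1.0, 1.0, 1.5, 2.0]
mixture = [5.0, 4.0, 2.0, 2.0]

speech_mask, noise_mask = ratio_masks([speech_estimate, noise_estimate])
separated_speech = apply_mask(mixture, speech_mask)
separated_noise = apply_mask(mixture, noise_mask)
print(separated_speech, separated_noise)
```

Because the masks sum to approximately one in every bin, the separated magnitudes add back up to the mixture, so the mask only redistributes energy between the sources.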
Music information retrieval
Online and personal music collections are now huge, as storage space is no longer a limiting factor. Collections of this size have created a need for efficient music information retrieval (MIR) techniques that automatically organize and search through them. Our research team is a world-leading group in audio-based MIR, where information about the music is extracted automatically from the sound signals of the music pieces.
- Automatic musical instrument recognition
Automatically recognizing musical instruments from polyphonic and multi-timbral music signals.
- Singer identification
Identifying the singer from a musical piece
- Lyrics recognition
Automatic recognition of the words/phonemes in singing
- Automatic alignment of singing and lyrics
Synchronization of the lyrics text with the singing voice in a musical piece
- Sound source separation
- Music transcription
Automatically extracting musical notation from music signals.
- Music structure analysis
Subdividing a musical piece into parts and sections at the largest time-scale. In popular music, for example, it is usually possible to identify parts that we label as the chorus, the verse, an introductory section, and so on.
- Structured audio coding
Developing sparse and structured representations for music signals, sound source modeling (parametric and statistical models)
- Music classification
Classification of music according to the style, the genre, or the musical instruments involved
- Musical instrument synthesis
Synthesis of two or more musical instruments, resulting in a new instrument that possesses the acoustic characteristics of the instruments involved
One of the major problems in automatic speech recognition technologies is the sensitivity of recognizers to any interfering sounds. Since natural environments often include other sound sources, the performance of the existing technologies is severely limited. Our research team has been doing pioneering work in the recognition of sounds in mixtures, including speech, music and environmental sounds. This work has resulted in top positions in the international CHiME evaluation campaigns.
- Noise-robust automatic speech recognition
Automatic speech recognition in sound mixtures
Speech synthesis and voice conversion
Speech synthesis refers to the artificial conversion of text into speech. It is needed in applications where speech is the most convenient modality of interaction with the user, for instance in hands-free user interfaces, voice-driven services, and applications for blind users. The aim of voice conversion is to make speech spoken by one speaker sound like the speech of a specific target speaker while preserving the content of the utterance. It can be used, for example, to create a variety of synthesis voices in text-to-speech systems and to generate different voices from a single speaker for dubbing purposes. In addition, knowledge of how to separate speaker identity from speech content can lead to improved results in speaker and speech recognition.
- Corpus-based speech synthesis
Speech synthesis using real recorded speech data as a basis: unit selection and hidden Markov model (HMM) based synthesis
- Hybrid-form synthesis combining unit selection and HMM-based synthesis
Multiform synthesis by combining the two methods
- Exemplar-based voice conversion
Converting the speech of one speaker to sound like the speech of a specific target speaker