The Audio Research Group conducts research in many areas related to audio, speech, and music signals. The work comprises both basic research on core techniques, such as acoustic pattern classification, and applied research targeting specific applications.

Content analysis of audio

Content analysis of general audio aims at automatically extracting relevant information from audio, such as the characteristics of the acoustic scene and the sound sources present. Applications of audio content analysis range from simple classification tasks for enabling context awareness, to security and surveillance systems based on detected sounds of interest, to the organization, indexing, tagging, and querying of large audio databases.


  • Acoustic scene classification
    Classifying recordings into classes that characterize the environment in which they were recorded.
  • Sound event detection and recognition
    Recognizing individual sound events present in the auditory scene.
  • Audio tagging
    Automatically applying descriptive tags to audio signals
  • Audio captioning
    Automatically generating textual descriptions of audio signals
  • Audio-based database query
    Querying large audio databases based on audio content
  • General audio classification
    Classifying audio signals based on many different aspects of the sound.
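As a minimal illustration of the classification pipeline behind tasks like acoustic scene classification, the sketch below computes two hand-picked frame features (energy and zero-crossing rate) and assigns a frame to the class with the nearest centroid. The feature choice and the nearest-centroid classifier are illustrative simplifications; practical systems use richer spectral features and learned models.

```python
import math

def frame_features(frames):
    """Per-frame features: mean energy and zero-crossing rate (a toy choice)."""
    feats = []
    for f in frames:
        energy = sum(x * x for x in f) / len(f)
        zcr = sum(1 for a, b in zip(f, f[1:]) if a * b < 0) / (len(f) - 1)
        feats.append((energy, zcr))
    return feats

def centroid(feats):
    """Mean feature vector of a class, used here as the class model."""
    n = len(feats)
    return tuple(sum(v[i] for v in feats) / n for i in range(len(feats[0])))

def classify_scene(feat, class_centroids):
    """Assign a frame to the class whose centroid is nearest (Euclidean)."""
    return min(class_centroids, key=lambda c: math.dist(feat, class_centroids[c]))
```

Training amounts to computing one centroid per labeled scene class; classification is then a nearest-neighbor lookup per frame, with frame-level decisions typically pooled over a whole recording.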

Spatial audio

Microphone arrays link the physical locations of sound sources to computer software and allow capturing the spatial sound field. Applications of microphone arrays include localization tasks such as speaker localization and speaker position tracking, as well as signal enhancement and separation.

  • Sound localization
    Automatic determination of a sound source's physical location
  • Self-localization
    Automatic localization and synchronization of the device microphones
  • Speaker position tracking
    Tracking the locations of multiple, temporally overlapping speakers and detecting their appearance and disappearance
  • Shooter localization
    Shooter localization and estimation of bullet trajectory, caliber, and speed
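As a minimal sketch of sound localization with a two-microphone array, the code below estimates the time difference of arrival (TDOA) by maximizing the cross-correlation between the two channels and converts the delay to a far-field arrival angle. Plain time-domain correlation is used here for simplicity; practical systems typically use more robust variants such as GCC-PHAT.

```python
import math

def estimate_delay(x, y, max_lag):
    """Estimate how many samples channel y lags channel x by maximizing
    the time-domain cross-correlation over lags in [-max_lag, max_lag]."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(x[i] * y[i + lag] for i in range(len(x)) if 0 <= i + lag < len(y))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def delay_to_angle(delay_samples, fs, mic_distance, c=343.0):
    """Far-field arrival angle in degrees from the inter-microphone delay,
    with fs in Hz, mic_distance in meters, and speed of sound c in m/s."""
    s = c * delay_samples / (fs * mic_distance)
    return math.degrees(math.asin(max(-1.0, min(1.0, s))))
```

A delay of zero samples corresponds to a source directly in front of the array (broadside); the maximum physically possible delay is bounded by the microphone spacing divided by the speed of sound.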

Source separation and signal enhancement

Source separation refers to the task of estimating the signal produced by an individual sound source from a mixture signal consisting of several sources. The techniques can be used to process sounds for human listeners, for example to improve speech intelligibility, or as a preprocessing step for computational analysis methods, since isolated sources can be analyzed and processed with much better accuracy than mixtures of sounds.


  • Singing voice and music separation
    Separation of instrumental tracks from a song
  • Spatial sound source separation
    Separation of sound sources based on their perceived direction
  • Object-based coding of spatial audio
    Separation of sound objects from spatial audio mixtures
  • Speech enhancement and separation
    Enhancement of target speaker's voice from background noise or interfering voices
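A common family of separation methods estimates a time-frequency mask and multiplies it with the mixture spectrogram. The sketch below applies a Wiener-style soft mask to one frame of magnitude values; in practice the source-magnitude estimates come from a learned model, and the STFT analysis and inverse-STFT resynthesis steps around this core are omitted here.

```python
def wiener_mask(src_mag, other_mag, eps=1e-12):
    """Soft mask: the target source's share of the energy in each frequency bin."""
    return [s * s / (s * s + o * o + eps) for s, o in zip(src_mag, other_mag)]

def apply_mask(mix_mag, mask):
    """Estimate the target source by scaling each mixture bin by the mask."""
    return [m * w for m, w in zip(mix_mag, mask)]
```

When the sources do not overlap much in time-frequency, the mask is close to binary and the target source is recovered almost exactly; overlapping bins are shared in proportion to the estimated energies.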

Music information retrieval

Online and personal music collections are nowadays huge, as they are no longer limited by storage constraints. Such collections have created the need for efficient music information retrieval (MIR) techniques to automatically organize and search through them. Our research team is a world-leading research group in audio-based MIR, in which information about the music is extracted automatically from the audio signals of the music pieces.


  • Automatic musical instrument recognition
    Automatically recognizing musical instruments from polyphonic and multi-timbral music signals.
  • Singer identification
    Identifying the singer from a musical piece
  • Lyrics recognition
    Automatic recognition of the words/phonemes in singing
  • Automatic alignment of singing and lyrics
    Synchronization of the lyrics text with the singing voice in a musical piece
  • Sound source separation
  • Music transcription
    Automatically extracting musical notation from music signals.
  • Music structure analysis
    Subdividing a musical piece into parts and sections at the largest time-scale. In popular music, for example, it is usually possible to identify parts that we label as the chorus, the verse, an introductory section, and so on.
  • Structured audio coding
    Developing sparse and structured representations for music signals, sound source modeling (parametric and statistical models)
  • Music classification
    Classification of music according to the style, the genre, or the musical instruments involved
  • Musical instrument synthesis
    Synthesis of two or more musical instruments, resulting in a new instrument that possesses the acoustic characteristics of the instruments involved
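As a minimal illustration of the signal-analysis core of music transcription, the sketch below estimates the fundamental frequency of a monophonic tone by autocorrelation peak picking and maps it to a MIDI note number. Real transcription systems must additionally handle polyphony, note onsets and offsets, and varying timbre.

```python
import math

def estimate_f0(signal, fs, fmin=50.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) by autocorrelation peak picking
    over lags corresponding to the search range [fmin, fmax]."""
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(signal) - 1)
    best_lag, best = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if r > best:
            best, best_lag = r, lag
    return fs / best_lag

def f0_to_midi(f0):
    """Round a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(f0 / 440.0))
```

The lag resolution limits the frequency accuracy at high pitches, which is why practical estimators interpolate around the autocorrelation peak.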

Speech recognition

One of the major problems in automatic speech recognition technologies is the sensitivity of recognizers to any interfering sounds. Since natural environments often include other sound sources, the performance of the existing technologies is severely limited. Our research team has been doing pioneering work in the recognition of sounds in mixtures, including speech, music and environmental sounds. This work has resulted in top positions in the international CHiME evaluation campaigns.


  • Noise-robust automatic speech recognition
    Automatic speech recognition in sound mixtures
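One classical ingredient of noise-robust recognition is enhancing the signal before it reaches the recognizer. The sketch below performs magnitude-domain spectral subtraction, with the noise estimate averaged from noise-only frames; the simple zero flooring is a simplification of practical variants that use a spectral floor and over-subtraction factors.

```python
def spectral_subtraction(frames, noise_frames, floor=0.0):
    """Subtract an average noise-magnitude estimate from each magnitude frame,
    flooring the result at `floor` to avoid negative magnitudes."""
    bins = len(noise_frames[0])
    noise_est = [sum(f[k] for f in noise_frames) / len(noise_frames)
                 for k in range(bins)]
    return [[max(m - n, floor) for m, n in zip(frame, noise_est)]
            for frame in frames]
```

The noise-only frames are typically taken from the beginning of a recording or from pauses detected by a voice activity detector.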

Speech synthesis and voice conversion

Speech synthesis refers to the artificial conversion of text into speech. It is needed in applications where speech is the most convenient modality of interaction with the user: for instance in hands-free user interfaces, voice-driven services, and applications targeted at the blind. The aim of voice conversion is to convert speech spoken by one speaker to sound like the speech of a specific target speaker while maintaining the content of the utterance. It can be used, for example, for creating various synthesis voices in text-to-speech systems, as well as for generating various voices from a single speaker for dubbing purposes. In addition, knowledge of how to separate speaker identity from speech content can lead to improved results in speaker and speech recognition.


  • Corpus-based speech synthesis
    Speech synthesis using real recorded speech data as a basis: unit selection and hidden Markov model (HMM) based synthesis
  • Hybrid-form synthesis combining unit selection and HMM-based synthesis
    Multiform synthesis by combining the two methods
  • Exemplar-based voice conversion
    Converting speech of one speaker to sound like speech of a specific target speaker
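As a toy sketch of the exemplar-based idea, the code below assumes a parallel corpus of time-aligned source/target frame-feature pairs and replaces each input frame with the target frame paired to its nearest source exemplar. Real systems operate on spectral features and combine several exemplars (e.g. by sparse weighting) rather than copying a single one.

```python
import math

def convert_frames(frames, src_exemplars, tgt_exemplars):
    """For each input frame, find the nearest source exemplar (Euclidean)
    and emit the target-speaker frame paired with it."""
    converted = []
    for f in frames:
        nearest = min(range(len(src_exemplars)),
                      key=lambda i: math.dist(f, src_exemplars[i]))
        converted.append(tgt_exemplars[nearest])
    return converted
```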