Online and personal music collections are now huge, since storage space is no longer a limiting factor. Such collections have created the need for efficient music information retrieval (MIR) techniques that automatically organize and search them. Our research team is a world-leading research group in audio-based MIR, where information about the music is extracted automatically from the sound signals of the music pieces.


Automatic musical instrument recognition

Understanding the timbre of musical instruments or drums is important for automatic music transcription, music information retrieval, and computational auditory scene analysis. In particular, the recent worldwide popularization of online music distribution services and portable digital music players makes musical instrument recognition even more important. Instrumentation is one of the main criteria (besides musical genre) that can be used to search for a certain type of music in music databases. Some classical music pieces are even characterized by the instruments used (e.g. piano sonata, string quartet).


Singer identification

Singing is the production of musically relevant sounds with the human voice, and it is employed in most cultures for entertainment or self-expression. The singing voice immediately becomes the main focus of attention when we listen to musical pieces with a vocal part. Singing has two main aspects: melodic (represented by the time-varying pitch) and verbal (represented by the lyrics). Both the melody and the lyrics allow us to identify the song, while the singing voice itself reflects the identity of the singer. Because the singing voice carries so much information, it can be used for different music information retrieval tasks and other applications.

Most people use the singer's voice as the primary cue for identifying a song. Besides genre, the most natural way to classify music is by artist name (often equivalent to the singer's name). A singer identification system would therefore be useful in music information retrieval for labeling songs with their singers. The inherent difficulty lies in the nature of the problem: the voice is usually accompanied by other musical instruments, and even though humans are extremely skillful at recognizing sounds in acoustic mixtures, interfering sounds usually make automatic recognition very difficult.


Automatic alignment of singing and lyrics

This topic deals with aligning music that contains singing voice and instrumental accompaniment with the corresponding textual lyrics, i.e., finding the temporal relationship between the two inputs. The alignment is based on the phonetic transcription of the lyrics, which is aligned with the phonemes in the singing voice content of the audio. The alignment can be directly applied in automated karaoke annotation systems, but it also has potential in automatic labeling of singing databases and in keyword spotting for singing database search. The problem can be viewed as an intermediate goal toward the significantly harder problem of recognizing lyrics in polyphonic audio.
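
The idea of matching an ordered phoneme sequence to audio frames can be illustrated with a small dynamic-programming sketch. This is a generic Viterbi-style forced alignment, not necessarily the method used in our systems. The input `cost` is a hypothetical frame-by-phoneme mismatch matrix (for example, negative log-likelihoods from some acoustic model); the function name `align` is likewise only illustrative.

```python
def align(cost):
    """Monotonic alignment of audio frames to an ordered phoneme sequence.

    cost[t][p] is an assumed mismatch score between frame t and phoneme p.
    Returns a list giving, for each frame, the index of its phoneme.
    """
    T, P = len(cost), len(cost[0])
    if T < P:
        raise ValueError("need at least one frame per phoneme")
    INF = float("inf")
    acc = [[INF] * P for _ in range(T)]   # accumulated cost
    back = [[0] * P for _ in range(T)]    # best predecessor phoneme
    acc[0][0] = cost[0][0]                # the path must start at the first phoneme
    for t in range(1, T):
        for p in range(P):
            stay = acc[t - 1][p]                              # same phoneme continues
            advance = acc[t - 1][p - 1] if p > 0 else INF     # move to the next phoneme
            if advance < stay:
                acc[t][p], back[t][p] = advance + cost[t][p], p - 1
            else:
                acc[t][p], back[t][p] = stay + cost[t][p], p
    # Backtrack from the last phoneme at the last frame.
    path = [P - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The frame indices at which the returned phoneme index changes are the estimated phoneme boundaries; multiplying them by the frame hop gives the timestamps a karaoke display would need.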


Lyrics recognition

The transcription of lyrics using a large vocabulary continuous speech recognizer (LVCSR) is still regarded as a nearly impossible task, for several reasons. First of all, the performance of automatic speech recognition using an LVCSR is limited even for speech. Second, there are important phonetic and timing differences between speech and the singing voice that must be dealt with. Last but not least, real-world music is polyphonic: even with a system that can recognize singing, interference from the instrumental background would significantly degrade its performance. In polyphonic music, lyrics recognition therefore becomes more difficult and relies on separating the vocals from the polyphonic mixture. Still, singing and speech convey similar information and originate from the same physical production mechanism, so it is plausible that singing recognition can be done using standard automatic speech recognition techniques. Even though the results are far from perfect, they have potential for particular tasks such as word spotting, automatic tagging, or song retrieval.


Music transcription

Music transcription means converting previously unannotated music into symbolic form (e.g. MIDI). To transcribe music automatically, the notes and the tempo or beat have to be detected. Over the years, our research group has produced several state-of-the-art results in multiple fundamental frequency analysis, beat tracking, and musical meter analysis. The aim of multiple fundamental frequency analysis is to find the fundamental frequencies of multiple simultaneous sounds. The aim of beat tracking is to find the rhythmic pulse in music that corresponds to the tempo of the piece and matches the "foot-tapping" times of a human listener. The meter consists of the beat (tactus) pulse, together with faster and slower pulses at different time scales.
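
As a rough illustration of how a tempo can be read off a rhythmic pulse, the sketch below picks the beat period that maximizes the autocorrelation of an onset-strength envelope. It is a textbook baseline rather than our beat tracker. The envelope `onset_env`, its frame rate, and the 60 to 180 BPM search range are all assumptions made for the example.

```python
def estimate_tempo_bpm(onset_env, frame_rate_hz, bpm_min=60.0, bpm_max=180.0):
    """Estimate tempo as the lag with the largest autocorrelation.

    onset_env is an assumed per-frame onset-strength signal sampled at
    frame_rate_hz frames per second. The search range is an assumption.
    """
    n = len(onset_env)
    if n < 2:
        raise ValueError("onset envelope is too short")
    mean = sum(onset_env) / n
    x = [v - mean for v in onset_env]  # remove the constant offset
    # Convert the BPM range into a range of lags measured in frames.
    lag_min = max(1, int(round(60.0 * frame_rate_hz / bpm_max)))
    lag_max = min(n - 1, int(round(60.0 * frame_rate_hz / bpm_min)))
    best_lag, best_score = None, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Biased autocorrelation (no length normalization), which mildly
        # favors shorter lags over their integer multiples.
        score = sum(x[i] * x[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        raise ValueError("signal too short for the requested tempo range")
    return 60.0 * frame_rate_hz / best_lag
```

A full beat tracker would additionally estimate the phase of the pulse (where the beats fall), and meter analysis would repeat the search at faster and slower time scales.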

Research results in multiple fundamental frequency analysis and musical meter analysis have been successfully applied to the transcription of polyphonic music and, more specifically, of singing, bass line, and percussion. Our group has also produced several state-of-the-art results in music transcription itself.

  • Singing
    Singing transcription aims at automatically converting a recorded singing signal into a parametric representation, e.g., a MIDI file. It covers singing melody transcription in both monophonic and polyphonic music.
  • Bass line
    Bass line transcription aims at automatically transcribing the bass line in polyphonic music signals.
  • Percussion
    Percussion transcription aims at recognizing the percussive content (drums) of a musical performance and creating a symbolic representation of it. Given an input signal, the system produces a score of the drums played.
  • Polyphonic music
    Polyphonic music transcription aims at transcribing simultaneously sounding notes played by pitched instruments in real-world music of any genre.
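
To make the step from a detected pitch to a MIDI-style note concrete, here is a minimal sketch that quantizes a fundamental frequency track to MIDI note numbers and merges equal consecutive frames into notes. It assumes equal temperament with A4 = 440 Hz (MIDI note 69) and a hypothetical per-frame `f0_track` in which 0 or `None` marks an unvoiced frame. Real transcription systems use considerably more robust note segmentation.

```python
import math

A4_HZ = 440.0   # assumed reference tuning
A4_MIDI = 69    # MIDI note number of A4

def hz_to_midi(f0_hz):
    """Round a fundamental frequency in Hz to the nearest MIDI note number."""
    if f0_hz <= 0:
        raise ValueError("frequency must be positive")
    return round(A4_MIDI + 12 * math.log2(f0_hz / A4_HZ))

def frames_to_notes(f0_track, hop_s):
    """Merge consecutive frames with equal MIDI pitch into notes.

    f0_track: assumed per-frame F0 estimates in Hz; 0 or None means unvoiced.
    hop_s: time between frames in seconds.
    Returns a list of (onset_s, offset_s, midi_note) tuples.
    """
    notes = []
    current = None  # (start_frame, midi_note) of the note being built
    for i, f0 in enumerate(list(f0_track) + [None]):  # sentinel closes the last note
        midi = hz_to_midi(f0) if f0 else None
        if current is not None and midi != current[1]:
            notes.append((current[0] * hop_s, i * hop_s, current[1]))
            current = None
        if current is None and midi is not None:
            current = (i, midi)
    return notes
```

For example, `hz_to_midi(440.0)` gives 69, and a track that stays near 440 Hz for several frames collapses into a single note with that number.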


Music structure analysis

Music structure analysis means subdividing a musical piece into parts and sections at the largest time scale. Popular music pieces in particular have a distinct structure defined by repetitions of different parts (e.g., verse and chorus). Being able to infer the structure from the audio enables several applications, such as easier navigation within the piece, music thumbnailing, and mash-ups.
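
Repetition-based structure can be illustrated with a self-similarity matrix, a common starting point for this kind of analysis (shown here as a generic sketch, not as our specific method). The input `features` is a hypothetical list of per-segment feature vectors, such as chroma or timbre descriptors, and the 0.95 threshold is an arbitrary assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def self_similarity(features):
    """Matrix whose entry [i][j] compares segment i with segment j."""
    return [[cosine(u, v) for v in features] for u in features]

def repeated_pairs(ssm, threshold=0.95):
    """Index pairs (i, j), i < j, whose segments are similar enough to count as repeats."""
    n = len(ssm)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if ssm[i][j] >= threshold]
```

With three segments whose features follow a verse, chorus, verse pattern, `repeated_pairs` would report the pair formed by the first and third segment, which is the kind of cue used to label repeated parts.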

Music classification

Music classification means categorizing music according to style, genre, or the musical instruments involved.


Sound source separation

In music, several instruments or singers are often active at the same time, which makes automatic analysis difficult. In sound source separation, the key idea is to estimate the signal produced by each sound source from a mixture signal consisting of several sources.

Structured audio coding

In object-based audio coding, the original signal is represented as a set of objects that have time-dependent gains. Because these sound objects have to be extracted from the signal, object-based coding is closely related to sound source separation. Recently, non-negative matrix factorization (NMF) applied to the audio spectrogram has been studied for sound separation, and it has provided promising results for extracting sound sources from a mixture signal. Our research team therefore launched a study of audio compression based on the NMF representation.
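
The NMF representation can be sketched as follows: a non-negative magnitude spectrogram V (frequency by time) is approximated by W times H, where each column of W is the spectrum of one object and each row of H is its time-dependent gain. The code uses the standard Lee-Seung multiplicative updates for the squared Euclidean error, which is one common choice and an assumption here, not a description of our codec. The pure-Python matrix helpers are written out only to keep the example dependency-free.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Euclidean (Frobenius) error between V and W times H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (freq x time) into W (freq x rank) and H (rank x time)."""
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(F)]
    H = [[rng.random() + eps for _ in range(T)] for _ in range(rank)]
    for _ in range(n_iter):
        # Gains: H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # Spectra: W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. For coding purposes the appeal is that W and H together can contain far fewer numbers than V when the rank is small.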