Our research group has published an open dataset for sound event detection research, TUT Sound Events 2017. The dataset consists of recordings of street acoustic scenes with various levels of traffic and other activity. The scene was selected as an environment of interest for the detection of sound events related to human activities and hazard situations. The dataset is released in two parts, a development dataset and an evaluation dataset, both of which can be downloaded from Zenodo.

The dataset was collected in Finland between June 2015 and January 2016, and the data collection has received funding from the European Research Council.


Recording and annotation procedure

Each recording was captured on a different street. For each recording location, a 3-5 minute long audio recording was captured. The equipment consisted of a binaural Soundman OKM II Klassik/studio A3 electret in-ear microphone and a Roland Edirol R-09 wave recorder, using a 44.1 kHz sampling rate and 24-bit resolution. For audio material recorded in private places, written consent was obtained from all people involved.

Individual sound events in each recording were annotated by the same person, using freely chosen labels for the sounds. Nouns were used to characterize the sound source and verbs the sound production mechanism, using a noun-verb pair whenever possible. The annotator was instructed to annotate all audible sound events, to decide the start and end times of the sounds as they saw fit, and to choose the event labels freely. This resulted in a large set of raw labels.

Target sound event classes were selected to represent common sounds related to human presence and traffic. First, the raw labels were mapped onto classes described by their sound source: for example, "car passing by", "car engine running", and "car idling" were merged into "car"; sounds produced by buses and trucks into "large vehicle"; and "children yelling" and "children talking" into "children". Target classes were then selected based on the frequency of the mapped labels, yielding the most common sounds for the street acoustic scene, in sufficient numbers for learning acoustic models.

Selected sound classes for the task are:

  • brakes squeaking
  • car
  • children
  • large vehicle
  • people speaking
  • people walking
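The mapping from raw labels to the target classes above can be sketched as a simple lookup table. This is an illustrative sketch only: the raw labels and the mapping entries below are hypothetical examples in the spirit of the description, not the actual annotation files of the dataset.

```python
# Hypothetical mapping from freely chosen raw labels to target classes.
# Entries are illustrative; the real dataset uses its own label set.
LABEL_MAP = {
    "car passing by": "car",
    "car engine running": "car",
    "car idling": "car",
    "bus passing by": "large vehicle",
    "truck passing by": "large vehicle",
    "children yelling": "children",
    "children talking": "children",
}

def map_label(raw_label):
    """Return the target class for a raw label, or None if it is not
    among the selected target classes."""
    return LABEL_MAP.get(raw_label)

raw_annotations = ["car idling", "children yelling", "birds singing"]
mapped = [map_label(label) for label in raw_annotations]
# Labels that map to None fall outside the selected target classes
# and are not used as reference events.
```

Raw labels without a mapping (such as "birds singing" above) are simply excluded from the reference annotations rather than forced into a class.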

Due to the high level of subjectivity inherent in the annotation process, the reference annotations were verified using these mapped classes. Three persons (other than the annotator) listened to each audio segment annotated as belonging to one of these classes, marking agreement or disagreement about the presence of the indicated sound within the segment. Agreement did not take the sound event onset and offset into account, only the presence of the sound event within the annotated segment. Event instances confirmed by at least one person were kept, which eliminated about 10% of the original event instances in the development set.
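The verification rule above, keep an event if at least one of the three listeners confirms it, can be sketched as follows. The data structure (event label paired with a list of three boolean judgments) is a hypothetical representation chosen for illustration, not the dataset's actual file format.

```python
# Sketch of the verification filter: an event instance is kept if at
# least one of the three verifiers marked the sound as present.
# The (label, votes) representation is hypothetical.

def filter_verified(events):
    """events: list of (label, votes) pairs, where votes is a list of
    three booleans, one per verifier. Keep events with any True vote."""
    return [(label, votes) for label, votes in events if any(votes)]

events = [
    ("car", [True, False, False]),        # kept: one confirmation suffices
    ("children", [False, False, False]),  # discarded: no one confirmed it
    ("people walking", [True, True, True]),
]
kept = filter_verified(events)
```

Note that the rule is deliberately lenient: a single confirmation is enough, so only events that no verifier could hear are removed.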