Below are three columns. Each column contains an audio player with three textual descriptions (captions) beneath it. The three captions correspond to the sound played by the audio player and are:
- Predicted caption:
- the exact caption predicted by our method. The maximum length of a predicted caption is 10 words.
- Processed caption:
- a processed version of the original caption, which is also the target output. The processing consisted of trimming the caption to a maximum of 10 words, converting all letters to lowercase, removing infrequent words, removing punctuation, and removing words that do not appear in UK or USA English dictionaries (according to the GNU Aspell dictionaries).
- Original caption:
- the original caption, as given in the metadata associated with each sound. Original captions were used only to create the Processed captions and are listed here only for reference. To keep the page readable, the original caption is hidden until you click on "Original caption:".
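The processing steps listed for the Processed caption can be sketched roughly as follows. This is only an illustrative sketch, not the code used for the experiments: the function name, the `dictionary` set (standing in for the GNU Aspell word lists), the `word_counts` table, the `min_count` threshold, and the order in which the steps are applied are all assumptions made for the example.

```python
import string
from collections import Counter

MAX_WORDS = 10  # from the text: captions are trimmed to at most 10 words


def process_caption(caption, dictionary, word_counts, min_count=2):
    """Turn an original caption into a processed caption.

    All parameters are hypothetical stand-ins:
    - dictionary: set of accepted English words (e.g. loaded from Aspell lists)
    - word_counts: mapping from word to its frequency in the whole dataset
    - min_count: assumed threshold below which a word counts as infrequent
    """
    # Convert to lowercase and remove punctuation characters.
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    # Drop words missing from the dictionary or too rare in the dataset.
    kept = [
        w for w in words
        if w in dictionary and word_counts.get(w, 0) >= min_count
    ]
    # Trim to the maximum caption length (assumed here to be the last step).
    return " ".join(kept[:MAX_WORDS])


# Toy usage with a made-up dictionary and made-up counts.
toy_dictionary = {"a", "dog", "barks", "loudly"}
toy_counts = Counter({"a": 5, "dog": 3, "barks": 2, "loudly": 1})
print(process_caption("A Dog barks, LOUDLY! xyzzy", toy_dictionary, toy_counts))
# prints "a dog barks"
```

In the toy usage, "loudly" is dropped as infrequent and "xyzzy" is dropped because it is not in the dictionary.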
The columns correspond to a categorization of the predicted captions according to the employed metrics.
- The Good
- Predicted captions in the first column scored well on the metrics. This means that the words appearing in the Predicted caption are the same (or almost the same), and in the same order, as the ones in the Processed caption.
- The (not so) Bad
- Predicted captions in the second column do not have a good metric score, but adequately describe the sound of the audio player. This means that the Predicted caption adequately describes the sound but might not contain any of the words in the Processed caption.
- (and) The Ugly
- Predicted captions in the third column neither have a good metric score nor adequately describe the sound of the audio player.
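To illustrate why an adequate caption can still score poorly, here is a minimal sketch of a word-overlap score. The metrics actually employed are not named in this description, so the clipped n-gram precision below (a building block of BLEU-style metrics) and the function name `ngram_precision` are assumptions chosen only for illustration.

```python
from collections import Counter


def ngram_precision(predicted, reference, n=1):
    """Fraction of n-grams in `predicted` that also occur in `reference`.

    Illustrative only: an assumed stand-in for the employed metrics.
    Counts are clipped, so a repeated n-gram is credited at most as
    many times as it appears in the reference.
    """
    pred = predicted.split()
    ref = reference.split()
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return matched / total


# Same words in the same order: high score ("The Good").
print(ngram_precision("a dog barks", "a dog barks loudly", n=2))  # prints 1.0
# Adequate description, but no shared words: zero score ("The (not so) Bad").
print(ngram_precision("canine woofing", "a dog barks"))  # prints 0.0
```

With `n=2` the score also depends on word order, so "barks dog a" scored against "a dog barks" gets full unigram precision but zero bigram precision.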
All sounds and original descriptions are drawn from the ProSounds Effect Library, available here!
Our method was tested on 2960 audio files with their associated captions. The resulting metrics for our method are: