Audio Captioning Demo

Below you can find three columns. Each entry in a column consists of an audio player with three textual descriptions (captions) beneath it. The three captions describe the sound that you can hear from the audio player and are:

Predicted caption:: the exact caption predicted by our method. The maximum length of a predicted caption is 10 words.
Processed caption:: a processed version of the original caption, used as the target output. The processing consisted of trimming the caption to a maximum of 10 words, converting letters to lowercase, removing infrequent words, removing punctuation, and removing words that do not appear in UK or US English dictionaries (according to the GNU Aspell dictionaries).
Original caption:: the original caption, as given in the metadata associated with each sound. Original captions were used only to create the Processed captions and are listed here only for reference. To keep the page readable, the original caption appears only after you click on "Original caption:".
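The processing steps listed above can be sketched roughly as follows. This is only an illustration, not the code used for the demo: the function name process_caption and the vocabulary argument are hypothetical, the vocabulary set stands in for both the frequency filter and the GNU Aspell dictionary check, and the order of the steps (trimming applied last) is an assumption.

```python
import string

MAX_WORDS = 10  # maximum caption length stated in the text


def process_caption(original, vocabulary, max_words=MAX_WORDS):
    """Return a processed caption as a list of lowercase words.

    `vocabulary` is a hypothetical set of allowed words; it stands in for
    the infrequent-word filter and the dictionary check.
    """
    # Replace every punctuation character with a space, then lowercase.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = original.translate(table).lower().split()
    # Keep only the words found in the allowed vocabulary.
    kept = [word for word in words if word in vocabulary]
    # Trim to the maximum number of words (assumed to be the last step).
    return kept[:max_words]


vocab = {"music", "guitar", "rock", "distortion"}
print(" ".join(process_caption("Music, Guitar, Rock, Distortion", vocab)))
# prints: music guitar rock distortion
```

With a suitable vocabulary, the same function maps "Water, Splash, Lunge 12" to "water splash 12", because "lunge" is not in the allowed set.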

The columns categorize the predicted captions according to the employed metrics:

The Good: Predicted captions in the first column scored well on the metrics. This means that the words appearing in the Predicted caption are the same (or almost the same) as the ones in the Processed caption, and in the same order.
The (not so) Bad: Predicted captions in the second column do not have a good metric score, but adequately describe the sound of the audio player. This means that the Predicted caption describes the sound adequately but might not contain any of the words in the Processed caption.
(and) The Ugly: Predicted captions in the third column neither have a good metric score nor adequately describe the sound of the audio player.
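To make the link between word overlap and metric score concrete, here is a minimal sketch of clipped n-gram precision, the quantity at the core of the BLEU scores. It is not the evaluation code used for the demo and it is not a full BLEU implementation (there is no brevity penalty and no averaging over n-gram orders); the function names are hypothetical.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def clipped_precision(predicted, reference, n):
    """Fraction of predicted n-grams that also occur in the reference.

    Each predicted n-gram is counted at most as often as it occurs in
    the reference (the "clipping" used in BLEU).
    """
    pred_counts = Counter(ngrams(predicted.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0  # the prediction is too short to contain any n-gram
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in pred_counts.items())
    return matched / total


# A pair from the first column: all words match, but one bigram is broken.
print(clipped_precision("music guitar distortion", "music guitar rock distortion", 1))  # 1.0
print(clipped_precision("music guitar distortion", "music guitar rock distortion", 2))  # 0.5
# A pair from the second column: no word overlap at all.
print(clipped_precision("monster alien", "door", 1))  # 0.0
```

The last pair shows why a caption can describe the sound adequately and still score poorly: the measure rewards only exact word matches with the Processed caption.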

All sounds and original descriptions are drawn from the ProSounds Effect Library, available here!

Our method was tested on 2960 audio files with their associated captions. The resulting metric scores for our method are:

BLEU₁	BLEU₂	BLEU₃	BLEU₄	ROUGE_L	METEOR	CIDEr
0.191 ±0.004	0.129 ±0.003	0.106 ±0.003	0.094 ±0.003	0.149 ±0.002	0.092 ±0.002	0.526 ±0.012

The Good

Predicted caption:: music guitar distortion
Processed caption:: music guitar rock distortion
Original caption:: Music, Guitar, Rock, Distortion

Predicted caption:: water splash fast
Processed caption:: water splash 12
Original caption:: Water, Splash, Lunge 12

Predicted caption:: college clock striking
Processed caption:: college clock striking five
Original caption:: Balliol College clock striking five o'clock.

Predicted caption:: trailer dark 2
Processed caption:: trailer dark feedback
Original caption:: Trailer, Dark, Feedback

Predicted caption:: police radio signal
Processed caption:: three radio signal
Original caption:: Morse Code, "Three", Radio, Weak Signal

Predicted caption:: voice clip male police dispatch radio radio
Processed caption:: voice clip male police dispatch radio
Original caption:: Voice Clip, Male, Police Dispatch Radio, "I Need Transport"

Predicted caption:: bullet shell drop cement surface
Processed caption:: bullet shell 45 caliber drop cement surface
Original caption:: Bullet Shell, .45 Caliber, Drop, Cement Surface

Predicted caption:: impact hit heavy hard metal
Processed caption:: impact hit heavy hard metal debris
Original caption:: Impact, Hit, Heavy, Hard, Metal, Debris

Predicted caption:: ambience city region station
Processed caption:: ambience city traffic
Original caption:: Ambience, City, Traffic, Europe

Predicted caption:: gunshot revolver single shot 44 distant perspective microphone microphone
Processed caption:: gunshot pistol springfield single shot caliber distant perspective exterior microphone
Original caption:: Gunshot, Pistol, Springfield XDM40, Single Shot, .40 Caliber, Distant Perspective, Exterior, Microphone Towards Down Range

The (not so) Bad

Predicted caption:: chime musical
Processed caption:: cartoon
Original caption:: boing cartoon 15

Predicted caption:: crowd general with speech
Processed caption:: in hall mixed chatter quiet people
Original caption:: Audience in hall, mixed chatter, fairly quiet. (30 people)

Predicted caption:: water stream
Processed caption:: loop
Original caption:: Creek Flow, Loop

Predicted caption:: monster growl
Processed caption:: zombie creature heavy monster
Original caption:: Zombie Creature Breathe, Heavy, Monster, Creature

Predicted caption:: bullet wood empty
Processed caption:: hand
Original caption:: Suction Cup, Hand

Predicted caption:: monster alien
Processed caption:: door
Original caption:: stone, drag, stone door, manhole, dragging

Predicted caption:: industrial ambience room
Processed caption:: house monster
Original caption:: haunted, house, Halloween, spooky, ghost, ghosts, scary, fright, haunt, spook, fear, monster, monsters

Predicted caption:: body fall crack hit roll
Processed caption:: rock drop forest heavy
Original caption:: Rock Drop, Forest, Heavy

Predicted caption:: on motor
Processed caption:: and
Original caption:: JCB excavator levelling and excavating.

Predicted caption:: power
Processed caption:: crescendo
Original caption:: Crescendo

(and) The Ugly

Predicted caption:: open doors
Processed caption:: siren
Original caption:: Ferry: CrossChannel, `Dover', bridge, siren sounded.

Predicted caption:: debris car break
Processed caption:: typewriter bell
Original caption:: Typewriter Return, Bell

Predicted caption:: girl horror
Processed caption:: musical
Original caption:: Whooshes, Musical

Predicted caption:: wood shake
Processed caption:: plastic bag carpet surface metal
Original caption:: Plastic Bag Drag, Carpet Surface, Metal Objects

Predicted caption:: car door slide central wood slow
Processed caption:: distortion
Original caption:: Whooshes, Distortion

Predicted caption:: new chimes clock and
Processed caption:: fire engine with horn
Original caption:: Fire engine departs with horn

Predicted caption:: laugh clip
Processed caption:: animals birds crow bird
Original caption:: Animals, Birds, Crow, Caw, Bird

Predicted caption:: flying horses walk voice grass loud
Processed caption:: cu from being by adult food at and bird island
Original caption:: Wandering Albatross. CU frenetic bill tapping from chick being fed by adult. Food transfer at 2m25s and 3m28s. Top Meadows, Bird Island

Predicted caption:: beeps start up system
Processed caption:: cat
Original caption:: cat meow 17

Predicted caption:: church atmosphere species low
Processed caption:: interior whistle moves off and speed
Original caption:: Interior, guard's whistle, moves off and gathers speed.
