Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeia and sound source image

Implementation
GitHub

Authors
Hien Ohnaka¹,², Shinnosuke Takamichi², Keisuke Imoto³, Yuki Okamoto⁴, Kazuki Fujii², Hiroshi Saruwatari²
(¹: National Institute of Technology, Tokuyama College, Japan. ²: The University of Tokyo, Japan. ³: Doshisha University, Japan. ⁴: Ritsumeikan University, Japan.)

Audio Sample #1 (Synthesized sounds from different inputs)
(“Reconstructed” denotes sound resynthesized from the ground-truth mel-spectrogram by a neural vocoder.)

Environmental Sound	Reconstructed	Onomatopoeia	Visual Onomatopoeia (proposed)
Trashbox
Cup
Whistle
Tear
Clock

Audio Sample #2 (Synthesized sounds with word-level augmentation)

Audio Sample #2-Onomatopoeia (with word-level augmentation) (Trashbox)

Word Repetition (times)	0	2	4
Mel-spec.
Sound
Input Text	ドーン (/dong/)	ドーンドーンドーン (/dongdongdong/)	ドーンドーンドーンドーンドーン (/dongdongdongdongdong/)

Audio Sample #2-Onomatopoeia (without word-level augmentation) (Trashbox)

Word Repetition (times)	0	2	4
Mel-spec.
Sound
Input Text	ドーン (/dong/)	ドーンドーンドーン (/dongdongdong/)	ドーンドーンドーンドーンドーン (/dongdongdongdongdong/)
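
The Input Text rows above repeat the whole onomatopoeic word 0, 2, or 4 extra times. A minimal sketch of how such test inputs could be built is given below. The function name `repeat_word` is hypothetical and chosen for illustration; this is not taken from the authors' implementation and says nothing about how augmentation is applied during training.

```python
def repeat_word(word: str, times: int) -> str:
    """Return `word` followed by `times` extra copies of itself.

    `times` counts additional repetitions, matching the
    "Word Repetition (times)" row: 0 leaves the word unchanged.
    """
    if times < 0:
        raise ValueError("times must be non-negative")
    return word * (times + 1)


if __name__ == "__main__":
    # Build the three inputs used in the Trashbox tables.
    for n in (0, 2, 4):
        print(n, repeat_word("ドーン", n))
```

With `times = 2` this yields ドーンドーンドーン, the same string shown in the middle column.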

Audio Sample #2-Visual Onomatopoeia (proposed) (with word-level augmentation) (Whistle)

Word Repetition (times)	0	2	4
Mel-spec.
Sound
Input Image	(/beep/)	(/beepbeepbeep/)	(/beepbeepbeepbeepbeep/)

Audio Sample #2-Visual Onomatopoeia (proposed) (without word-level augmentation) (Whistle)

Word Repetition (times)	0	2	4
Mel-spec.
Sound
Input Image	(/beep/)	(/beepbeepbeep/)	(/beepbeepbeepbeepbeep/)

Audio Sample #3 (Synthesized sounds with character-level augmentation)

Audio Sample #3-Onomatopoeia (with character-level augmentation) (Clock)

Character Repetition (times)	0	5	10
Mel-spec.
Sound
Input Text	チリ (/ding/)	チリリリリリリ (/diiiiiing/)	チリリリリリリリリリリリ (/diiiiiiiiiiing/)

Audio Sample #3-Onomatopoeia (without character-level augmentation) (Clock)

Character Repetition (times)	0	5	10
Mel-spec.
Sound
Input Text	チリ (/ding/)	チリリリリリリ (/diiiiiing/)	チリリリリリリリリリリリ (/diiiiiiiiiiing/)
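
The Input Text rows above lengthen the onomatopoeia by repeating a single character 0, 5, or 10 extra times. The sketch below shows one way to generate such inputs. The helper `repeat_char` and its `index` parameter are hypothetical names introduced for illustration, and which character gets repeated is an assumption read off the examples (the final リ of チリ).

```python
def repeat_char(text: str, index: int, times: int) -> str:
    """Insert `times` extra copies of text[index] right after that character.

    `times` counts additional repetitions, matching the
    "Character Repetition (times)" row: 0 leaves the text unchanged.
    """
    if times < 0:
        raise ValueError("times must be non-negative")
    if not -len(text) <= index < len(text):
        raise IndexError("index out of range")
    index %= len(text)  # normalise negative indices such as -1
    return text[: index + 1] + text[index] * times + text[index + 1 :]


if __name__ == "__main__":
    # Repeat the last character of チリ, as in the Clock tables.
    for n in (0, 5, 10):
        print(n, repeat_char("チリ", -1, n))
```

Repeating the last character of チリ five extra times gives チ followed by six リ, the string in the middle column.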

Audio Sample #3-Visual Onomatopoeia (proposed) (with character-level augmentation) (Maracas)

Character Repetition (times)	0	5	10
Mel-spec.
Sound
Input Image	(/shakeshakeshake/)	(/shakeshakeshakeeeeee/)	(/shakeshakeshakeeeeeeeeeee/)

Audio Sample #3-Visual Onomatopoeia (proposed) (without character-level augmentation) (Maracas)

Character Repetition (times)	0	5	10
Mel-spec.
Sound
Input Image	(/shakeshakeshake/)	(/shakeshakeshakeeeeee/)	(/shakeshakeshakeeeeeeeeeee/)

Audio Sample #4 (Synthesized sounds with image stretching-based duration control) (Drum)

Stretch Ratio	Without Stretch	0.5	1.0	1.5	2.0
Mel-spec.
Sound
Input Image
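
To make the Stretch Ratio row concrete, the sketch below resizes an image along its width by a given ratio using nearest-neighbour sampling. It assumes the stretch is applied horizontally and that the image is a plain list of pixel rows; both are assumptions for illustration, and the function name `stretch_width` is hypothetical. The interpolation method used by the authors is not stated on this page.

```python
from typing import List, Sequence


def stretch_width(image: Sequence[Sequence[int]], ratio: float) -> List[List[int]]:
    """Stretch `image` horizontally by `ratio` with nearest-neighbour sampling.

    `image` is a sequence of equally long pixel rows. A ratio of 2.0 doubles
    the width, 0.5 halves it, and 1.0 returns an unchanged copy.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not image or not image[0]:
        return [list(row) for row in image]
    src_w = len(image[0])
    dst_w = max(1, round(src_w * ratio))
    # Map every output column back to the source column it samples from.
    cols = [min(src_w - 1, int(x * src_w / dst_w)) for x in range(dst_w)]
    return [[row[c] for c in cols] for row in image]


if __name__ == "__main__":
    tiny = [[0, 1, 2, 3],
            [4, 5, 6, 7]]
    for r in (0.5, 1.0, 1.5, 2.0):
        print(r, stretch_width(tiny, r))
```

A real pipeline would more likely rely on an image library's resize routine with a smoother filter; nearest-neighbour is used here only to keep the example free of dependencies.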

Audio Sample #5 (Synthesized sounds with visual onomatopoeia & sound event image)

Event Image Type	Reconstructed (ref)	Line Drawing	Photo	Label
Input	(Hitting sound of a metal bat)
Sound
Auxiliary Information				"Baseball Bat"
Input	(Hitting sound of a wooden bat)
Sound
Auxiliary Information				"Baseball Bat"
Input	(Hitting sound of a plastic bat)
Sound
Auxiliary Information				"Baseball Bat"