Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeia and sound source image


Paper
arXiv preprint

Implementation
GitHub

Authors
Hien Ohnaka (1, 2), Shinnosuke Takamichi (2), Keisuke Imoto (3), Yuki Okamoto (4), Kazuki Fujii (2), Hiroshi Saruwatari (2)
(1: National Institute of Technology, Tokuyama College, Japan. 2: The University of Tokyo, Japan. 3: Doshisha University, Japan. 4: Ritsumeikan University, Japan.)


Audio Sample #1 (Synthesized sounds from different inputs)
(“Reconstructed” denotes sound reconstructed from the ground-truth mel-spectrogram by a neural vocoder.)

Environmental Sound / Reconstructed / Onomatopoeia / Visual Onomatopoeia (proposed)
Trashbox
Cup
Whistle
Tear
Clock

Audio Sample #2 (Synthesized sounds with word-level augmentation)

Audio Sample #2-Onomatopoeia (with word-level augmentation) (Trashbox)

Word Repetition (times) 0 / 2 / 4
Mel-spec.
Sound
Input Text
ドーン (/dong/)
ドーンドーンドーン (/dongdongdong/)
ドーンドーンドーンドーンドーン (/dongdongdongdongdong/)

Audio Sample #2-Onomatopoeia (without word-level augmentation) (Trashbox)

Word Repetition (times) 0 / 2 / 4
Mel-spec.
Sound
Input Text
ドーン (/dong/)
ドーンドーンドーン (/dongdongdong/)
ドーンドーンドーンドーンドーン (/dongdongdongdongdong/)

Audio Sample #2-Visual Onomatopoeia (proposed) (with word-level augmentation) (Whistle)

Word Repetition (times) 0 / 2 / 4
Mel-spec.
Sound
Input Image
(/beep/)
(/beepbeepbeep/)
(/beepbeepbeepbeepbeep/)

Audio Sample #2-Visual Onomatopoeia (proposed) (without word-level augmentation) (Whistle)

Word Repetition (times) 0 / 2 / 4
Mel-spec.
Sound
Input Image
(/beep/)
(/beepbeepbeep/)
(/beepbeepbeepbeepbeep/)
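
As a reading aid for the Audio Sample #2 tables, the repeated inputs appear to be formed by appending extra copies of the base word: a repetition count of 0 is the word itself, and a count of 4 is five copies in a row. The Python sketch below only illustrates that reading of the table entries. `repeat_word` is a hypothetical helper introduced here for illustration; it is not taken from the authors' implementation and does not reproduce the training-time augmentation itself.

```python
def repeat_word(word: str, repetitions: int) -> str:
    """Return `word` followed by `repetitions` extra copies of itself.

    Hypothetical helper: mirrors how the table inputs appear to be built,
    where "Word Repetition (times) 0" is the unrepeated word.
    """
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    return word * (repetitions + 1)


if __name__ == "__main__":
    # Repetition counts used in the Audio Sample #2 tables.
    for n in (0, 2, 4):
        print(n, repeat_word("ドーン", n))
```

Under this reading, the Bonus #1 input (100 repetitions) would correspond to `repeat_word("ドーン", 100)`, assuming the same base word is used there.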

Audio Sample #3 (Synthesized sounds with character-level augmentation)

Audio Sample #3-Onomatopoeia (with character-level augmentation) (Clock)

Character Repetition (times) 0 / 5 / 10
Mel-spec.
Sound
Input Text
チリ (/ding/)
チリリリリリリ (/diiiiiing/)
チリリリリリリリリリリリ (/diiiiiiiiiiing/)

Audio Sample #3-Onomatopoeia (without character-level augmentation) (Clock)

Character Repetition (times) 0 / 5 / 10
Mel-spec.
Sound
Input Text
チリ (/ding/)
チリリリリリリ (/diiiiiing/)
チリリリリリリリリリリリ (/diiiiiiiiiiing/)

Audio Sample #3-Visual Onomatopoeia (proposed) (with character-level augmentation) (Maracas)

Character Repetition (times) 0 / 5 / 10
Mel-spec.
Sound
Input Image
(/shakeshakeshake/)
(/shakeshakeshakeeeeee/)
(/shakeshakeshakeeeeeeeeeee/)

Audio Sample #3-Visual Onomatopoeia (proposed) (without character-level augmentation) (Maracas)

Character Repetition (times) 0 / 5 / 10
Mel-spec.
Sound
Input Image
(/shakeshakeshake/)
(/shakeshakeshakeeeeee/)
(/shakeshakeshakeeeeeeeeeee/)
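
For the text inputs in the Audio Sample #3 tables, the character-level variants appear to extend the final character of the onomatopoeia: チリ with 5 repetitions is チ followed by six リ. The Python sketch below illustrates that reading only. `repeat_last_char` is a hypothetical helper, not the authors' code, and the assumption that the last character is always the one repeated is inferred from the Clock example alone.

```python
def repeat_last_char(text: str, repetitions: int) -> str:
    """Append `repetitions` extra copies of the final character of `text`.

    Hypothetical helper: "Character Repetition (times) 0" leaves the
    input unchanged, matching the first column of the tables.
    """
    if not text:
        raise ValueError("text must not be empty")
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    return text + text[-1] * repetitions


if __name__ == "__main__":
    # Repetition counts used in the Audio Sample #3 tables.
    for n in (0, 5, 10):
        print(n, repeat_last_char("チリ", n))
```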

Audio Sample #4 (Synthesized sounds with image stretching-based duration control) (Drum)

Stretch Ratio Without Stretch / 0.5 / 1.0 / 1.5 / 2.0
Mel-spec.
Sound
Input Image

Audio Sample #5 (Synthesized sounds with visual onomatopoeia & sound event image)

Event Image Type Reconstructed (ref) / Line Drawing / Photo / Label

Input (Hitting sound by metal bat)
Sound
Auxiliary Information "Baseball Bat"

Input (Hitting sound by wooden bat)
Sound
Auxiliary Information "Baseball Bat"

Input (Hitting sound by plastic bat)
Sound
Auxiliary Information "Baseball Bat"



Bonus Audio Samples

Bonus #1: Sound synthesized from an input with 100 word repetitions (Trashbox)

Word Repetition (times) 100
Mel-spec.
Sound

Bonus #2: Visual Onomatopoeia with variable image stretching (Whistle)

Stretch Ratio 12 / 24 / 36 / 48 / 60
Mel-spec.
Sound
Input Image
(/beepbeepbeepbeepbeep/)