Paper: arXiv preprint
Implementation: GitHub
Authors
Hien Ohnaka (1, 2),
Shinnosuke Takamichi (2),
Keisuke Imoto (3),
Yuki Okamoto (4),
Kazuki Fujii (2),
Hiroshi Saruwatari (2)
(1: National Institute of Technology, Tokuyama College, Japan.
2: The University of Tokyo, Japan.
3: Doshisha University, Japan.
4: Ritsumeikan University, Japan.)
Audio Sample #1 (Synthesized sounds from different inputs)
(“Reconstructed” denotes sound reconstructed from ground-truth mel-spectrograms by a neural vocoder.)
Environmental Sound | |||
Trashbox | |||
Cup | |||
Whistle | |||
Tear | |||
Clock |
Audio Sample #2 (Synthesized sounds with word-level augmentation)
Audio Sample #2-Onomatopoeia (with word-level augmentation) (Trashbox)
Word Repetition (times) | 1 | 3 | 5 |
Mel-spec. | | | |
Sound | | | |
Input Text | (/dong/) | (/dongdongdong/) | (/dongdongdongdongdong/) |
Audio Sample #2-Onomatopoeia (without word-level augmentation) (Trashbox)
Word Repetition (times) | 1 | 3 | 5 |
Mel-spec. | | | |
Sound | | | |
Input Text | (/dong/) | (/dongdongdong/) | (/dongdongdongdongdong/) |
Audio Sample #2-Visual Onomatopoeia (proposed) (with word-level augmentation) (Whistle)
Word Repetition (times) | 1 | 3 | 5 |
Mel-spec. | | | |
Sound | | | |
Input Image | (/beep/) | (/beepbeepbeep/) | (/beepbeepbeepbeepbeep/) |
Audio Sample #2-Visual Onomatopoeia (proposed) (without word-level augmentation) (Whistle)
Word Repetition (times) | 1 | 3 | 5 |
Mel-spec. | | | |
Sound | | | |
Input Image | (/beep/) | (/beepbeepbeep/) | (/beepbeepbeepbeepbeep/) |
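The inputs in the Audio Sample #2 tables are a base onomatopoeia repeated a whole number of times. The sketch below shows how such input strings could be built. `repeat_word` is a hypothetical helper introduced here for illustration; it is not taken from the authors' implementation and does not describe how the augmentation is applied during training.

```python
def repeat_word(word: str, times: int) -> str:
    """Return `word` concatenated `times` times (hypothetical helper)."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return word * times


# Build the three input strings used in the Trashbox and Whistle tables.
dong_inputs = [repeat_word("dong", n) for n in (1, 3, 5)]
beep_inputs = [repeat_word("beep", n) for n in (1, 3, 5)]
```

For example, `repeat_word("dong", 3)` gives `"dongdongdong"`, which matches the second input in the Trashbox tables.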
Audio Sample #3 (Synthesized sounds with character-level augmentation)
Audio Sample #3-Onomatopoeia (with character-level augmentation) (Clock)
Character Repetition (times) | 0 | 5 | 10 |
Mel-spec. | | | |
Sound | | | |
Input Text | (/ding/) | (/diiiiiing/) | (/diiiiiiiiiiing/) |
Audio Sample #3-Onomatopoeia (without character-level augmentation) (Clock)
Character Repetition (times) | 0 | 5 | 10 |
Mel-spec. | | | |
Sound | | | |
Input Text | (/ding/) | (/diiiiiing/) | (/diiiiiiiiiiing/) |
Audio Sample #3-Visual Onomatopoeia (proposed) (with character-level augmentation) (Maracas)
Character Repetition (times) | 0 | 5 | 10 |
Mel-spec. | | | |
Sound | | | |
Input Image | (/shakeshakeshake/) | (/shakeshakeshakeeeeee/) | (/shakeshakeshakeeeeeeeeeee/) |
Audio Sample #3-Visual Onomatopoeia (proposed) (without character-level augmentation) (Maracas)
Character Repetition (times) | 0 | 5 | 10 |
Mel-spec. | | | |
Sound | | | |
Input Image | (/shakeshakeshake/) | (/shakeshakeshakeeeeee/) | (/shakeshakeshakeeeeeeeeeee/) |
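The inputs in the Audio Sample #3 tables lengthen one character of the onomatopoeia by repeating it a number of extra times (0, 5, or 10). The sketch below shows how such strings could be built. `repeat_character` and its `index` parameter are hypothetical names introduced for illustration; they are not taken from the authors' implementation.

```python
def repeat_character(text: str, index: int, extra: int) -> str:
    """Insert `extra` additional copies of text[index] right after it.

    Hypothetical helper: `index` selects the character to lengthen and
    `extra` is the number of added repetitions (0 leaves text unchanged).
    """
    if not 0 <= index < len(text):
        raise ValueError("index out of range")
    if extra < 0:
        raise ValueError("extra must be non-negative")
    return text[: index + 1] + text[index] * extra + text[index + 1 :]


# Lengthen the "i" in "ding" and the final "e" in "shakeshakeshake".
ding_inputs = [repeat_character("ding", 1, n) for n in (0, 5, 10)]
shake = "shakeshakeshake"
shake_inputs = [repeat_character(shake, len(shake) - 1, n) for n in (0, 5, 10)]
```

With `extra=0` the input is returned unchanged, which corresponds to the first column of each table.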
Audio Sample #4 (Synthesized sounds with image stretching-based duration control) (Drum)
Audio Sample #5 (Synthesized sounds with visual onomatopoeia & sound event image)
Bonus Audio Samples
Bonus #1: Synthesized sound with 100 word repetitions (Trashbox)
Word Repetition (times) | 100 |
Mel-spec. | |
Sound |
Bonus #2: Visual Onomatopoeia with variable image stretching (Whistle)
Stretch Ratio | 12 / 24 / 36 / 48 / 60 |
Mel-spec. | |
Sound | |
Input Image | (/beepbeepbeepbeepbeep/) |