Paper: arXiv preprint
Implementation: GitHub
Authors
Hien Ohnaka (1, 2),
Shinnosuke Takamichi (2),
Keisuke Imoto (3),
Yuki Okamoto (4),
Kazuki Fujii (2),
Hiroshi Saruwatari (2)
(1: National Institute of Technology, Tokuyama College, Japan.
2: The University of Tokyo, Japan.
3: Doshisha University, Japan.
4: Ritsumeikan University, Japan.)
Audio Sample #1 (Synthesized sounds from different inputs)
(“Reconstructed” denotes sound reconstructed from the ground-truth mel-spectrograms by a neural vocoder.)
| Environmental Sound | |||
| Trashbox | |||
| Cup | |||
| Whistle | |||
| Tear | |||
| Clock |
Audio Sample #2 (Synthesized sounds with word-level augmentation)
Audio Sample #2-Onomatopoeia (with word-level augmentation) (Trashbox)
| Word Repetition (times) | 1 | 3 | 5 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Text | (/dong/) | (/dongdongdong/) | (/dongdongdongdongdong/) |
Audio Sample #2-Onomatopoeia (without word-level augmentation) (Trashbox)
| Word Repetition (times) | 1 | 3 | 5 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Text | (/dong/) | (/dongdongdong/) | (/dongdongdongdongdong/) |
Audio Sample #2-Visual Onomatopoeia (proposed) (with word-level augmentation) (Whistle)
| Word Repetition (times) | 1 | 3 | 5 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Image | ![]() (/beep/) | ![]() (/beepbeepbeep/) | ![]() (/beepbeepbeepbeepbeep/) |
Audio Sample #2-Visual Onomatopoeia (proposed) (without word-level augmentation) (Whistle)
| Word Repetition (times) | 1 | 3 | 5 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Image | ![]() (/beep/) | ![]() (/beepbeepbeep/) | ![]() (/beepbeepbeepbeepbeep/) |
Audio Sample #3 (Synthesized sounds with character-level augmentation)
Audio Sample #3-Onomatopoeia (with character-level augmentation) (Clock)
| Character Repetition (times) | 0 | 5 | 10 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Text | (/ding/) | (/diiiiiing/) | (/diiiiiiiiiiing/) |
Audio Sample #3-Onomatopoeia (without character-level augmentation) (Clock)
| Character Repetition (times) | 0 | 5 | 10 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Text | (/ding/) | (/diiiiiing/) | (/diiiiiiiiiiing/) |
Audio Sample #3-Visual Onomatopoeia (proposed) (with character-level augmentation) (Maracas)
| Character Repetition (times) | 0 | 5 | 10 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Image | ![]() (/shakeshakeshake/) | ![]() (/shakeshakeshakeeeeee/) | ![]() (/shakeshakeshakeeeeeeeeeee/) |
Audio Sample #3-Visual Onomatopoeia (proposed) (without character-level augmentation) (Maracas)
| Character Repetition (times) | 0 | 5 | 10 |
| Mel-spec. | ![]() | ![]() | ![]() |
| Sound | | | |
| Input Image | ![]() (/shakeshakeshake/) | ![]() (/shakeshakeshakeeeeee/) | ![]() (/shakeshakeshakeeeeeeeeeee/) |
Audio Sample #4 (Synthesized sounds with image stretching-based duration control) (Drum)
Audio Sample #5 (Synthesized sounds with visual onomatopoeia & sound event image)
Bonus Audio Sample
Bonus #1: Synthesized sound with 100 word repetitions (Trashbox)
| Word Repetition (times) | 100 |
| Mel-spec. | ![]() |
| Sound |
Bonus #2: Visual Onomatopoeia with variable image stretching (Whistle)
| Stretch Ratio | 12 / 24 / 36 / 48 / 60 |
| Mel-spec. | ![]() |
| Sound | |
| Input Image | ![]() (/beepbeepbeepbeepbeep/) |