Generating realistic sound for video (video-to-audio, or V2A) is a major challenge, especially for Foley, the art of creating sound effects synchronized with on-screen actions. Current benchmarks for this task are flawed: they are often contaminated with speech, music, and off-screen sounds, making it impossible to truly evaluate a model's ability to generate accurate, synchronized Foley. This mismatch between benchmark content and the Foley task leads to misleading conclusions about model performance.
We introduce FoleyBench, the first large-scale benchmark designed for evaluating Foley-style V2A generation.
Our experiments show that widely used benchmarks like VGGSound are unsuitable for Foley evaluation. Below are randomly selected videos from FoleyBench and VGGSound. The VGGSound examples often contain speech, music, or lack a clearly visible on-screen sound source.
Note: The VGGSound dataset is also degrading over time as YouTube videos are removed.
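To make the contamination check concrete, below is a minimal sketch that flags clips whose audio likely contains speech or music using an off-the-shelf AudioSet tagger (PANNs via the `panns_inference` package). The package choice, the 0.3 probability threshold, and the two-label rejection rule are illustrative assumptions, not the actual pipeline used to build FoleyBench.

```python
# Sketch: reject clips whose audio likely contains speech or music.
# Assumption: PANNs AudioSet tagger via the panns_inference package;
# threshold and label set chosen for illustration only.
import librosa
from panns_inference import AudioTagging, labels

REJECT_LABELS = {"Speech", "Music"}  # AudioSet class names to filter out
THRESHOLD = 0.3                      # assumed probability cutoff

# Downloads the default CNN14 checkpoint on first use.
tagger = AudioTagging(checkpoint_path=None, device="cpu")

def is_contaminated(wav_path: str) -> bool:
    """Return True if the clip is likely to contain speech or music."""
    audio, _ = librosa.load(wav_path, sr=32000, mono=True)  # PANNs expects 32 kHz
    clipwise_probs, _ = tagger.inference(audio[None, :])    # shape: (1, 527)
    probs = dict(zip(labels, clipwise_probs[0]))
    return any(probs[name] > THRESHOLD for name in REJECT_LABELS)

if __name__ == "__main__":
    print(is_contaminated("example_clip.wav"))  # hypothetical file path
```

A check like this only covers the audio side; verifying that the sound source is actually visible on screen requires a separate visual filtering step.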
We benchmarked state-of-the-art V2A models on FoleyBench. Our results provide a clearer picture of current model capabilities.
| Method | IB ↑ | CLAP ↑ | DS ↓ | FAD ↓ | IS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|
| V-AURA | 0.237 | -- | <u>0.716</u> | 27.2 | 6.44 | 3.46 |
| DiffFoley | 0.173 | -- | 0.88 | 31.9 | 9.26 | 3.88 |
| Seeing & Hearing | **0.371** | -- | 1.08 | 25.0 | 4.80 | 3.30 |
| MaskVAT | 0.239 | -- | 0.748 | 19.7 | 6.94 | 3.22 |
| V2A-Mapper | 0.189 | -- | 1.09 | 16.2 | 8.87 | 3.50 |
| SpecMaskFoley<sup>T</sup> | 0.229 | 0.191 | 0.801 | 19.2 | 5.86 | 3.08 |
| VTA-LDM<sup>T</sup> | 0.221 | 0.138 | 1.21 | 15.7 | 7.27 | 3.13 |
| FoleyCrafter<sup>T</sup> | 0.255 | 0.261 | 1.15 | 16.5 | <u>9.50</u> | 2.68 |
| LOVA<sup>T</sup> | 0.209 | 0.167 | 1.15 | 20.7 | 7.61 | 3.15 |
| CAFA<sup>T</sup> | 0.198 | <u>0.270</u> | 0.825 | <u>15.5</u> | 7.41 | <u>2.54</u> |
| MMAudio<sup>T</sup> | <u>0.306</u> | **0.331** | **0.447** | **8.76** | **11.2** | **2.43** |
<sup>T</sup> indicates text-conditioned models. ↑ higher is better, ↓ lower is better. **Bold** indicates best performance, <u>underlined</u> indicates second best.
FAD, IS, and KLD are calculated on PANN embeddings. We used the AV-benchmark toolkit for evaluation.
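For reference, the sketch below shows the Fréchet distance computation behind FAD, assuming the PANN embeddings of the reference and generated audio have already been extracted into NumPy arrays (the extraction step, array names, and toy dimensions here are assumptions).

```python
# Sketch: Fréchet Audio Distance between two sets of audio embeddings
# (e.g., 2048-d PANN/CNN14 embeddings of reference vs. generated clips).
import numpy as np
from scipy import linalg

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: arrays of shape (num_clips, embedding_dim)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; drop the tiny imaginary
    # components that numerical error can introduce.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real

    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy example with random embeddings standing in for real PANN features.
rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 128))
gen = rng.normal(loc=0.1, size=(500, 128))
print(frechet_distance(ref, gen))
```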
To evaluate long-form audio generation, we introduce FoleyBench-Long, a challenging subset of 650 videos, each 30 seconds long.
| Method | IB ↑ | CLAP ↑ | DS ↓ | FAD ↓ | IS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|
| LOVA<sup>T</sup> | 0.237 | 0.102 | 1.20 | 26.2 | 5.02 | 2.44 |
| VTA-LDM<sup>T</sup> | 0.147 | 0.091 | 1.22 | 83.2 | 1.27 | 2.19 |
| MMAudio<sup>T</sup> | 0.239 | 0.174 | 0.638 | 27.5 | 3.87 | 2.40 |
Please email satvikdixit7@gmail.com to get your model included on our page.
Our dataset is under the CC-BY-NC-SA-4.0 license. It is intended for academic research purposes only. Commercial use is strictly prohibited. If there is any infringement, please contact satvikdixit7@gmail.com.