FoleyBench: A Benchmark For Video-to-Audio Models

Summary

Generating realistic sound for video (video-to-audio, or V2A) is a major challenge, especially for Foley—the art of creating sound effects synchronized with on-screen actions. Current benchmarks for this task are flawed: they are often contaminated with speech, music, and off-screen sounds, making it impossible to truly evaluate a model's ability to generate accurate, synchronized Foley. This mismatch between benchmark content and the Foley task leads to misleading conclusions about model performance.

We introduce FoleyBench, the first large-scale benchmark designed for evaluating Foley-style V2A generation.

  • High-Quality, Foley-Focused: Contains 5,000 video-audio-text triplets meticulously curated for Foley. All content is non-speech/non-music, with strong causal links between visible actions and their sounds.
  • Diverse Category Coverage: Ensures a broad and balanced distribution across the Universal Category System (UCS), addressing a key limitation of previous datasets.
  • Rich Metadata for Deep Analysis: Each clip is annotated with source complexity (single vs. multi-source) and sound type (discrete vs. continuous), enabling fine-grained analysis of model strengths and weaknesses.
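As an illustration of how this metadata supports fine-grained analysis, the sketch below filters clips along the two annotated axes. The record layout and field names (e.g. `source_complexity`) are hypothetical assumptions for this example, not FoleyBench's published schema:

```python
# Hypothetical metadata records; the field names below are illustrative
# assumptions, not FoleyBench's actual schema.
clips = [
    {"id": "clip_0001", "ucs_category": "FOOTSTEPS",
     "source_complexity": "single", "sound_type": "discrete"},
    {"id": "clip_0002", "ucs_category": "WATER",
     "source_complexity": "multi", "sound_type": "continuous"},
    {"id": "clip_0003", "ucs_category": "DOORS",
     "source_complexity": "single", "sound_type": "continuous"},
]

def select(clips, complexity=None, sound_type=None):
    """Filter clips along the two annotated axes (None = no constraint)."""
    return [c for c in clips
            if (complexity is None or c["source_complexity"] == complexity)
            and (sound_type is None or c["sound_type"] == sound_type)]

# e.g. evaluate a model only on the easiest stratum:
single_discrete = select(clips, complexity="single", sound_type="discrete")
```

Slicing results this way separates, for example, failures on overlapping sound sources from failures on sustained textures.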

Comparing FoleyBench to VGGSound

Our experiments show that widely used benchmarks like VGGSound are unsuitable for Foley evaluation. Below are randomly selected videos from both datasets. The VGGSound examples often contain speech, music, or lack a clear, visible sound source.


Randomly selected videos from FoleyBench


Randomly selected videos from VGGSound


Note: The VGGSound dataset is also degrading over time as YouTube videos are removed.

Results on FoleyBench

We benchmarked state-of-the-art V2A models on FoleyBench. Our results provide a clearer picture of current model capabilities.

Method             IB ↑    CLAP ↑   DS ↓    FAD ↓   IS ↑    KLD ↓
V-AURA             0.237   --       0.716   27.2    6.44    3.46
Diff-Foley         0.173   --       0.88    31.9    9.26    3.88
Seeing & Hearing   0.371   --       1.08    25.0    4.80    3.30
MaskVAT            0.239   --       0.748   19.7    6.94    3.22
V2A-Mapper         0.189   --       1.09    16.2    8.87    3.50
SpecMaskFoley (T)  0.229   0.191    0.801   19.2    5.86    3.08
VTA-LDM (T)        0.221   0.138    1.21    15.7    7.27    3.13
FoleyCrafter (T)   0.255   0.261    1.15    16.5    9.50    2.68
LOVA (T)           0.209   0.167    1.15    20.7    7.61    3.15
CAFA (T)           0.198   0.270    0.825   15.5    7.41    2.54
MMAudio (T)        0.306   0.331    0.447   8.76    11.2    2.43

(T) indicates text-conditioned models. ↑ higher is better, ↓ lower is better. MMAudio achieves the best score on every metric except IB, where Seeing & Hearing leads.
FAD, IS, and KLD are computed on PANN embeddings. We used the AV-benchmark toolkit for evaluation.
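For reference on what the FAD column measures: Fréchet Audio Distance fits a Gaussian to the reference and generated embedding sets and takes the Fréchet distance between the two Gaussians. Below is a minimal sketch assuming `(N, D)` NumPy arrays of per-clip PANN embeddings; it is not the AV-benchmark implementation:

```python
import numpy as np

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two embedding sets.

    emb_real, emb_gen: (N, D) arrays of per-clip audio embeddings
    (e.g. PANN embeddings of reference vs. generated audio). Returns
    ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}).
    """
    mu_r, mu_g = emb_real.mean(0), emb_gen.mean(0)
    c_r = np.cov(emb_real, rowvar=False)
    c_g = np.cov(emb_gen, rowvar=False)
    diff = mu_r - mu_g
    # Eigenvalues of C_r @ C_g are real and non-negative in exact
    # arithmetic; clip tiny negatives from floating-point error, so
    # Tr((C_r C_g)^{1/2}) = sum of their square roots.
    eigvals = np.linalg.eigvals(c_r @ c_g).real.clip(min=0.0)
    return float(diff @ diff + np.trace(c_r) + np.trace(c_g)
                 - 2.0 * np.sqrt(eigvals).sum())
```

A lower FAD means the generated audio's embedding distribution is closer to the reference distribution; two identical sets score (numerically) zero.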

FoleyBench-Long

To evaluate long-form audio generation, we introduce FoleyBench-Long, a challenging subset of 650 videos, each 30 seconds long.

Method        IB ↑    CLAP ↑   DS ↓    FAD ↓   IS ↑    KLD ↓
LOVA (T)      0.237   0.102    1.20    26.2    5.02    2.44
VTA-LDM (T)   0.147   0.091    1.22    83.2    1.27    2.19
MMAudio (T)   0.239   0.174    0.638   27.5    3.87    2.40

To have your model's results included on this page, please email satvikdixit7@gmail.com.

License

Our dataset is released under the CC-BY-NC-SA-4.0 license and is intended for academic research only; commercial use is strictly prohibited. For any licensing or infringement concerns, please contact satvikdixit7@gmail.com.