Satvik Dixit1, Koichi Saito2, Zhi Zhong3, Yuki Mitsufuji2,3, Chris Donahue1
1Carnegie Mellon University, 2Sony AI, 3Sony Group Corporation
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley: sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet there is a mismatch between how V2A models are evaluated and how they are used downstream, owing to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, these datasets are dominated by speech and music, domains that lie outside the use case for Foley.
To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (silent video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark 11 state-of-the-art V2A models, evaluating them on audio quality, audio–video alignment, temporal synchronization, and audio–text consistency. Our analysis reveals key limitations of existing models and provides actionable guidance for advancing the field.
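For concreteness, here is a minimal sketch of how a FoleyBench-style record (silent video, ground-truth audio, text caption, plus the per-clip metadata described above) could be represented and loaded for evaluation. The manifest filename and field names (`video`, `audio`, `caption`, `ucs_category`, `source_complexity`, `duration_sec`) are illustrative assumptions, not the released format.

```python
# Minimal sketch of loading a FoleyBench-style manifest for evaluation.
# The JSON-lines layout and field names are assumptions for illustration,
# not the schema of the released dataset.
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FoleyBenchClip:
    video_path: Path        # silent input video
    audio_path: Path        # ground-truth audio
    caption: str            # text caption describing the sound
    ucs_category: str       # UCS / AudioSet label
    source_complexity: str  # e.g., single vs. multiple sound sources
    duration_sec: float     # clip length in seconds


def load_manifest(manifest_file: str) -> list[FoleyBenchClip]:
    """Parse a JSON-lines manifest into clip records."""
    clips = []
    with open(manifest_file) as f:
        for line in f:
            row = json.loads(line)
            clips.append(FoleyBenchClip(
                video_path=Path(row["video"]),
                audio_path=Path(row["audio"]),
                caption=row["caption"],
                ucs_category=row["ucs_category"],
                source_complexity=row["source_complexity"],
                duration_sec=row["duration_sec"],
            ))
    return clips


if __name__ == "__main__":
    clips = load_manifest("foleybench_manifest.jsonl")
    print(f"{len(clips)} clips; first category: {clips[0].ucs_category}")
```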
Please email us to request access to the FoleyBench benchmark dataset: satvikdixit7@gmail.com
Evaluation results on the FoleyBench benchmark
| Method | IB ↑ | CLAP ↑ | DS ↓ | FAD ↓ | IS ↑ | KLD ↓ |
|---|---|---|---|---|---|---|
| V-AURA | 0.237 | -- | <u>0.716</u> | 27.2 | 6.44 | 3.46 |
| DiffFoley | 0.173 | -- | 0.88 | 31.9 | 9.26 | 3.88 |
| Seeing & Hearing | **0.371** | -- | 1.08 | 25.0 | 4.80 | 3.30 |
| MaskVAT | 0.239 | -- | 0.748 | 19.7 | 6.94 | 3.22 |
| V2A-Mapper | 0.189 | -- | 1.09 | 16.2 | 8.87 | 3.50 |
| SpecMaskFoley<sup>T</sup> | 0.229 | 0.191 | 0.801 | 19.2 | 5.86 | 3.08 |
| VTA-LDM<sup>T</sup> | 0.221 | 0.138 | 1.21 | 15.7 | 7.27 | 3.13 |
| FoleyCrafter<sup>T</sup> | 0.255 | 0.261 | 1.15 | 16.5 | <u>9.50</u> | 2.68 |
| LOVA<sup>T</sup> | 0.209 | 0.167 | 1.15 | 20.7 | 7.61 | 3.15 |
| CAFA<sup>T</sup> | 0.198 | <u>0.270</u> | 0.825 | <u>15.5</u> | 7.41 | <u>2.54</u> |
| MMAudio<sup>T</sup> | <u>0.306</u> | **0.331** | **0.447** | **8.76** | **11.2** | **2.43** |
<sup>T</sup> indicates text-conditioned models. ↑ higher is better; ↓ lower is better. **Bold** indicates best performance; <u>underlined</u> indicates second best.
IB: ImageBind audio-video alignment score; CLAP: audio-text alignment score; DS: DeSync temporal synchronization error; FAD: Fréchet Audio Distance; IS: Inception Score; KLD: Kullback-Leibler divergence. FAD, IS, and KLD are calculated on PANN embeddings. We used the AV-benchmark toolkit for evaluation.
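To make the audio-text consistency column concrete, the sketch below shows how a CLAP-style score is typically aggregated: the mean cosine similarity between embeddings of each generated audio clip and its paired caption. It assumes embeddings have already been extracted with a CLAP-like audio and text encoder; the random arrays in the example are stand-ins only, and the numbers in the table above come from the AV-benchmark toolkit, not this snippet.

```python
# Hedged sketch of a CLAP-style audio-text consistency score:
# mean cosine similarity over paired (generated audio, caption) embeddings.
import numpy as np


def clap_style_score(audio_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Mean cosine similarity between paired audio and caption embeddings.

    audio_embs, text_embs: arrays of shape (num_clips, emb_dim); row i of
    each array comes from the same (generated audio, caption) pair.
    """
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * t, axis=1)))


if __name__ == "__main__":
    # Random stand-ins for real CLAP embeddings, just to show the call shape.
    rng = np.random.default_rng(0)
    audio_embs = rng.normal(size=(5000, 512))
    text_embs = rng.normal(size=(5000, 512))
    print(f"CLAP-style score: {clap_style_score(audio_embs, text_embs):.3f}")
```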
Below are 10 randomly selected videos each from the FoleyBench benchmark and from the widely used VGGSound test set.
Our experiments show that widely used benchmarks such as VGGSound Test often include content unsuitable for Foley evaluation, which can lead to misleading conclusions about model performance.
10 randomly sampled videos from the FoleyBench dataset
10 randomly sampled videos from the VGGSound test set