Music Arena: Ranking AI Music Models with Your Votes

September 19, 2025
Blog post authors: Yonghyun Kim, Nathan Pruyne, Chris Donahue.
For the full list of Music Arena authors and contributors, see our research paper.

[Music Arena] [Paper] [Code] [New! Dataset]

Introduction

As generative music models become increasingly capable of producing high-fidelity audio, it's time to rethink how we measure their performance to better reflect real-world human preferences.

On July 28th we launched Music Arena, a free & open platform where anyone can compare outputs from state-of-the-art text-to-music (TTM) models. In Music Arena, users enter a text prompt, listen to two anonymous audio clips generated by different models, and vote for their favorite. This "live evaluation" protocol allows us to build a dynamic public leaderboard that ranks models based on crowdsourced human judgments.

We are excited to share our first update on the Music Arena project, including our initial leaderboard built from 1k+ votes, new insights into text-to-music usage and listener preferences, and a comprehensive release of the preference data we have collected thus far.

In this blog post, we'll cover:

  1. Why Music Arena? The unique challenges of evaluating music and how our platform is designed to address them.
  2. Key Features Tailored for Music. Details on our LLM-based routing system and fine-grained preference collection.
  3. Our Commitment to the Community. Our policies on data releases, transparency, and user privacy.
  4. The Public Leaderboard. A look at our live ranking system for TTM models.
  5. Initial Data Insights. Our most surprising findings about user listening behavior.
  6. What’s Next? Our plans for the platform and how you can contribute.

Why Music Arena? The Challenge of Evaluating Music

The quality of music is inherently subjective. While several automatic evaluation metrics have been proposed, they often fail to capture essential qualities like creativity, emotional impact, and overall musicality. Traditional human listening studies are the gold standard, but they suffer from key issues of their own: they are costly to run, small in scale, and rarely standardized, which makes results difficult to compare across studies.

Music Arena is designed to overcome these challenges by creating a centralized, standardized, and scalable evaluation platform that is open to everyone.

Data Lifecycle

Figure 1: Music Arena data lifecycle. The platform consists of a Frontend for user interaction, a Backend that orchestrates generation from various Model Endpoints, and a Database to store results.


Key Features Tailored for Music

Music presents unique challenges compared to other domains like text or images. We've built key features to address this:

1. LLM-based Prompt Routing and Content Moderation

TTM models are a diverse group. Some create instrumental music, while others can generate vocals with lyrics. To handle this, our backend uses an LLM (GPT-4o) to route each prompt to models that can satisfy it (for example, only sending lyric requests to vocal-capable systems) and to moderate prompt content.
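To make the routing idea concrete, here is a minimal sketch of what such an LLM classification step could look like. The model pools, JSON schema, and prompt wording are hypothetical placeholders rather than the actual Music Arena backend, and the snippet assumes the OpenAI Python SDK with an API key configured.

```python
# Hypothetical sketch of LLM-based prompt routing; not the actual Music Arena
# backend. Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY set.
import json
from openai import OpenAI

client = OpenAI()

# Placeholder model pools; real pools depend on each system's capabilities.
VOCAL_MODELS = ["vocal-model-a", "vocal-model-b"]
INSTRUMENTAL_MODELS = ["instrumental-model-a", "instrumental-model-b"]

def route_prompt(prompt: str) -> dict:
    """Ask the LLM whether the prompt requests vocals and whether it is safe."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify a text-to-music prompt. Respond with JSON of the "
                    'form {"wants_vocals": true|false, "safe": true|false}.'
                ),
            },
            {"role": "user", "content": prompt},
        ],
    )
    label = json.loads(response.choices[0].message.content)
    if not label.get("safe", False):
        return {"eligible_models": [], "status": "moderated"}
    pool = VOCAL_MODELS if label.get("wants_vocals") else INSTRUMENTAL_MODELS
    return {"eligible_models": pool, "status": "routed"}

print(route_prompt("an upbeat pop song with vocals about a summer road trip"))
```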

2. Fine-grained Preference Collection

Music is consumed over time. A simple "A is better than B" vote doesn't tell the whole story. Thus, Music Arena collects richer data alongside each vote, including detailed listening traces (when each track is played, for how long, and how often users swap between them) and optional natural-language feedback explaining the preference.

This detailed data provides deeper insights into why users prefer one piece of music over another.
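To illustrate what this fine-grained data can look like in practice, here is a minimal sketch of a per-battle record. Field names and structure are illustrative only; the actual schema is documented with our dataset release.

```python
# Illustrative schema for a fine-grained preference record; field names are
# hypothetical, not the released dataset format.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ListenEvent:
    track: str         # "A" or "B"
    start_s: float     # playback position when play was pressed, in seconds
    duration_s: float  # how long the user listened before pausing or swapping

@dataclass
class BattleRecord:
    prompt: str
    model_a: str
    model_b: str
    listen_events: list[ListenEvent] = field(default_factory=list)
    vote: Optional[str] = None      # "A", "B", "tie", or "both_bad"
    feedback: Optional[str] = None  # optional natural-language comment

    @property
    def swap_count(self) -> int:
        """Number of times the user switched between tracks while listening."""
        tracks = [e.track for e in self.listen_events]
        return sum(prev != cur for prev, cur in zip(tracks, tracks[1:]))
```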

Music Arena UI

Figure 2: An example of the Music Arena user interface. After voting, users can see model information and generation stats. They can download the preferred music and optionally submit the reason for their preference and general feedback on the platform.


Our Commitment to the Community

Music Arena is committed to regular, open releases of the preference data we collect (see the dataset link above), transparency about how the platform and leaderboard operate, and the protection of user privacy.

The Public Leaderboard

As of September 18, 2025, the Music Arena leaderboard is live! It ranks TTM models based on ongoing user votes, and the rankings are updated on a regular basis with each new data release. The leaderboard provides transparent insights into model performance, including their Arena Score, confidence intervals, total votes, and generation speed (Real-Time Factor, RTF).

We currently offer separate leaderboards for instrumental and vocal models, allowing for fair comparison within specific music generation categories. This live ranking serves as a crucial resource for researchers, developers, and music enthusiasts to track the progress of TTM technology.
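Arena-style leaderboards are typically derived by fitting a Bradley-Terry-style rating model to the pairwise votes; the exact procedure Music Arena uses, including how confidence intervals are computed, is described in our paper. As a rough illustration of the general idea only, here is a minimal sketch of a Bradley-Terry fit with an arbitrary Elo-like rescaling (model names and the scale mapping are placeholders):

```python
# Illustrative Bradley-Terry fit over pairwise votes; not the official Arena
# Score computation (see the Music Arena paper for the actual procedure).
import math
from collections import defaultdict

def bradley_terry(battles, iters=200):
    """battles: list of (winner, loser) model-name pairs. Returns ratings."""
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):  # standard minorization-maximization updates
        updated = {}
        for m in models:
            denom = sum(
                pair_counts[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models if o != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        mean = sum(updated.values()) / len(updated)
        strength = {m: s / mean for m, s in updated.items()}  # keep scale fixed

    # Map strengths onto an arbitrary Elo-like scale centered at 1000.
    return {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}

votes = [("model_x", "model_y"), ("model_x", "model_z"),
         ("model_y", "model_z"), ("model_z", "model_y")]
print(sorted(bradley_terry(votes).items(), key=lambda kv: -kv[1]))
```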

Below are the first official leaderboards from our initial data release, based on votes collected from July 28 to August 31, 2025. For the latest leaderboard, see the Music Arena homepage.

🎹 Instrumental Models

| Rank | Model | Arena Score | 95% CI | # Votes | Generation Speed (RTF) | Organization | License | Training Data | Supports Lyrics | Access |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | riffusion-fuzz-1-1 | 1250.8 | +52.0 / -45.5 | 252 | 6.01 | Riffusion | Closed | Unspecified | True | Proprietary |
| 2 | magenta-rt-large | 1113.6 | +56.5 / -57.2 | 276 | 1.01 | Google DeepMind | Apache 2.0 | Stock | False | Open weights |
| 3 | musicgen-small | 928.5 | +40.4 / -46.7 | 278 | 0.86 | Meta | CC-BY-NC 4.0 | Stock | False | Open weights |
| 4 | sao | 924.7 | +45.7 / -41.5 | 286 | 2.63 | Stability AI | STAI Community | Open | False | Open weights |
| 5 | sao-small | 782.4 | +50.9 / -62.2 | 292 | 12.79 | Stability AI | STAI Community | Open | False | Open weights |

🎤 Vocal Models

| Rank | Model | Arena Score | 95% CI | # Votes | Generation Speed (RTF) | Organization | License | Training Data | Supports Lyrics | Access |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | riffusion-fuzz-1-0 | 1172.5 | +99.1 / -62.7 | 144 | 5.6 | Riffusion | Closed | Unspecified | True | Proprietary |
| 2 | riffusion-fuzz-1-1 | 1087.3 | +40.8 / -47.2 | 218 | 5.25 | Riffusion | Closed | Unspecified | True | Proprietary |
| 3 | preview-ocelot | 1045.7 | +75.9 / -82.9 | 90 | 5.42 | Hidden | Closed | Unspecified | True | Proprietary |
| 4 | preview-jerboa | 1034.4 | +92.6 / -80.8 | 88 | 5.61 | Hidden | Closed | Unspecified | True | Proprietary |
| 5 | acestep | 660.1 | +75.5 / -121.3 | 178 | 2.89 | ACE Studio | Apache 2.0 | Unspecified | True | Open weights |

Leaderboard Plot

Figure 3: Music Arena Leaderboard Plots (July 28 - Aug 31, 2025). The plots for Instrumental (left) and Vocal (right) models are shown side-by-side. Each model is plotted by its Arena Score and Generation Speed (RTF), with colors indicating training data and shapes indicating access type (Open weights vs. Proprietary).


Initial Data Insights

From our initial data collection period (July 28 - Aug 31, 2025), we acquired 1,420 user-initiated battles, 1,051 of which resulted in a valid vote. This data powers our leaderboard and reveals new insights into user behavior. Here's what we've learned so far.

1. How do users write prompts on the platform?

About 27% of user battles start with a "🎲 Random Prompt", retrieving a prebaked prompt, while the majority of users write their own creative prompts from scratch. We find that users who write their own prompts are more likely to complete the vote (77% conversion rate) compared to those who use a prebaked prompt (65% conversion rate).

| Prompt Type | Total Battles | Voted Battles | Vote Ratio [%] |
|---|---|---|---|
| User-Written | 1040 | 804 | 77.31 |
| 🎲 Random Prompt | 380 | 247 | 65.00 |
| Total User Battles | 1420 | 1051 | 74.01 |

Note: Subsequent analyses in this report focus on the dataset of 1,051 voted battles, unless otherwise specified.

2. Listening Behavior is Sensitive to Track Order

The next thing we noticed was a stark difference in listening times: users listened to the left track (Track A) far longer than the right track (Track B). Raw data showed high maximum listening times, suggesting some users leave tabs open. To focus on typical behavior, we removed 196 outliers using the Interquartile Range (IQR) method. The statistics for the remaining data, shown below, paint a clearer picture.

| Metric | Track A [s] | Track B [s] |
|---|---|---|
| Average | 25.98 | 11.57 |
| Median | 20.66 | 7.08 |
| Min | 4.03 | 4.02 |
| Max (Post-IQR) | 88.15 | 46.76 |

Listening Data

Figure 4: Listening time distributions for Track A (left) and Track B (right). After removing outliers, the histograms show that users tend to listen to Track A longer than B (median 20.7s vs. 7.1s) before making a preference decision.
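For readers who want to reproduce this kind of filtering on the released data, the IQR rule used throughout this section is straightforward. Here is a minimal sketch with made-up listening times; adapt the input to the dataset's actual fields:

```python
# Minimal sketch of IQR-based outlier filtering; the sample values are made up.
import numpy as np

def iqr_filter(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return values[(values >= lo) & (values <= hi)]

listening_times = [12.5, 20.7, 7.1, 88.2, 1800.0, 34.9]  # 1800 s: a tab left open
print(iqr_filter(listening_times))  # the 1800 s outlier is removed
```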

Why is Track A listened to so much longer? To investigate, we analyzed the very first "play" action users take in a battle. Across the 1,051 valid battles, there is an overwhelming (but not unexpected) bias: users play Track A (the left track) first in over 95% of battles:

| First-Played Track | Count | Probability [%] |
|---|---|---|
| Track A | 1004 | 95.53 |
| Track B | 47 | 4.47 |

This led us to a new hypothesis: users will listen longer to whichever track they play first, regardless of its position.

We re-categorized all listening data into "First-Played Track" and "Second-Played Track" and ran the analysis again. After removing IQR outliers (203 battles), the findings corroborate this hypothesis: the gap in listening time is even sharper under this framing.

| Metric | First-Played [s] | Second-Played [s] |
|---|---|---|
| Average | 26.39 | 10.80 |
| Median | 21.01 | 6.62 |
| Min | 4.03 | 4.02 |
| Max (Post-IQR) | 88.15 | 42.60 |

The data clearly shows that users dedicate significantly more time to the track they listen to first—more than double on average (26.4s vs. 10.8s).

Listening Data 2

Figure 5: Listening time distributions for the First-Played and Second-Played tracks. After removing IQR outliers, the data clearly shows that the first track a user engages with receives substantially more listening time.

In pairwise music preference decisions, we observe that users often listen extensively to the first track, but only briefly to the second before making a decision. A plausible explanation for this is that the first track acts as a reference point. Evaluating this initial track appears to require an open-ended assessment of its quality, a task that can demand significant time and mental effort. Once this reference point is established, the evaluation of the second track seems to simplify to a more direct, relative comparison: "Is this track better or worse than the one I just heard?" This comparative judgment likely requires less cognitive effort and therefore less listening time, which may account for the discrepancy observed in our data.

3. Dissecting User Behavior: Engagement vs. Preference

A core question in our analysis is what listening time represents as a behavioral metric. To investigate whether extended listening time might correlate with a positive preference, we analyzed the win rate of the first-played track, which consistently receives more listening time.

| Outcome for First-Played Track | Count | Probability [%] |
|---|---|---|
| Win | 356 | 33.87 |
| Loss | 437 | 41.58 |
| Tie / Both Bad | 258 | 24.55 |

The data indicates a negative relationship between being the longer-played track and the final outcome. When excluding ties, the first-played track wins 44.9% of the time. This result suggests that the time a user spends on a track should be interpreted as a measure of engagement, which is distinct from their final preference.

This distinction is supported by other observations. For instance, users spend nearly twice as long evaluating vocal tracks as they do instrumental tracks. Furthermore, a direct correlation analysis between listening time and win rate across the entire dataset showed no significant trend. These points lead to a functional definition of our user behavior metrics: Listening Time serves as a proxy for user engagement, while the final vote, or Win Rate, is the direct measure of model performance.

While creating music that holds user attention is an important goal, our data suggests that listening time in a battle context is a complex signal, not just a measure of positive interest.

Model Listening Time

Figure 6: Listening Time Distribution by Model Category. The distributions show that users on average spend more time listening to models that generate vocal tracks compared to those that generate instrumental tracks.

4. Do Closer Battles Lead to Deeper Engagement?

To see if users work harder on more difficult comparisons, we analyzed whether engagement increases when two models are closely matched. We measured two key engagement metrics—total listening time and the number of swaps—against the "closeness" of a battle, defined by the absolute difference between the models' Arena Scores.

After removing outliers (118 battles) using the IQR method, our analysis of listening time reveals a statistically significant, albeit very weak, negative correlation (Spearman's rho = -0.082, p-value = 0.012). This suggests that as the performance gap between two models becomes larger (i.e., one model may be clearly better than the other), users tend to spend slightly less time listening before casting their vote.
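For those reproducing this analysis on the released data, the correlation itself is a one-liner with SciPy. The column names below are hypothetical and the values are made up; adapt them to the released dataset's schema:

```python
# Minimal sketch of the correlation analysis (hypothetical column names and
# made-up values; adapt to the released dataset's schema).
import pandas as pd
from scipy.stats import spearmanr

battles = pd.DataFrame({
    "score_diff": [12.0, 340.5, 88.2, 5.1, 210.0],     # |Arena Score A - B|
    "total_listen_s": [64.2, 18.9, 40.1, 71.5, 25.3],  # total listening time
})

rho, p_value = spearmanr(battles["score_diff"], battles["total_listen_s"])
print(f"Spearman's rho = {rho:.3f}, p-value = {p_value:.3f}")
```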

Correlation

Figure 7: The correlation between Arena Score difference and total listening time. Each dot represents a single battle after outliers were removed using the IQR method. While there is a high variance in listening times, the regression line (red dashes) shows a slight negative trend.

We observed a similar trend with the number of swaps. The correlation is also slightly negative (Spearman's rho = -0.097, p-value = 0.002), indicating that users switch back and forth slightly more often when models are closely matched, which demands more effort to reach a decision.

Correlation

Figure 8: The correlation between Arena Score difference and the number of swaps. Each dot represents a single battle. While there is a high variance in number of swaps, the regression line (blue dashes) shows a slight negative trend.

This suggests users adapt their listening strategy for harder decisions. While a single swap is the norm, closer battles prompt a more careful, back-and-forth comparison. The table below shows the general distribution of swap counts across all voted-on battles:

| Swap(s) | Probability [%] |
|---|---|
| 1 | 62.99 |
| 2 | 23.60 |
| 3 | 7.99 |
| 4+ | 5.42 |

While not a strong effect, these findings support the hypothesis that our fine-grained listening data successfully captures a signal related to the perceived difficulty of the preference decision.

5. User engagement on Music Arena

Our analysis shows a long-tail distribution of user engagement. A large number of users try the platform by casting just a few votes, while a dedicated group of "power users" contribute a significant number of votes. This mix of casual and dedicated users provides a broad and deep set of preference data.

The table below shows the distribution of votes submitted per user, based on 1,051 valid votes from 373 unique users. For example, 193 users have submitted exactly one vote, while a small number of highly engaged users are responsible for dozens of submissions; the most prolific "power user" has submitted 49 votes.

| Number of Votes | Number of Users |
|---|---|
| 1 | 193 |
| 2 | 72 |
| 3 | 44 |
| 4 | 24 |
| 5 | 8 |
| 6-10 | 18 |
| 11-20 | 10 |
| 21-50 | 4 |

6. How descriptive are user prompts?

Prompt Lengths

Figure 9: Distribution of User Prompt Lengths (from Voted Battles). The histogram shows the distribution after outliers were removed using the IQR method, revealing that the vast majority of user prompts are under 33 words long.

Analyzing the 804 user-written prompts from valid, voted-on battles reveals a clear trend: the vast majority are concise and to the point. The raw data shows a median prompt length of just 7 words, but the average is skewed higher by a long tail of very descriptive prompts, with a maximum length of 1000 words.

| Metric (Raw Data) | Prompt Length [words] |
|---|---|
| count | 804 |
| mean | 18.68 |
| std | 54.82 |
| min | 1 |
| 50% (median) | 7 |
| max | 1000 |

To get a more accurate picture of typical behavior, we removed extreme outliers (82 battles) using the IQR method, which set a threshold at 33 words. After filtering, the statistics for the remaining 722 prompts show a more focused distribution, with a median length of 6 words.

| Metric | Prompt Length [words] |
|---|---|
| count | 722 |
| mean | 8.27 |
| std | 6.87 |
| min | 1 |
| 50% (median) | 6 |
| max (Post-IQR) | 33 |

This confirms that while some users provide detailed instructions, the typical Music Arena user prefers to express their creative ideas in just a few words.

7. What kind of music do users create?

By analyzing the 804 user-written prompts from valid, voted battles, we can see what our users are creating. The results show a mix of genres, instruments, and moods.

| Keyword | Frequency |
|---|---|
| bass | 101 |
| pop | 98 |
| vocals | 81 |
| piano | 70 |
| rock | 69 |
| dark | 66 |
| melodic | 66 |
| chorus | 65 |

Requests for specific instruments are very common, with "bass" (101), "vocals" (81), and "piano" (70) appearing frequently. Popular genres include "pop" (98) and "rock" (69). Users also provide detailed creative direction, using moods like "dark" (66) and musical descriptors such as "melodic" (66) and "chorus" (65). This analysis gives us a direct window into the creative intentions of our users.
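A simple way to reproduce this kind of keyword count over the released prompts is sketched below. The tokenization, stop-word list, and example prompts are simplified placeholders rather than our exact analysis pipeline:

```python
# Minimal sketch of keyword counting over user prompts (simplified
# tokenization and stop-word handling; the prompts below are made up).
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "with", "and", "of", "in", "song", "music"}

def keyword_counts(prompts):
    counter = Counter()
    for prompt in prompts:
        tokens = re.findall(r"[a-z]+", prompt.lower())
        counter.update(t for t in tokens if t not in STOP_WORDS)
    return counter

example_prompts = [
    "dark melodic pop with heavy bass and vocals",
    "upbeat rock song with a piano chorus",
]
print(keyword_counts(example_prompts).most_common(5))
```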

Frequent Keywords

Figure 10: A word cloud of the most frequent keywords in user prompts from voted battles.

8. How do users react to model generations?

Alongside battle data, we also offer users the opportunity to provide written feedback on the generated music. With this feedback, we hope to gain a better understanding of the models' performance, with the longer-term goal of tailoring our evaluation platform to better assess these aspects.

We received 147 pieces of written feedback on model generations, which we further analyzed using an LLM (Gemini 2.5 Flash). We ask the LLM to label each piece of feedback with a sentiment, either positive or negative, as well as a category: Generation Quality, Prompt Adherence, or Miscellaneous (i.e., non-descriptive) Feedback. These categories were first identified by us, then confirmed by the LLM to be the most salient categorization. Examples of each type of feedback are shown below, followed by a sketch of how such a labeling step can be set up:

| Category | Positive Example | Negative Example |
|---|---|---|
| Generation Quality | "good voice quality" | "It has sharp sound that is quite ear-piercing" |
| Prompt Adherence | "Good lyrics and followed prompt well" | "it had nothing to do with what i asked for" |
| Miscellaneous Feedback | "The lines were pretty fire and the beat was good too" | "This isn't even music" |
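As a rough illustration, the snippet below sketches how such an LLM labeling step can be set up. It assumes the google-genai Python SDK with a GEMINI_API_KEY configured, and the prompt wording and response parsing are simplified placeholders rather than our exact pipeline:

```python
# Illustrative sketch of LLM-based feedback labeling; assumes the google-genai
# SDK with GEMINI_API_KEY set. Prompt wording and parsing are simplified.
import json
from google import genai

client = genai.Client()

CATEGORIES = ["Generation Quality", "Prompt Adherence", "Miscellaneous Feedback"]

def label_feedback(feedback: str) -> dict:
    instruction = (
        "Label this user feedback about AI-generated music. "
        "Reply with JSON containing two keys: "
        '"sentiment" (either "positive" or "negative") and '
        f'"category" (one of {CATEGORIES}).\n\nFeedback: {feedback}'
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=instruction
    )
    raw = response.text.strip()
    # Some models wrap JSON in markdown fences; keep only the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])

print(label_feedback("Good lyrics and followed prompt well"))
```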

We find that a majority of feedback (exactly 2 out of every 3) is negative, indicating that users are more likely to give feedback when the models do not behave as they wish. We also find that users are more likely to provide positive feedback about generation quality (42.6% positive comments) and less likely about prompt adherence (25.8% positive comments).

| Type | Total | Positive [%] |
|---|---|---|
| Prompt Adherence | 62 | 25.8 |
| Miscellaneous Feedback | 24 | 29.2 |
| Generation Quality | 61 | 42.6 |

Finally, we also group feedback by model and find that the overall rate of positive feedback correlates with a model's win rate (Spearman's rho = 0.895, p = 0.001). Positive feedback rates for both prompt adherence (rho = 0.586, p = 0.097) and generation quality (rho = 0.726, p = 0.027) also increase with model win rate, and generation quality in particular shows a noticeable jump in positive feedback once a model's win rate exceeds 50%.

Grouping each model's feedback by category also yields interesting discoveries. For instance, all feedback relating to prompt adherence for the Magenta RealTime model is negative, while all feedback relating to its generation quality is positive, indicating that this system produces high-quality audio but fails to meet users' prompting expectations.

Positive feedback rates by model

Figure 11: Win rate vs. positive feedback percentage for all models, overall and separated by feedback type.


What’s Next?

The launch of our public leaderboard is just the beginning. We are continuously working to improve Music Arena and deepen our understanding of human preferences for generative music, including regular leaderboard updates and ongoing releases of the preference data we collect.

We hope that Music Arena will bring new clarity to TTM evaluation and provide a foundational resource for building the next generation of music models that are better aligned with human creative values.

Cite this work:

@misc{kim2025musicarena,
    title={Music Arena: Live Evaluation for Text-to-Music},
    author={Yonghyun Kim and Wayne Chi and Anastasios Angelopoulos and Wei-Lin Chiang and Koichi Saito and Shinji Watanabe and Yuki Mitsufuji and Chris Donahue},
    journal={arXiv:2507.20900},
    year={2025}
}

Acknowledgements and Disclosure of Funding
Music Arena is supported by funding from Sony AI, with informal and pro-bono assistance provided by LMArena. We extend our sincere thanks to our commercial contacts at Riffusion, Stability AI, Google DeepMind, and Suno for productive discussions that informed the key features and policies of Music Arena. Music Arena is approved by CMU's Institutional Review Board under Protocol STUDY2024_00000489.
