Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level questions. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio–language models using both automatic metrics and LLM-as-a-Judge evaluation.
Jamendo-MT-QA is the first large-scale benchmark specifically designed for multi-track comparative music reasoning.
An LLM-assisted four-stage pipeline for generating high-quality comparative QA pairs.
Generate rich captions for each track using Music Flamingo, capturing genre, tempo, instrumentation, mood, and production details.
Produce per-track question–answer pairs from Jamendo-QA as building blocks for comparative questions.
GPT-5-mini generates three comparative questions (yes/no, short-answer, sentence) per track pair with step-by-step reasoning.
Human evaluation and LLM-as-a-Judge scoring retain only items with perfect semantic scores (5/5/5), keeping 92.9% of the original pairs.
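The final filtering stage can be sketched as a strict threshold on judge scores. This is a minimal illustration, not the authors' pipeline code: the `JudgedItem` record, its field names, and the choice to filter per item (rather than per track pair) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record: one generated QA item with three judge scores (1 to 5 each).
@dataclass
class JudgedItem:
    pair_id: str
    question_type: str            # "yes_no", "short_answer", or "sentence"
    scores: Tuple[int, int, int]  # assumed layout of the 5/5/5 semantic scores

PERFECT = (5, 5, 5)

def keep_perfect(items: List[JudgedItem]) -> List[JudgedItem]:
    """Retain only items whose three scores are all 5."""
    return [item for item in items if item.scores == PERFECT]

def retention_rate(kept: List[JudgedItem], original: List[JudgedItem]) -> float:
    """Fraction of the original items that survive the filter."""
    return len(kept) / len(original) if original else 0.0

candidates = [
    JudgedItem("pair_0001", "yes_no", (5, 5, 5)),
    JudgedItem("pair_0001", "short_answer", (5, 4, 5)),
    JudgedItem("pair_0002", "sentence", (5, 5, 5)),
    JudgedItem("pair_0003", "yes_no", (3, 5, 5)),
]
kept = keep_perfect(candidates)
rate = retention_rate(kept, candidates)
```

A strict all-or-nothing threshold trades dataset size for label quality; a softer rule (for example, a minimum mean score) would keep more items at the cost of noisier answers.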
Each track pair produces three types of comparative questions with step-by-step reasoning.
Is the tempo of electronic_CrawlBack more than twice the tempo of rnb_10Days?
Step 1: rnb_10Days is approximately 83 BPM. Step 2: electronic_CrawlBack is 150 BPM. Step 3: Twice rnb_10Days's tempo is 166 BPM (83 × 2). Step 4: 150 BPM is less than 166 BPM, so electronic_CrawlBack is not more than twice the tempo of rnb_10Days.
No
Which track has the faster tempo?
Step 1: Note track 1 tempo is 81 BPM. Step 2: Note track 2 tempo is 150 BPM. Step 3: Compare numeric BPM values: 150 BPM is greater than 81 BPM, so the faster track is the one at 150 BPM.
guitar_FutureofTears
In one sentence, how do the two tracks differ in genre, energy, and production approach?
Step 1: Identify folk_Standout as laid-back Reggae/Dub with warm, spacious production. Step 2: Identify electronic_RestAssured as high-energy Electro-Industrial/EBM with aggressive synths and tight compression. Step 3: Contrast energy (laid-back vs high-energy), genre (organic reggae/dub vs electronic industrial/EBM), and production (warm delays/reverbs vs aggressive compression and distortion).
folk_Standout is a laid-back Reggae/Dub track with warm, spacious production and mellow one-drop grooves, whereas electronic_RestAssured is a high-energy Electro-Industrial/EBM track built around aggressive synths, tight compression, and a relentless dance-floor drive.
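The examples above suggest a simple record layout, and the yes/no example reduces to a single numeric comparison. The sketch below shows one way to represent an item and re-check that example's arithmetic. The `ComparativeItem` class and its field names are hypothetical and are not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List

# The three question types generated for every track pair.
QUESTION_TYPES = ("yes_no", "short_answer", "sentence")

# Hypothetical container for one comparative QA item.
@dataclass
class ComparativeItem:
    track_a: str
    track_b: str
    question_type: str
    question: str
    reasoning: List[str] = field(default_factory=list)
    answer: str = ""

def more_than_twice(tempo_a_bpm: float, tempo_b_bpm: float) -> bool:
    """True if track A's tempo exceeds double track B's tempo."""
    return tempo_a_bpm > 2 * tempo_b_bpm

# Tempos taken from the yes/no example: 150 BPM vs approximately 83 BPM.
item = ComparativeItem(
    track_a="electronic_CrawlBack",
    track_b="rnb_10Days",
    question_type="yes_no",
    question="Is the tempo of electronic_CrawlBack more than twice the tempo of rnb_10Days?",
    reasoning=["rnb_10Days: ~83 BPM", "electronic_CrawlBack: 150 BPM", "2 x 83 = 166 > 150"],
    answer="Yes" if more_than_twice(150, 83) else "No",
)
print(item.answer)                        # No
print(12_173 * len(QUESTION_TYPES))       # 36519 items over 12,173 pairs
```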
Results on the subset of 2,010 track pairs. Models are evaluated under two paradigms: caption-based (Cap) and multi-audio (Multi).
| Model | Type | Yes/No Acc. | Short Acc. | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|---|
| Music Flamingo | Cap | 82.1% | 88.1% | 2.06 | 25.3 | 5.9 | 19.6 |
| GPT-4o-mini Audio | Multi | 77.3% | 73.2% | 0.87 | 17.8 | 2.7 | 14.7 |
| GPT-4o Audio | Multi | 69.6% | 84.4% | 1.04 | 20.0 | 3.6 | 16.7 |
| Qwen3-Omni | Multi | 59.7% | 75.5% | 1.22 | 19.9 | 3.9 | 16.0 |
| Qwen2-Audio | Multi | 34.4% | 77.7% | 0.57 | 14.3 | 1.4 | 11.3 |
| Qwen2-Audio | Cap | 26.9% | 55.5% | 0.58 | 15.3 | 1.6 | 12.1 |
| MU-LLaMA | Cap | 23.6% | 50.5% | 0.68 | 16.1 | 1.9 | 13.3 |
Key finding: Caption-based Music Flamingo achieves the strongest overall performance, suggesting that high-quality intermediate representations can matter more than direct multi-audio access for comparative reasoning.
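To make the table's columns concrete, the sketch below computes exact-match accuracy (as one might for yes/no and short-answer questions) and a simplified unigram-overlap ROUGE-1 F1 (for sentence-level answers). It is an illustration only: the normalization rules are assumptions, and the benchmark's actual scoring scripts may tokenize, normalize, or scale scores differently.

```python
from collections import Counter
from typing import List

def normalize(text: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace (assumed rules)."""
    return " ".join(text.lower().strip().rstrip(".").split())

def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of predictions that equal their reference after normalization."""
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram overlap, whitespace-tokenized."""
    pred_counts = Counter(normalize(prediction).split())
    ref_counts = Counter(normalize(reference).split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

acc = exact_match_accuracy(["No", "yes."], ["no", "No"])
f1 = rouge1_f1("a b c", "a b d")
```

Exact match is strict by design, so short-answer accuracy depends heavily on how track names and paraphrases are normalized before comparison.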
Dominant error types on incorrectly answered sentence-level questions (LLM judge score < 3).
Model avoids explicit comparison and produces a generic summary of both tracks.
Model introduces musical attributes not supported by the audio or captions.
Comparison is too coarse or too fine-grained relative to the question intent.
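The error analysis above can be sketched as a filter followed by a tally. The record keys and the three error labels below are hypothetical names chosen to mirror the categories listed above; only the judge-score threshold of 3 comes from the text.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical labels for the three error categories described above.
ERROR_TYPES = ("avoids_comparison", "unsupported_attribute", "granularity_mismatch")

def failed_sentence_items(records: List[dict], threshold: int = 3) -> List[dict]:
    """Sentence-level answers whose LLM judge score falls below the threshold."""
    return [
        r for r in records
        if r["question_type"] == "sentence" and r["judge_score"] < threshold
    ]

def error_distribution(records: List[dict]) -> Dict[str, float]:
    """Fraction of failed sentence-level answers assigned to each error type."""
    failed = failed_sentence_items(records)
    counts = Counter(r["error_type"] for r in failed)
    total = len(failed)
    return {e: (counts[e] / total if total else 0.0) for e in ERROR_TYPES}

records = [
    {"question_type": "sentence", "judge_score": 2, "error_type": "avoids_comparison"},
    {"question_type": "sentence", "judge_score": 1, "error_type": "unsupported_attribute"},
    {"question_type": "sentence", "judge_score": 2, "error_type": "avoids_comparison"},
    {"question_type": "sentence", "judge_score": 4, "error_type": None},
    {"question_type": "yes_no", "judge_score": 1, "error_type": None},
]
dist = error_distribution(records)
```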