Jamendo-MT-QA: A Benchmark for
Multi-Track Comparative Music Question Answering

A large-scale dataset of 36,519 comparative QA items over 12,173 track pairs for evaluating music understanding models

Abstract

Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio–language models using both automatic metrics and LLM-as-a-Judge evaluation.

Dataset at a Glance

Jamendo-MT-QA is the first large-scale benchmark specifically designed for multi-track comparative music reasoning.

36,519
QA Items
12,173
Track Pairs
7,061
Unique Tracks
56
Genres Covered

Data Construction Pipeline

An LLM-assisted four-stage pipeline for generating high-quality comparative QA pairs.

1

Caption Generation

Generate rich captions for each track using Music Flamingo, capturing genre, tempo, instrumentation, mood, and production details.

2

Single-Track QA

Produce per-track question–answer pairs from Jamendo-QA as building blocks for comparative questions.

3

Comparative QA Generation

GPT-5-mini generates three comparative questions (yes/no, short-answer, sentence) per track pair with step-by-step reasoning.

4

Quality Filtering

Combine human evaluation with LLM-as-a-Judge scoring and retain only items with perfect semantic scores (5/5/5); 92.9% of the original pairs pass this filter.
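The Stage 4 filter reduces to a simple predicate over the three judge scores. The sketch below is illustrative: the score dimension names and item layout are assumptions, not the pipeline's exact schema.

```python
# Hypothetical Stage 4 filter: an item survives only with a perfect
# 5/5/5 verdict across the three (assumed) semantic score dimensions.

def passes_filter(scores: tuple[int, int, int]) -> bool:
    """Keep an item only if every judge score is 5."""
    return scores == (5, 5, 5)

items = [
    {"id": "pair_001", "scores": (5, 5, 5)},
    {"id": "pair_002", "scores": (5, 4, 5)},  # one imperfect score -> dropped
]
kept = [item["id"] for item in items if passes_filter(item["scores"])]
print(kept)  # -> ['pair_001']
```

A strict all-fives threshold trades recall for precision, which matches the reported outcome: only 7.1% of pairs are discarded.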

Dataset Examples

Each track pair produces three types of comparative questions with step-by-step reasoning.

rnb_10Days vs. electronic_CrawlBack
Question

Is the tempo of electronic_CrawlBack more than twice the tempo of rnb_10Days?

Reasoning

Step 1: rnb_10Days is approximately 83 BPM. Step 2: electronic_CrawlBack is 150 BPM. Step 3: Twice rnb_10Days's tempo is 166 BPM (83 × 2). Step 4: 150 BPM is less than 166 BPM, so electronic_CrawlBack is not more than twice the tempo of rnb_10Days.

Answer

No
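The reasoning above is simple arithmetic; a minimal sketch, using the BPM values from the example, makes the "more than twice" test explicit:

```python
def more_than_twice(tempo_a: float, tempo_b: float) -> bool:
    """True when tempo_a exceeds twice tempo_b."""
    return tempo_a > 2 * tempo_b

# BPM values taken from the example above:
# rnb_10Days ~ 83 BPM, electronic_CrawlBack = 150 BPM
print(more_than_twice(150, 83))  # -> False, since 150 < 166
```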

indie_AboveandBelow vs. guitar_FutureofTears
Question

Which track has the faster tempo?

Reasoning

Step 1: Note indie_AboveandBelow (track 1) is 81 BPM. Step 2: Note guitar_FutureofTears (track 2) is 150 BPM. Step 3: Compare the numeric BPM values: 150 BPM is greater than 81 BPM, so guitar_FutureofTears is the faster track.

Answer

guitar_FutureofTears

folk_Standout vs. electronic_RestAssured
Question

In one sentence, how do the two tracks differ in genre, energy, and production approach?

Reasoning

Step 1: Identify folk_Standout as laid-back Reggae/Dub with warm, spacious production. Step 2: Identify electronic_RestAssured as high-energy Electro-Industrial/EBM with aggressive synths and tight compression. Step 3: Contrast energy (laid-back vs high-energy), genre (organic reggae/dub vs electronic industrial/EBM), and production (warm delays/reverbs vs aggressive compression and distortion).

Answer

folk_Standout is a laid-back Reggae/Dub track with warm, spacious production and mellow one-drop grooves, whereas electronic_RestAssured is a high-energy Electro-Industrial/EBM track built around aggressive synths, tight compression, and a relentless dance-floor drive.

Baseline Leaderboard

Results on an evaluation subset of 2,010 track pairs. Models are evaluated under two input paradigms: caption-based (Cap) and multi-audio (Multi).

Model              Type   Yes/No Acc.  Short Acc.  BLEU  ROUGE-1  ROUGE-2  ROUGE-L
Music Flamingo     Cap    82.1%        88.1%       2.06  25.3     5.9      19.6
GPT-4o-mini Audio  Multi  77.3%        73.2%       0.87  17.8     2.7      14.7
GPT-4o Audio       Multi  69.6%        84.4%       1.04  20.0     3.6      16.7
Qwen3-Omni         Multi  59.7%        75.5%       1.22  19.9     3.9      16.0
Qwen2-Audio        Multi  34.4%        77.7%       0.57  14.3     1.4      11.3
Qwen2-Audio        Cap    26.9%        55.5%       0.58  15.3     1.6      12.1
MU-LLaMA           Cap    23.6%        50.5%       0.68  16.1     1.9      13.3
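For intuition about the overlap metrics in the table, a simplified unigram ROUGE-1 F1 can be written in a few lines. This is a stand-in sketch, not the toolkit implementation behind the reported scores (which typically also applies stemming and other normalization).

```python
from collections import Counter

def rouge1_f(reference: str, hypothesis: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between two whitespace-tokenized strings."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy comparison in the spirit of the short-answer examples above
ref = "guitar_FutureofTears has the faster tempo"
hyp = "the faster track is guitar_FutureofTears"
print(round(rouge1_f(ref, hyp), 3))  # -> 0.6
```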

Key finding: Caption-based Music Flamingo achieves the strongest overall performance, suggesting that high-quality intermediate representations are more important than direct multi-audio access for comparative reasoning.

Error Analysis

Dominant error types on incorrectly answered sentence-level questions (LLM judge score < 3).

Comparative Collapse

Model avoids explicit comparison and produces a generic summary of both tracks.

Music Flamingo: 35.3% · GPT-4o: 56.7% · Qwen3-Omni: 23.4%

Attribute Hallucination

Model introduces musical attributes not supported by the audio or captions.

Music Flamingo: 64.0% · GPT-4o: 35.3% · Qwen3-Omni: 75.9%

Granularity Mismatch

Comparison is too coarse or too fine-grained relative to the question intent.

Music Flamingo: 0.7% · GPT-4o: 8.0% · Qwen3-Omni: 0.7%