Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

Abstract

Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level questions. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio–language models using both automatic metrics and LLM-as-a-Judge evaluation.

Data Construction Pipeline

The dataset is built with an LLM-assisted four-stage pipeline that generates comparative QA pairs and then filters them for quality.

Caption Generation

Generate rich captions for each track using Music Flamingo, capturing genre, tempo, instrumentation, mood, and production details.

Single-Track QA

Produce per-track question–answer pairs from Jamendo-QA as building blocks for comparative questions.

Comparative QA Generation

GPT-5-mini generates three comparative questions (yes/no, short-answer, sentence) per track pair with step-by-step reasoning.

Quality Filtering

Human evaluation and an LLM-as-a-Judge pass retain only items with perfect semantic scores (5/5/5), which keeps 92.9% of the original pairs.
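
The filtering rule can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the field names, and the toy scores are assumptions, and the source does not say whether the 5/5/5 rule is applied per item or per track pair (the sketch applies it per item).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record layout: the source only states that retained items
# have perfect semantic scores (5/5/5). All field names are assumptions.
@dataclass(frozen=True)
class JudgedItem:
    pair_id: str
    question_type: str            # "yes_no", "short_answer", or "sentence"
    scores: Tuple[int, int, int]  # three judge scores on a 1-5 scale

PERFECT = (5, 5, 5)

def keep_perfect(items: List[JudgedItem]) -> List[JudgedItem]:
    """Return only the items whose three scores are all 5."""
    return [item for item in items if item.scores == PERFECT]

def retention_rate(items: List[JudgedItem]) -> float:
    """Fraction of items that survive the filter (0.0 for an empty list)."""
    return len(keep_perfect(items)) / len(items) if items else 0.0

# Toy data, invented purely for illustration.
toy = [
    JudgedItem("pair_a", "yes_no", (5, 5, 5)),
    JudgedItem("pair_a", "short_answer", (5, 4, 5)),
    JudgedItem("pair_b", "sentence", (5, 5, 5)),
    JudgedItem("pair_c", "yes_no", (3, 5, 5)),
]
print(len(keep_perfect(toy)), retention_rate(toy))
```

A strict equality check against (5, 5, 5) keeps the rule easy to audit: any item with even one score below 5 is dropped.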

Dataset Examples

Each track pair produces three types of comparative questions with step-by-step reasoning.
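
One way to picture a single item is the record below. This is a sketch under assumptions: the class name, field names, and question-type labels are hypothetical and do not describe the released file format. The example values are taken from the short-answer example on this page, and the helper reproduces the item count stated in the abstract.

```python
from dataclasses import dataclass
from typing import List

# Labels for the three question types; the exact strings are assumptions.
QUESTION_TYPES = ("yes_no", "short_answer", "sentence")

@dataclass
class ComparativeItem:
    track_a: str
    track_b: str
    question_type: str
    question: str
    reasoning: List[str]  # step-by-step reasoning, one string per step
    answer: str

def total_items(num_pairs: int) -> int:
    """Each track pair yields one item per question type."""
    return num_pairs * len(QUESTION_TYPES)

example = ComparativeItem(
    track_a="indie_AboveandBelow",
    track_b="guitar_FutureofTears",
    question_type="short_answer",
    question="Which track has the faster tempo?",
    reasoning=[
        "Note track 1 tempo is 81 BPM.",
        "Note track 2 tempo is 150 BPM.",
        "150 BPM is greater than 81 BPM, so the faster track is the one at 150 BPM.",
    ],
    answer="guitar_FutureofTears",
)

print(total_items(12_173))  # 36519, the item count given in the abstract
```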

rnb_10Days electronic_CrawlBack

Question

Is the tempo of electronic_CrawlBack more than twice the tempo of rnb_10Days?

Reasoning

Step 1: rnb_10Days is approximately 83 BPM. Step 2: electronic_CrawlBack is 150 BPM. Step 3: Twice rnb_10Days's tempo is 166 BPM (83 × 2). Step 4: 150 BPM is less than 166 BPM, so electronic_CrawlBack is not more than twice the tempo of rnb_10Days.

Answer

No

indie_AboveandBelow guitar_FutureofTears

Question

Which track has the faster tempo?

Reasoning

Step 1: Note track 1 tempo is 81 BPM. Step 2: Note track 2 tempo is 150 BPM. Step 3: Compare numeric BPM values: 150 BPM is greater than 81 BPM, so the faster track is the one at 150 BPM.

Answer

guitar_FutureofTears
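
The numeric reasoning in the two tempo examples above can be replayed in a few lines. The function names are illustrative, and the sketch assumes that "track 1" and "track 2" follow the order in which the pair is listed.

```python
def is_more_than_twice(candidate_bpm: float, reference_bpm: float) -> bool:
    """True if candidate_bpm is strictly greater than twice reference_bpm."""
    return candidate_bpm > 2 * reference_bpm

def faster_track(tempos: dict) -> str:
    """Return the track ID with the highest BPM."""
    return max(tempos, key=tempos.get)

# Yes/no example: 2 * 83 = 166, and 150 is not greater than 166.
print(is_more_than_twice(150, 83))  # False

# Short-answer example: 150 BPM beats 81 BPM.
print(faster_track({"indie_AboveandBelow": 81, "guitar_FutureofTears": 150}))
# guitar_FutureofTears
```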

folk_Standout electronic_RestAssured

Question

In one sentence, how do the two tracks differ in genre, energy, and production approach?

Reasoning

Step 1: Identify folk_Standout as laid-back Reggae/Dub with warm, spacious production. Step 2: Identify electronic_RestAssured as high-energy Electro-Industrial/EBM with aggressive synths and tight compression. Step 3: Contrast energy (laid-back vs high-energy), genre (organic reggae/dub vs electronic industrial/EBM), and production (warm delays/reverbs vs aggressive compression and distortion).

Answer

folk_Standout is a laid-back Reggae/Dub track with warm, spacious production and mellow one-drop grooves, whereas electronic_RestAssured is a high-energy Electro-Industrial/EBM track built around aggressive synths, tight compression, and a relentless dance-floor drive.

Baseline Leaderboard

Results on an evaluation subset of 2,010 track pairs. Models are evaluated under two paradigms: caption-based (Cap) and multi-audio (Multi).

Model	Type	Yes/No Acc.	Short Acc.	BLEU	ROUGE-1	ROUGE-2	ROUGE-L
Music Flamingo	Cap	82.1%	88.1%	2.06	25.3	5.9	19.6
GPT-4o-mini Audio	Multi	77.3%	73.2%	0.87	17.8	2.7	14.7
GPT-4o Audio	Multi	69.6%	84.4%	1.04	20.0	3.6	16.7
Qwen3-Omni	Multi	59.7%	75.5%	1.22	19.9	3.9	16.0
Qwen2-Audio	Multi	34.4%	77.7%	0.57	14.3	1.4	11.3
Qwen2-Audio	Cap	26.9%	55.5%	0.58	15.3	1.6	12.1
MU-LLaMA	Cap	23.6%	50.5%	0.68	16.1	1.9	13.3

Key finding: Caption-based Music Flamingo achieves the strongest overall performance, suggesting that high-quality intermediate representations are more important than direct multi-audio access for comparative reasoning.
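
For readers unfamiliar with the leaderboard columns, the sketch below shows how an exact-match accuracy and a unigram ROUGE-1 F1 score can be computed. It is a simplified illustration and not the scorer behind the numbers above: the source does not state the answer-matching rule or the ROUGE implementation, so the normalization, whitespace tokenization, and lack of stemming are all assumptions.

```python
from collections import Counter
from typing import List

def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop trailing periods (assumed rule)."""
    return text.strip().lower().rstrip(".")

def accuracy(predictions: List[str], references: List[str]) -> float:
    """Share of predictions that match the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 using whitespace tokens and clipped counts."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy inputs, invented for illustration.
print(round(accuracy(["Yes", "no", "No."], ["yes", "yes", "no"]), 3))  # 0.667
print(round(rouge1_f1("the fast track", "the slow track"), 3))         # 0.667
```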

Error Analysis

Dominant error types on incorrectly answered sentence-level questions (LLM judge score < 3).

Comparative Collapse

Model avoids explicit comparison and produces a generic summary of both tracks.

Music Flamingo: 35.3%; GPT-4o: 56.7%; Qwen3-Omni: 23.4%

Attribute Hallucination

Model introduces musical attributes not supported by the audio or captions.

Music Flamingo: 64.0%; GPT-4o: 35.3%; Qwen3-Omni: 75.9%

Granularity Mismatch

Comparison is too coarse or too fine-grained relative to the question intent.

Music Flamingo: 0.7%; GPT-4o: 8.0%; Qwen3-Omni: 0.7%
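
The percentages above can be collected into one table to see each model's dominant error type and to check that its three shares sum to 100%. The numbers are copied from this section; the variable and function names are illustrative.

```python
# Error shares (percent of low-scoring sentence-level answers), per model.
ERROR_SHARES = {
    "Comparative Collapse": {"Music Flamingo": 35.3, "GPT-4o": 56.7, "Qwen3-Omni": 23.4},
    "Attribute Hallucination": {"Music Flamingo": 64.0, "GPT-4o": 35.3, "Qwen3-Omni": 75.9},
    "Granularity Mismatch": {"Music Flamingo": 0.7, "GPT-4o": 8.0, "Qwen3-Omni": 0.7},
}

def dominant_error(model: str) -> str:
    """Error type with the largest share for the given model."""
    return max(ERROR_SHARES, key=lambda error: ERROR_SHARES[error][model])

def total_share(model: str) -> float:
    """Sum of the model's shares, rounded to one decimal place."""
    return round(sum(shares[model] for shares in ERROR_SHARES.values()), 1)

for model in ("Music Flamingo", "GPT-4o", "Qwen3-Omni"):
    print(model, dominant_error(model), total_share(model))
```

Under these figures, attribute hallucination dominates for Music Flamingo and Qwen3-Omni, while comparative collapse dominates for GPT-4o.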
