VQA Evaluation Flaws

Unexplored Flaws in Multiple-Choice VQA Evaluations

⚠️ Preprint, 2025

Overview

Do you trust MLLM benchmark scores? You probably shouldn’t — at least not blindly. We uncover previously unknown biases in multiple-choice VQA evaluations that have nothing to do with answer order and everything to do with how the prompt is formatted.

Through a large-scale study spanning 7 MLLMs, 5 VQA datasets, and 48 prompt format variations, we show that even semantically neutral formatting changes cause dramatic score swings — and existing bias mitigations fail to fix them.

Key Findings

  • 📝 3 new bias factors identified in prompt formatting alone
  • 🔬 48 prompt variations systematically evaluated across 7 models and 5 datasets
  • 📉 High sensitivity to minor, semantically neutral formatting changes
  • 🚫 Existing mitigations fail — current bias reduction strategies don’t address these biases
  • 🔓 Independent of known issues — persists regardless of answer order bias or model confidence

Key Contributions

  • 🧪 First systematic study of prompt format biases in multiple-choice VQA
  • 📊 Large-scale empirical evidence questioning the reliability of current MLLM benchmarks
  • ⚠️ Demonstrates a fundamental evaluation challenge that the community must address

Why It Matters

If benchmark scores change based on whether you use “A)” vs “(A)” in your prompt, can we really compare models fairly? This work is a wake-up call for the MLLM evaluation community — formatting matters more than anyone thought.

(Rosenthal et al., 2025)
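To make the kind of change we mean concrete, here is a minimal, hypothetical sketch (not the paper's evaluation code) that renders one multiple-choice VQA question under a few semantically neutral option-label styles such as "A)", "(A)", and "A.". The question, options, and style names are invented for illustration.

```python
# Illustrative sketch only -- not the evaluation harness used in the paper.
# It shows how the same multiple-choice VQA question can be prompted under
# different, semantically neutral option-label formats.

OPTION_STYLES = {
    "paren_right": lambda label: f"{label})",   # A)
    "paren_both":  lambda label: f"({label})",  # (A)
    "dot":         lambda label: f"{label}.",   # A.
}

def format_prompt(question: str, choices: list[str], style: str) -> str:
    """Render a multiple-choice prompt with the given option-label style."""
    labeler = OPTION_STYLES[style]
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{labeler(label)} {choice}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical example question and answer options.
    q = "What color is the bus in the image?"
    opts = ["red", "blue", "green", "yellow"]
    for style in OPTION_STYLES:
        print(f"--- {style} ---")
        print(format_prompt(q, opts, style))
        print()
```

Under the paper's findings, differences as small as these label styles can shift benchmark scores, even though the underlying question and answer set are identical.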

References


  1. Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, et al. Unexplored Flaws in Multiple-Choice VQA Evaluations. arXiv preprint, Nov 2025.