Unexplored Flaws in Multiple-Choice VQA Evaluations
⚠️ Preprint, 2025
Overview
Do you trust MLLM benchmark scores? You probably shouldn’t — at least not blindly. We uncover previously unknown biases in multiple-choice VQA evaluations that have nothing to do with answer order and everything to do with how the prompt is formatted.
Through a large-scale study spanning 7 MLLMs, 5 VQA datasets, and 48 prompt format variations, we show that even semantically neutral formatting changes cause dramatic score swings — and existing bias mitigations fail to fix them.
Key Findings
📝 3 new bias factors identified in prompt formatting alone
🔬 48 prompt variations systematically evaluated across 7 models and 5 datasets
📉 High sensitivity to minor, semantically neutral formatting changes
🚫 Existing mitigations fail — current bias reduction strategies don’t address these biases
🔓 Independent of known issues — persists regardless of answer order bias or model confidence
Key Contributions
🧪 First systematic study of prompt format biases in multiple-choice VQA
📊 Large-scale empirical evidence questioning the reliability of current MLLM benchmarks
⚠️ Demonstrates a fundamental evaluation challenge that the community must address
Why It Matters
If benchmark scores change based on whether you use “A)” vs “(A)” in your prompt, can we really compare models fairly? This work is a wake-up call for the MLLM evaluation community — formatting matters more than anyone thought.
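To make the point concrete, here is a minimal sketch (illustrative only; the label styles below are hypothetical examples of semantically neutral variants, not necessarily the exact formats studied in the paper) that renders the same multiple-choice question under several option-label conventions:

```python
# Illustrative sketch: the same VQA question rendered under different,
# semantically neutral option-label styles. The styles are hypothetical
# examples, not necessarily the exact formats evaluated in the paper.

QUESTION = "What animal is shown in the image?"
OPTIONS = ["cat", "dog", "horse", "rabbit"]

LABEL_STYLES = {
    "paren_after": lambda i: f"{chr(65 + i)}) ",   # "A) "
    "paren_both":  lambda i: f"({chr(65 + i)}) ",  # "(A) "
    "dot_after":   lambda i: f"{chr(65 + i)}. ",   # "A. "
}

def render_prompt(style: str) -> str:
    """Build one multiple-choice prompt using the given label style."""
    label = LABEL_STYLES[style]
    lines = [QUESTION] + [label(i) + opt for i, opt in enumerate(OPTIONS)]
    return "\n".join(lines)

for style in LABEL_STYLES:
    print(f"--- {style} ---\n{render_prompt(style)}\n")
```

A human reader would treat all three renderings as the same question; the claim here is that a model's benchmark score may not.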
Abstract
Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier work has already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that call into question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we show that existing bias mitigation strategies fail to address these newly identified biases.
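As a rough sketch of how such a sensitivity analysis could be set up, the snippet below enumerates a small grid of formatting factors, scores a model on each variant, and reports the spread between the best and worst variant. The factor values, the dataset fields, and the `ask_model` callable are assumptions made for illustration; they do not reproduce the paper's actual protocol.

```python
from itertools import product

# Hypothetical formatting factors; the paper's three factors may differ.
LABEL_STYLES = ["{L}) ", "({L}) ", "{L}. "]   # how options are labeled
SEPARATORS = ["\n", "  "]                      # how options are joined
HEADERS = ["Options:", "Choices:", ""]         # optional header before options

def build_prompt(question, options, label_style, separator, header):
    """Render one multiple-choice prompt under a given formatting variant."""
    labeled = [
        label_style.format(L=chr(65 + i)) + opt for i, opt in enumerate(options)
    ]
    parts = [question] + ([header] if header else []) + [separator.join(labeled)]
    return "\n".join(parts)

def format_sensitivity(dataset, ask_model):
    """Accuracy per formatting variant and the max-min spread.

    Assumed interfaces for illustration: `dataset` is a list of dicts with
    keys image, question, options, answer_idx; `ask_model(prompt, image)`
    returns a letter such as 'A'.
    """
    accuracy = {}
    for variant in product(LABEL_STYLES, SEPARATORS, HEADERS):
        correct = 0
        for sample in dataset:
            prompt = build_prompt(sample["question"], sample["options"], *variant)
            correct += int(ask_model(prompt, sample["image"])
                           == chr(65 + sample["answer_idx"]))
        accuracy[variant] = correct / len(dataset)
    spread = max(accuracy.values()) - min(accuracy.values())
    return accuracy, spread
```

If the spread is large across variants that read identically to a human, the reported benchmark score reflects the prompt template as much as the model's ability.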
Citation
@article{Rosenthal2025,
  title   = {Unexplored flaws in multiple-choice VQA evaluations},
  author  = {Rosenthal, Fabio and Schmidt, Sebastian and Graf, Thorsten and Bagodonat, Thorsten and G\"{u}nnemann, Stephan and Schwinn, Leo},
  year    = {2025},
  month   = nov,
  journal = {ArXiv},
  volume  = {2511.22341},
  url     = {http://arxiv.org/abs/2511.22341},
}