Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning
Abstract
Continual Visual Question Answering (Continual VQA) poses unique challenges for multimodal continual learning, requiring models to incrementally acquire new knowledge while preserving visual–semantic grounding across tasks. However, existing benchmarks hinder fair and robust evaluation of such capabilities, as they allow models to exploit dataset biases rather than demonstrate genuine continual reasoning. We identify two structural flaws in current benchmark design. First, shared answer vocabularies across tasks encourage answer memorization, inflating performance and underestimating forgetting. Second, static answer priors within each task make the training and test answer distributions nearly identical, obscuring robustness under distribution shifts. To address these issues, we introduce UCo-VQA, an Unbiased benchmark suite that enforces token-level disjoint answer spaces across tasks and imposes intra-task train–test distribution shifts, enabling fairer assessment of forgetting and generalization in multimodal continual learning. We further provide a parameter-efficient baseline that mitigates forgetting and enhances grounding through question-only replay and dual-level distillation, offering a lightweight and memory-efficient framework for continual adaptation. Extensive experiments on UCo-VQA reveal that prior methods substantially overestimate performance under biased setups, while our approach achieves state-of-the-art results, improving robustness and retention by up to 4.18% and 2.21%, respectively.
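To make the token-level disjointness constraint concrete, the sketch below shows one way such a check could be implemented; it is an illustrative example under assumed per-task answer lists, with hypothetical function names (answer_tokens, check_token_disjoint) and toy task splits, not the benchmark's actual construction code.

from typing import Dict, List, Set, Tuple


def answer_tokens(answers: List[str]) -> Set[str]:
    """Collect every whitespace-delimited token appearing in a task's answer set."""
    return {tok.lower() for ans in answers for tok in ans.split()}


def check_token_disjoint(task_answers: Dict[str, List[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return the token overlap between every pair of tasks (empty dict means token-disjoint)."""
    token_sets = {task: answer_tokens(ans) for task, ans in task_answers.items()}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    tasks = list(token_sets)
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            shared = token_sets[a] & token_sets[b]
            if shared:
                overlaps[(a, b)] = shared
    return overlaps


if __name__ == "__main__":
    # Hypothetical task splits; the real UCo-VQA task definitions may differ.
    tasks = {
        "counting": ["two", "three dogs"],
        "color": ["red", "dark blue"],
    }
    print(check_token_disjoint(tasks))  # {} -> answer spaces are token-level disjoint

In this spirit, a benchmark builder would reject (or re-partition) any task sequence for which the returned overlap dictionary is non-empty, so that no answer token seen in one task can be reused to shortcut a later one.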