VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Chenhui Gou ⋅ Zilong Chen ⋅ Zeyu Wang ⋅ Feng Li ⋅ Deyao Zhu ⋅ Zicheng Duan ⋅ Kunchang Li ⋅ Chaorui Deng ⋅ Hongyi Yuan ⋅ Haoqi Fan ⋅ Cihang Xie ⋅ Jianfei Cai ⋅ Hamid Rezatofighi
Abstract
This paper studies \textit{Visual Question–Visual Answering (VQ-VA)}: generating an image, rather than text, in response to a visual question---an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Deployed at web scale, this pipeline crawls and curates $\sim$1.8M high-quality, interleaved image–text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of \textit{world knowledge}, \textit{design knowledge}, and \textit{reasoning}. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (\emph{i.e.}, 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1) and significantly narrowing the gap toward leading proprietary systems (\emph{e.g.}, 81.67 from NanoBanana; 82.64 from GPT-Image). We release the full suite of model weights, datasets, and pipelines, which we hope will stimulate future research on VQ-VA.