ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
Abstract
Multimodal Web Agents demonstrate a practically valuable capability: by fusing information from diverse modalities (e.g., text and vision), retrieved iteratively from the internet, they can respond to complex user queries. However, the visual modality is prone to information overload, and the noise it contains, such as irrelevant background details or complex structures, can disrupt the model's attention and misdirect its operational focus onto an erroneous path. To address this challenge, we propose ReFAct (Reasoning, Focusing, and Acting), a novel framework that empowers the agent to actively manage its cross-modal context, allowing it to adjust its operational focus and thereby mitigating the impact of noise on multimodal Web Agents. Specifically, ReFAct employs a Grounding tool for active visual perception that dynamically filters information. We also design external-memory-based Defocus/Refocus operations for selective information retention, further modulating the information density of the multimodal context. Together, these mechanisms ensure the agent maintains focus throughout problem-solving. To evaluate and enhance agent capabilities in complex and noisy multimodal contexts, we first propose a pipeline for constructing datasets of flexible complexity, and we introduce a new open-source benchmark, GroundedVQA. Finally, we experimentally demonstrate the effectiveness of our proposed method on GroundedVQA and other widely used benchmarks.
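To make the context-management idea concrete, the sketch below illustrates one plausible reading of the Defocus/Refocus operations described above: an observation can be moved out of the working context into an external memory (leaving only a short summary, lowering information density) and later retrieved in full when it becomes relevant again. This is a minimal illustration, not the authors' implementation; all class and method names (`Observation`, `FocusingContext`, `defocus`, `refocus`, `render_prompt`) are hypothetical.

```python
# Hypothetical sketch of external-memory Defocus/Refocus operations.
# Not the ReFAct implementation; names and data layout are assumptions.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Observation:
    """A single multimodal observation held in the agent's context."""
    key: str
    content: str   # full detail, e.g., OCR text or a serialized screenshot region
    summary: str   # short stand-in shown while the item is defocused


@dataclass
class FocusingContext:
    """Working context plus an external memory for defocused observations."""
    working: dict[str, Observation] = field(default_factory=dict)
    external: dict[str, Observation] = field(default_factory=dict)

    def add(self, obs: Observation) -> None:
        self.working[obs.key] = obs

    def defocus(self, key: str) -> None:
        """Move a noisy or currently irrelevant observation to external
        memory, so only its summary occupies the prompt."""
        self.external[key] = self.working.pop(key)

    def refocus(self, key: str) -> None:
        """Retrieve a previously defocused observation back in full detail."""
        self.working[key] = self.external.pop(key)

    def render_prompt(self) -> str:
        """Serialize the context: full content for focused items,
        one-line summaries for defocused ones."""
        lines = [f"[FOCUSED] {o.key}: {o.content}" for o in self.working.values()]
        lines += [f"[DEFOCUSED] {o.key}: {o.summary}" for o in self.external.values()]
        return "\n".join(lines)


if __name__ == "__main__":
    ctx = FocusingContext()
    ctx.add(Observation("nav_bar", "long list of 40 menu links ...", "site navigation bar"))
    ctx.add(Observation("price_table", "Item A: $12 | Item B: $7", "product price table"))
    ctx.defocus("nav_bar")   # treat navigation as noise; keep only its summary
    print(ctx.render_prompt())
    ctx.refocus("nav_bar")   # bring full detail back if a later step needs it
```

Under this reading, the agent's reasoning loop would call `defocus`/`refocus` as actions alongside its web operations, keeping the serialized prompt short while preserving access to the full observations.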