
AssistGUI: Task-Oriented PC Graphical User Interface Automation

Difei Gao · Lei Ji · Zechen Bai · Mingyu Ouyang · Peiran Li · Dongxing Mao · Qin WU · Weichen Zhang · Peiyi Wang · Xiangwu Guo · Hengxu Wang · Luowei Zhou · Mike Zheng Shou

Arch 4A-E Poster #352
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Models (LLMs) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, those tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely used software applications, such as After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose a multi-agent collaboration framework, which incorporates four agents to perform task decomposition, GUI parsing, action generation, and reflection. Our experimental results reveal that our multi-agent collaboration mechanism outperforms existing methods. Nevertheless, substantial room for improvement remains, with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations, setting the stage for future breakthroughs in this domain.
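The four-agent collaboration described above can be illustrated with a minimal control loop. This is a hedged sketch, not the paper's implementation: the agent functions (`plan`, `parse_gui`, `act`, `reflect`), the `GUIState` type, and the retry policy are all hypothetical stand-ins for what would, in practice, be LLM calls and a real GUI environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical roles mirroring the paper's description: a planner decomposes
# the task, a parser reads the GUI state, an actor proposes a mouse/keyboard
# action, and a critic reflects on the outcome to decide whether to retry.

@dataclass
class GUIState:
    widgets: List[str] = field(default_factory=list)

def plan(task: str) -> List[str]:
    # Task-decomposition stub: split a user request into subtask strings.
    return [s.strip() for s in task.split(";") if s.strip()]

def parse_gui(state: GUIState) -> str:
    # GUI-parsing stub: summarize visible widgets as text for the actor.
    return ", ".join(state.widgets) or "<empty screen>"

def act(subtask: str, gui_text: str) -> str:
    # Action-generation stub: emit a command description for the subtask.
    return f"click target for '{subtask}' given [{gui_text}]"

def reflect(action: str, succeeded: Callable[[str], bool]) -> bool:
    # Reflection stub: accept the step if the environment reports success.
    return succeeded(action)

def run(task: str, state: GUIState, succeeded: Callable[[str], bool],
        max_retries: int = 2) -> bool:
    """Run the plan-parse-act-reflect loop; retry failed steps."""
    for subtask in plan(task):
        for _ in range(max_retries + 1):
            if reflect(act(subtask, parse_gui(state)), succeeded):
                break
        else:
            return False  # subtask failed after exhausting retries
    return True

if __name__ == "__main__":
    state = GUIState(widgets=["File menu", "Timeline panel"])
    print(run("open project; add keyframe", state, succeeded=lambda a: True))
```

In a real system each stub would be an LLM-backed agent, and `succeeded` would come from checking the project file against the ground truth, as the benchmark's per-task project files enable.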
