Poster Fri, Jun 5, 2026 • 9:45 AM – 11:45 AM PDT ExHall A-F 579

MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents

Xuehui Wang ⋅ Zhenyu Wu ⋅ JingJing Xie ⋅ Zichen Ding ⋅ Bowen Yang ⋅ Zehao Li ⋅ Zhaoyang Liu ⋅ Qingyun Li ⋅ Xuan Dong ⋅ Zhe Chen ⋅ Weiyun Wang ⋅ Xiangyu Zhao ⋅ Jixuan Chen ⋅ Haodong Duan ⋅ Tianbao Xie ⋅ Chenyu Yang ⋅ Shiqian Su ⋅ Yue Yu ⋅ Yanting Zhang ⋅ Xiangyu Yue ⋅ Weijie Su ⋅ Xizhou Zhu ⋅ Wei Shen ⋅ Jifeng Dai ⋅ Wenhai Wang

Abstract

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web. The benchmark spans four levels: Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. To assess both effectiveness and efficiency, we further propose the Efficiency–Quality-Aware (EQA) metric, which measures task success alongside action redundancy. Extensive evaluations reveal that precise visual grounding is the critical determinant of performance, underscoring the advantages of modular designs with specialized grounding modules. Moreover, all agents suffer from substantial inefficiencies, frequently completing tasks with excessive steps despite eventual success. Performance also degrades on complex or cross-application tasks, exposing weaknesses in memory, planning, and adaptive reasoning. By providing broad coverage, standardized protocols, and novel metrics, MMBench-GUI establishes the first comprehensive foundation for advancing GUI agent research.
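The abstract does not give the EQA formula, so the following is not the paper's metric. It is only a minimal sketch of the general idea of scoring task success together with action redundancy. The `Episode` record, the `reference_steps` field (an assumed shortest known action sequence per task), and the `efficiency_weighted_success` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on one task (hypothetical record layout)."""
    success: bool
    steps_taken: int
    reference_steps: int  # assumption: length of a shortest known solution


def success_rate(episodes):
    """Plain success rate, which ignores how many steps were used."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def efficiency_weighted_success(episodes):
    """Illustrative score, NOT the paper's EQA definition.

    A successful episode earns reference_steps / steps_taken, capped at 1,
    so redundant actions shrink its credit. Failed episodes earn 0.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.success and e.steps_taken > 0:
            total += min(1.0, e.reference_steps / e.steps_taken)
    return total / len(episodes)


episodes = [
    Episode(success=True, steps_taken=5, reference_steps=5),    # efficient success
    Episode(success=True, steps_taken=20, reference_steps=5),   # redundant success
    Episode(success=False, steps_taken=15, reference_steps=6),  # failure
]

print(success_rate(episodes))
print(efficiency_weighted_success(episodes))
```

In this toy data two of three tasks succeed, but the second success uses four times the reference number of steps, so the efficiency-weighted score falls well below the plain success rate. This mirrors the abstract's observation that agents often complete tasks with excessive steps despite eventual success.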