UI-Lens: Assessing General MLLMs’ Potential to Automate UI Display Quality Assurance
Abstract
User Interface (UI) display defect detection poses challenges far beyond general UI understanding: it requires fine-grained perception of element boundaries, detection of missing content, and reasoning about semantic consistency across sequential interfaces. However, the ability of multimodal large language models (MLLMs) and vision-language models (VLMs) to detect UI defects in realistic, complex interfaces has not been systematically validated. To fill this gap, we present UI-Lens, the first multi-dimensional benchmark for UI display defect detection in Chinese-language UI scenarios. The dataset comprises 4,759 pages meticulously annotated by design experts, covering six core display defect categories. We conduct a systematic evaluation of 10 mainstream models (8 closed-source, 2 open-source). The results reveal clear shortcomings in current models: on tasks requiring fine-grained element boundary understanding, performance is near random, with task-average F1 scores of 20.36% on Text Overflow and 31.21% on Container Overlap; on sequential interface semantic consistency (e.g., Text Inconsistency), the task-average F1 score is only 10.61%, indicating severe underperformance. We release UI-Lens to catalyze research toward UI display defect detection with robust, fine-grained boundary awareness in realistic, complex interfaces.