Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
Delong Liu ⋅ Haotian Hou ⋅ Zhaohui Hou ⋅ Zhiyuan Huang ⋅ Shihao Han ⋅ Mingjie Zhan ⋅ Zhicheng Zhao ⋅ Fei Su
Abstract
Precise and controllable image editing remains a significant challenge. Current methods often rely on text prompts alone, yet accurate spatial localization is inherently difficult to achieve through textual descriptions. Mask-based approaches offer better control but typically demand overly precise user annotations, which increases user burden and often yields unnatural results. To bridge this gap, we introduce the **I**nteractive **I**nstruction-based **I**mage **E**diting ($I_3E$) task, which requires generating high-quality edits from a more intuitive input combination: concise text instructions paired with imprecise spatial guidance. To address the critical lack of suitable data, we propose an efficient pipeline to build Inter-Edit, a new million-scale training dataset that simulates realistic user masks that are not strictly aligned with object segments. We also present a comprehensive benchmark featuring a meticulously human-annotated test set that covers diverse, localization-dependent editing scenarios and realistic user interaction patterns. To evaluate this task, we introduce a suite of position-aware metrics that correlate strongly with human perceptual judgments. Finally, we develop three baseline models trained on Inter-Edit. Extensive experiments show that our methods substantially improve $I_3E$ performance in both localization and edit quality, outperforming existing state-of-the-art models. The Inter-Edit dataset and all related code will be made publicly available.
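The abstract does not specify how the position-aware metrics are computed. As a hedged illustration only, the sketch below shows one way an evaluation could separate change inside the user-indicated region from preservation outside it. The function name `position_aware_scores`, the `dilation` tolerance for imprecise masks, and the simple L1 terms are hypothetical placeholders, not the metrics proposed in the paper.

```python
import numpy as np
from scipy.ndimage import binary_dilation


def position_aware_scores(source, edited, user_mask, dilation=15):
    """Toy position-aware scoring (hypothetical, not the paper's metric).

    source, edited : float arrays in [0, 1] with shape (H, W, 3)
    user_mask      : bool array of shape (H, W), the rough region the
                     user indicated for editing
    dilation       : number of dilation iterations used to tolerate
                     imprecise user strokes (assumed tolerance)
    """
    # Expand the rough user mask so slightly-off edits are not penalized.
    region = binary_dilation(user_mask, iterations=dilation)

    # Preservation: pixels outside the (dilated) region should stay unchanged.
    outside_l1 = (
        float(np.abs(source - edited)[~region].mean()) if (~region).any() else 0.0
    )
    # Edit magnitude: a crude proxy for "something changed where asked".
    inside_change = (
        float(np.abs(source - edited)[region].mean()) if region.any() else 0.0
    )
    return {"outside_l1": outside_l1, "inside_change": inside_change}
```

Under this toy scoring, a successful edit would show a low `outside_l1` (the rest of the image is preserved) and a non-trivial `inside_change`; the metrics actually proposed in the paper presumably also assess semantic edit quality and alignment with human judgments.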