VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
Zhizhou Chen ⋅ Shanyan Guan ⋅ Zhanxin Gao ⋅ En Ci ⋅ Yanhao Ge ⋅ Wei Li ⋅ Zhenyu Zhang ⋅ Jian Yang ⋅ Ying Tai
Abstract
Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge of modeling high-frequency textural details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution ($\geq$4096×4096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. To address the second challenge, we propose a high-frequency-aware post-adaptation strategy that enables previously non-high-resolution models to accurately generate fine-grained, high-frequency details. We further present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work delivers superior fine-grained detail and texture realism in UHR image editing. The dataset and code will be released.