UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
Abstract
Recent advances in diffusion models and vision-language models (VLMs) have significantly enhanced the controllability of image editing. Methods like FlowEdit enable step-by-step editing along a visible, noise-free trajectory, where each intermediate result is a clear image, eliminating the need for full noise inversion. However, these approaches still operate in pixel space or VAE latent space, where intermediate outputs often suffer from visual artifacts, distortions, or unrealistic details, making reliable semantic evaluation difficult. Furthermore, they remain open-loop systems, applying static edits without feedback to guide or correct the editing process adaptively. We propose UniEdit-I, the first training-free, closed-loop image editing framework that operates entirely within the semantic latent space of a unified VLM, built around an Understanding–Editing–Verifying (UEV) loop: (1) Understanding parses the source image and editing instruction into a structured source prompt and a minimal target specification; (2) Editing applies dynamic semantic offsets, with a configurable feedback weighting mechanism that adaptively modulates editing intensity based on real-time alignment feedback; (3) Verifying leverages the VLM's own multimodal reasoning capability to evaluate the intermediate output along multiple semantic dimensions and trigger early stopping or refinement. By transforming the VLM from a post-hoc evaluator into an in-process conductor, UniEdit-I establishes a semantics-driven, self-correcting closed-loop image editing pipeline. Evaluated on GEdit-Bench, UniEdit-I achieves state-of-the-art performance without any fine-tuning or architectural modifications, and even surpasses several large-scale pre-trained editors.
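The UEV control flow described above can be sketched in a few lines. This is a minimal, illustrative sketch only, not the paper's implementation: the "semantic latent" is reduced to a single float, the `understand` and `verify` functions are toy stand-ins for the VLM's parsing and multimodal scoring, and all parameter names and values (`base_step`, `threshold`, `max_iters`) are hypothetical choices made here for demonstration.

```python
# Toy sketch of the Understanding-Editing-Verifying (UEV) closed loop.
# All functions and numbers below are illustrative stand-ins, not the
# paper's API: the semantic latent is a single float, and the verifier
# simply scores distance to the target specification.

def understand(source_latent: float, instruction_target: float):
    """Understanding: parse source image + instruction into a structured
    (source, minimal-target) pair. Here both are plain floats."""
    return source_latent, instruction_target

def verify(latent: float, target: float) -> float:
    """Verifying: return an alignment score in [0, 1]; 1 = perfect match.
    A real system would query the VLM's multimodal reasoning instead."""
    return max(0.0, 1.0 - abs(target - latent))

def uev_edit(source: float, target: float,
             base_step: float = 0.5, threshold: float = 0.95,
             max_iters: int = 50):
    """Editing loop: apply a semantic offset whose magnitude is modulated
    by real-time alignment feedback, with early stopping on success."""
    latent, spec = understand(source, target)
    for i in range(max_iters):
        score = verify(latent, spec)
        if score >= threshold:        # early stopping: edit is aligned
            return latent, i
        # Feedback weighting: low alignment -> stronger edit intensity.
        weight = base_step * (1.0 - score)
        latent += weight * (spec - latent)   # dynamic semantic offset
    return latent, max_iters
```

The loop shrinks its edit steps as alignment improves, so early iterations make coarse semantic moves and later ones only refine; for example, `uev_edit(0.0, 1.0)` converges to within the threshold well before the iteration cap.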