NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
Abstract
Diffusion-based text-to-image (T2I) models have enabled remarkable generative capabilities, yet precise text-based image editing that preserves the structural and perceptual fidelity of the original image remains challenging. Existing approaches either retrain on large bespoke datasets, incurring significant computational and curation costs, or adopt lightweight fine-tuning strategies that still require per-instance optimization and often fail on fine-grained or semantically complex edits. We propose NEAF (Natural image Editing with Attention Fusion), a zero-shot, training-free framework that applies to arbitrary T2I models, obviating the need for dataset curation or retraining. NEAF introduces a lightweight, learnable XA-Conductor module that dynamically identifies the cross-attention contributions most relevant to the edit: at test time, it optimizes a weight vector that adaptively fuses the cross-attention maps derived from the source, edited, and reconstruction branches. This triadic-feedback optimization strategy instantiates the user's directives precisely while faithfully preserving unedited regions. Extensive experiments validate NEAF as a flexible and general framework that consistently surpasses existing methods across diverse editing tasks, with particularly strong results in complex, non-rigid editing scenarios where other approaches falter.
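To make the fusion step described in the abstract concrete, the sketch below illustrates one plausible form of the weighted fusion the XA-Conductor might perform: a softmax-normalized weight vector combines the cross-attention maps of the three branches into a convex mixture. The function name and the softmax normalization are assumptions for illustration; the paper's actual parameterization of the weight vector may differ.

```python
import numpy as np

def fuse_attention_maps(A_src, A_edit, A_rec, w):
    """Fuse cross-attention maps from the source, edited, and
    reconstruction branches using a learnable weight vector w.

    The weights are softmax-normalized, so the fused map is a
    convex combination of the three branch maps (illustrative
    assumption, not necessarily the paper's exact scheme)."""
    w = np.exp(w - np.max(w))      # numerically stable softmax
    w = w / w.sum()
    return w[0] * A_src + w[1] * A_edit + w[2] * A_rec

# Toy 4x4 attention maps standing in for one token's map.
A_src  = np.full((4, 4), 0.2)
A_edit = np.full((4, 4), 0.8)
A_rec  = np.full((4, 4), 0.5)

# With equal (zero) logits, the fusion reduces to the mean map.
fused = fuse_attention_maps(A_src, A_edit, A_rec, np.zeros(3))
print(np.allclose(fused, 0.5))  # True
```

In a test-time optimization loop, the weight vector `w` would be the quantity being updated, driven by a loss comparing the edited branch against the source and reconstruction branches.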