Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion
Naresh Kumar Devulapally ⋅ Shruti Agarwal ⋅ Vishal Asnani ⋅ Vishnu Suresh Lokhande
Abstract
Crafting prompts via Prompt Engineering that steer a model’s internal representations toward specific, pre-defined outcomes can be time-consuming, often requiring multiple iterations. Hard Prompt Inversion offers a complementary workflow: start from a reference image and generate a prompt that conditions a text-to-image (T2I) model to reconstruct it. Existing inversion methods either yield incoherent text or produce prompts that are overly sensitive to downstream token edits. We propose a dLLM-based prompt inversion framework that yields prompts that are (i) more interpretable to humans, (ii) better aligned with the reference image, and (iii) designed for downstream token-swap and token-append operations (i.e., edit-friendly prompts). The method is plug-and-play, requiring no finetuning of either the T2I model or the dLLM. Experiments across three datasets show a $\sim10\times$ reduction in inversion time relative to existing prompt-inversion baselines, higher interpretability scores, and significantly higher prompt editability, as measured by TIFA, GPT-V preference scoring, and controlled user studies, all while preserving high-fidelity image generation. By coupling diffusion-time sampling with token-similarity control inside a dLLM decoder, our approach extends prompt inversion beyond reconstruction to downstream token-editing tasks, enabling faster, more transferable prompts that generalize across multiple T2I models.