VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
Abstract
Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate model performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2000-task suite (1000 T2I, 1000 I2I) whose average instruction contains 18--22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art commercial APIs satisfy fewer than 72\% of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling plus semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization (GRPO), yielding shorter edit trajectories (3.1 vs.\ 4.2 steps) and stronger alignment. VisionDirector achieves a new state of the art on GenEval (+7\% overall) and ImgEdit (+0.07 absolute), with consistent qualitative improvements on typography, multi-object scenes, and pose editing. The code, benchmark, and evaluation scripts will be released.