Toward Early Quality Assessment of Text-to-Image Diffusion Models
Abstract
Recent text-to-image (T2I) diffusion models can produce highly realistic images from natural language prompts. In practice, users typically generate multiple candidates and select only a small subset for downstream use, guided by automatic metrics such as CLIPScore and ImageReward. This post-hoc assessment is resource-intensive, however: each image must complete dozens to hundreds of denoising steps before its quality can be judged, so substantial computation is wasted on low-quality samples. To address this issue, we propose \textbf{Probe-Select}, a plug-in framework for early quality assessment in T2I generation. Our key observation is that certain intermediate features within the denoiser—often as early as 20\% of the way through the reverse process—already encode stable structural cues (e.g., object layout, spatial composition, and color harmony) that correlate strongly with final image fidelity. Building on this observation, Probe-Select attaches lightweight probes to these stable activations at an early checkpoint and trains them to align with external evaluators. During inference, the probes forecast image quality on the fly, enabling early pruning of unpromising trajectories so that computation is concentrated on the promising ones. Experiments on MS-COCO across multiple generative backbones show that this early assessment mechanism reduces sampling cost by over 60\% while also improving the quality of the generated images, demonstrating that early structural signals can effectively guide efficient text-to-image generation.
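The probe-and-prune mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual design: the linear probe, global-average pooling, and `keep_ratio` parameter are assumptions chosen for clarity, and the random "activations" stand in for real denoiser features at the early checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def quality_probe(features, w, b):
    """Hypothetical linear probe: pooled denoiser features -> scalar quality score."""
    pooled = features.mean(axis=(1, 2))  # global average pool over spatial dims -> (C,)
    return float(pooled @ w + b)

def prune_candidates(feature_maps, w, b, keep_ratio=0.25):
    """Score every candidate trajectory at the early checkpoint and keep the top fraction."""
    scores = [quality_probe(f, w, b) for f in feature_maps]
    k = max(1, int(len(scores) * keep_ratio))
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return keep, scores

# Stand-in early-checkpoint activations for 8 candidate trajectories (C x H x W each).
C, H, W = 16, 8, 8
feats = [rng.standard_normal((C, H, W)) for _ in range(8)]
w, b = rng.standard_normal(C), 0.0  # probe weights would be trained against an external evaluator

kept, scores = prune_candidates(feats, w, b, keep_ratio=0.25)
print(kept)  # indices of the trajectories that continue denoising; the rest are pruned
```

Only the kept trajectories run the remaining ~80\% of denoising steps, which is where the reported sampling-cost savings would come from.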