Paper in Workshop: Test-time Scaling for Computer Vision
TTGen: Incorporating Test-time Scaling to Diffusion Models
Yuming Qiao · Yuechen Wang · Xudong Zhang · Dan Meng
Test-time scaling has demonstrated significant potential for improving the performance of Large Language Models. Beyond the recent surge in Reinforcement Learning based approaches that strengthen models' self-reasoning capabilities, techniques such as Monte Carlo Tree Search (MCTS) sampling and Best-of-N (BoN) sampling have also substantially improved the quality of model outputs, advancing the field of Artificial General Intelligence (AGI). Visual generation, exemplified by diffusion models, is a critical domain within AGI, yet little research has explored integrating test-time scaling with diffusion models. Motivated by this gap and by the inherent compatibility between sampling-based test-time scaling and diffusion-based generation, we introduce TTGen, a novel framework that integrates sampling-based test-time scaling methods with diffusion models. TTGen operates through a three-step process: (1) Sampling a clean latent within each step, where the clean latent for the current step is sampled based on the predicted noise. (2) Refining step-wise prompts, where the current step's clean latent and the initial query are fed into an LVLM (Large Vision-Language Model), which is prompted to output several revised prompts based on the misalignment between the current clean latent and the initial query. The revised prompts are then used to guide the denoising direction. (3) BoN selection, where at each sampling step the best diffusion trajectory is progressively selected based on the CLIP score between the revised latent and the initial prompt, thereby enhancing the quality of image generation. Finally, we conduct a series of experiments to validate the effectiveness of TTGen. Notably, the generation results of TTGen show a 7.1% improvement in CLIP score and a 13.8% improvement in FID compared to direct sampling.
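The per-step loop described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation. It assumes the standard DDPM relation for estimating a clean latent from a noisy latent and the predicted noise, uses plain cosine similarity as a stand-in for the CLIP score, and treats latents directly as embeddings. The function names (`predict_clean_latent`, `best_of_n`, `select_step`) and the `target_embedding` argument are hypothetical, and the LVLM prompt-refinement step is represented only by the list of candidate noise predictions, one per revised prompt.

```python
import math


def predict_clean_latent(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean latent x0 from a noisy latent x_t.

    Uses the standard DDPM relation
        x0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t).
    """
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(x - sigma * e) / scale for x, e in zip(x_t, eps_pred)]


def cosine_similarity(a, b):
    """Cosine similarity; here a stand-in (assumption) for a CLIP score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_of_n(candidates, score_fn):
    """Return (index, candidate) of the highest-scoring candidate."""
    scores = [score_fn(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best]


def select_step(x_t, eps_candidates, alpha_bar_t, target_embedding):
    """One toy selection step.

    eps_candidates holds one noise prediction per revised prompt
    (hypothetical: in practice these would come from the denoiser
    conditioned on each LVLM-revised prompt).
    """
    clean = [predict_clean_latent(x_t, e, alpha_bar_t) for e in eps_candidates]
    return best_of_n(clean, lambda z: cosine_similarity(z, target_embedding))


# Toy usage: two candidate noise predictions for the same noisy latent.
x_t = [1.1, 1.3]
eps_candidates = [[0.5, -0.5], [-0.5, 0.5]]
idx, x0_hat = select_step(x_t, eps_candidates, 0.64, target_embedding=[1.0, 2.0])
```

In a real pipeline the scoring function would decode the clean latent to an image and compare it with the initial prompt using a CLIP model; the list-based vectors here only make the control flow of sampling, scoring, and selection easy to follow.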