Poster

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Bingda Tang · Sayak Paul · Boyang Zheng · Saining Xie


Abstract:

Recent advances in text-to-image synthesis have delivered impressive results, yet existing approaches still struggle to align with complex prompts. While decoder-only Large Language Models (LLMs) excel at handling such intricate texts, their integration with text-to-image generative models remains unsatisfactory. The rise of Diffusion Transformers (DiTs) presents a promising path forward via deep fusion with LLMs. In this work, we explore this deep fusion for text-to-image synthesis by replacing the text-stream Transformer in the MM-DiT model with an LLM, establishing shared self-attention between the LLM and DiT models. This design better aligns with the training objective and inference nature of both autoregressive and diffusion models, bridging the gap between the two paradigms. We empirically examine the design space of this approach and demonstrate its effectiveness through extensive experiments. We hope the positive evidence will kindle interest in this approach and inspire reflection on the pursuit of utilizing LLMs for text-to-image synthesis.
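To make the "shared self-attention between the LLM and DiT" idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation): each stream keeps its own QKV and output projections in MM-DiT style, but the projected tokens are concatenated so text and image tokens attend to one another in a single attention operation. The class and parameter names (`SharedSelfAttentionBlock`, `text_qkv`, `image_qkv`) are illustrative assumptions, not identifiers from the paper or its code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSelfAttentionBlock(nn.Module):
    """Illustrative joint self-attention over LLM text tokens and DiT image tokens.

    Each modality has its own QKV and output projections (MM-DiT style),
    but queries/keys/values are concatenated so attention is shared
    across the two streams. This is a sketch, not the paper's code.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projections for the text (LLM) and image (DiT) streams.
        self.text_qkv = nn.Linear(dim, 3 * dim)
        self.image_qkv = nn.Linear(dim, 3 * dim)
        self.text_out = nn.Linear(dim, dim)
        self.image_out = nn.Linear(dim, dim)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        # text_tokens:  (B, T_text, dim) hidden states from an LLM layer
        # image_tokens: (B, T_img,  dim) latent patch tokens from a DiT layer
        b, t_text, _ = text_tokens.shape

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            return x.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q_t, k_t, v_t = self.text_qkv(text_tokens).chunk(3, dim=-1)
        q_i, k_i, v_i = self.image_qkv(image_tokens).chunk(3, dim=-1)

        # Concatenate both streams so attention is shared across modalities.
        q = split_heads(torch.cat([q_t, q_i], dim=1))
        k = split_heads(torch.cat([k_t, k_i], dim=1))
        v = split_heads(torch.cat([v_t, v_i], dim=1))

        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(b, -1, self.num_heads * self.head_dim)

        # Route the attended tokens back to their own streams.
        text_out = self.text_out(attn[:, :t_text])
        image_out = self.image_out(attn[:, t_text:])
        return text_out, image_out
```

In this sketch, stacking such blocks layer-by-layer would let the frozen-or-finetuned LLM hidden states and the DiT image tokens interact at every depth, rather than only through a single cross-attention over final text embeddings.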
