DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation; Bing Wang
Abstract
Text-to-image diffusion models have achieved remarkable progress, generating visually realistic and semantically coherent images from textual prompts. However, natural language alone lacks the precision required for design-centric applications that demand strict spatial and structural fidelity, particularly when representing complex concepts that integrate multi-level information, such as product or scene design. To address this limitation, controllable diffusion frameworks introduce auxiliary conditions (e.g., depth, edge, or reference images) to guide the generative process. Models such as ControlNet and IP-Adapter inject these priors effectively, improving structural or appearance alignment. Yet real-world design tasks rarely depend on a single type of condition: they often require the simultaneous integration of multiple heterogeneous cues, for instance preserving spatial layout from depth maps, structural outlines from edge maps, and stylistic attributes from reference images. Current approaches either handle only one condition or naively stack several, incurring computational inefficiency and conflicting guidance that degrade generation quality. This multi-condition inconsistency is a critical bottleneck for applying diffusion models to real-world design workflows and motivates our proposed framework. We propose DynFusion, a data-driven adaptive condition fusion mechanism for multi-conditional diffusion. Our method introduces a novel condition adaptation module that dynamically selects and fuses subsets of conditions based on the diffusion timestep, task characteristics, and feature injection position. This adaptive strategy harmonizes diverse structural and appearance priors, achieving controllable yet flexible generation in complex design scenarios. Experiments demonstrate significant improvements in fidelity, consistency, and controllability across multi-condition tasks, establishing a new direction for practical, detail-preserving diffusion-based design generation.
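To make the fusion idea concrete, the PyTorch sketch below shows one plausible way per-condition features could be weighted by a gate conditioned on the diffusion timestep and the feature injection position. It is a minimal illustration under stated assumptions, not the DynFusion architecture: all module names, tensor shapes, and the softmax gating choice are hypothetical.

import torch
import torch.nn as nn

class ConditionFusionGate(nn.Module):
    """Illustrative sketch: fuse features from several conditions (e.g., depth,
    edge, reference image) using weights predicted from the diffusion timestep
    and the injection position. Not the paper's implementation."""

    def __init__(self, num_conditions: int, feat_dim: int, num_layers: int, t_dim: int = 128):
        super().__init__()
        # Project the timestep embedding and embed the injection-layer index.
        self.t_embed = nn.Sequential(nn.Linear(t_dim, feat_dim), nn.SiLU())
        self.layer_embed = nn.Embedding(num_layers, feat_dim)
        # Predict one weight per condition from the (timestep, position) context.
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.SiLU(),
            nn.Linear(feat_dim, num_conditions),
        )

    def forward(self, cond_feats: torch.Tensor, t_emb: torch.Tensor, layer_idx: torch.Tensor):
        # cond_feats: (B, num_conditions, C, H, W) features from the condition branches
        # t_emb:      (B, t_dim) timestep embedding
        # layer_idx:  (B,) index of the injection position in the denoising network
        ctx = self.t_embed(t_emb) + self.layer_embed(layer_idx)   # (B, feat_dim)
        weights = torch.softmax(self.gate(ctx), dim=-1)           # (B, num_conditions)
        weights = weights.view(*weights.shape, 1, 1, 1)           # broadcast over C, H, W
        return (weights * cond_feats).sum(dim=1)                  # fused feature (B, C, H, W)

In an actual multi-conditional pipeline, the fused feature would then be injected at the corresponding position of the denoising network, analogous to how ControlNet adds residual control features; soft gating is only one possible realization of "selecting and fusing subsets of conditions."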