Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
Xin Li ⋅ Shujun Tian ⋅ Tao Lu ⋅ Han Bao ⋅ Zonghui Wang ⋅ Chen
Abstract
Diffusion models (DMs) have recently achieved remarkable success across diverse modalities, including high-fidelity image and video synthesis. However, their inherently sequential, step-by-step denoising process introduces substantial cumulative latency, which significantly degrades user experience. While existing multi-GPU parallelization methods can alleviate this latency, they often incur prohibitive GPU–GPU communication overhead, offsetting much of the performance gain. We present Otil (Only Transmit Informative Latents), a communication-efficient parallel framework for accelerating diffusion inference that minimizes redundant data exchange across GPUs while preserving generation quality. Our key insight is that latent activations change only marginally between consecutive denoising steps. Leveraging this property, Otil identifies and synchronizes only the most informative latent sub-blocks and introduces a dynamic polling mechanism that periodically revisits all spatial regions, ensuring complete coverage without unnecessary communication. The framework is fully plug-and-play and remains compatible with fast-sampling and architectural acceleration algorithms, without requiring any retraining or architectural modification. Otil reduces GPU–GPU communication by up to 87.5\% compared with state-of-the-art parallelism methods, achieving a $1.8\times$ speedup on two GPUs with Stable Diffusion v1.5 and $2.6\times$ on four GPUs with Stable Diffusion XL. When combined with few-step samplers ($30$ steps) and LoRA models, the acceleration further increases to $2.46\times$–$2.84\times$ on two GPUs. These results demonstrate the strong potential of Otil for scalable and efficient multi-GPU diffusion inference while preserving generation fidelity.
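The following is a minimal sketch, not the authors' implementation, of the selection mechanism the abstract describes: only the latent sub-blocks whose activations changed most since the previous denoising step are chosen for synchronization, while a rotating polling offset guarantees that every spatial block is refreshed periodically. The block size, keep ratio, and polling period used here are illustrative assumptions.

```python
import torch

def select_informative_blocks(prev_latent, curr_latent, block=8,
                              keep_ratio=0.125, step=0, poll_period=4):
    """Return indices of latent sub-blocks worth synchronizing across GPUs.

    Hypothetical helper for illustration only; parameter values are assumptions.
    """
    # Measure per-block change between consecutive denoising steps.
    diff = (curr_latent - prev_latent).abs()
    b, c, h, w = diff.shape
    blocks = diff.reshape(b, c, h // block, block, w // block, block)
    change = blocks.mean(dim=(0, 1, 3, 5)).flatten()  # one score per spatial block

    # Keep only the most informative blocks (largest change).
    k = max(1, int(keep_ratio * change.numel()))
    keep = torch.topk(change, k).indices

    # Polling: each step also force-refreshes a rotating subset of blocks,
    # so no spatial region goes without an update indefinitely.
    polled = torch.arange(step % poll_period, change.numel(), poll_period,
                          device=change.device)
    return torch.unique(torch.cat([keep, polled]))
```

In a multi-GPU setting, only the blocks returned by such a selector would be exchanged between devices at each step, which is how a scheme of this kind can cut communication volume without altering the model or the sampler.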