A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
Abstract
Controllable hand image generation aims to synthesize geometrically accurate hand images with consistent appearance. Recently, diffusion models have achieved remarkable success in image generation and have been applied to hand image synthesis. Existing methods inject control signals with fixed strength across all denoising timesteps, either through input-level fusion or feature-level modulation. However, this static modulation ignores the progressive nature of the denoising process. In this paper, we reveal that the modulation of control signals depends on both the denoising state and the complexity of the conditions. Moreover, because the denoising latents and the control conditions have distinct semantic distributions and information densities, achieving effective interaction between these heterogeneous representations remains challenging. To address these issues, we propose a Temporal and Content Co-Awareness Latent Diffusion method that introduces a temporal- and content-driven modulation mechanism for controllable hand image generation. To achieve temporal and content co-awareness among the heterogeneous representations, we design a query-based interaction mechanism that mitigates information redundancy and aligns their semantic distributions. Leveraging this cross-domain interaction, the model infers the control information required at the current denoising state and dynamically adjusts the injection strengths of pose and appearance. To obtain a stable appearance representation from multi-pose images of the same identity, we design a Pose-Invariant Appearance Encoder that captures both global appearance consistency and local texture details. Furthermore, we employ feature orthogonal decomposition to mitigate pose leakage into the appearance subspace. Both quantitative and qualitative experimental results demonstrate the superiority of our method over state-of-the-art approaches.