Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition
Abstract
The performance of multimodal learning systems, particularly in high-stakes domains such as automated depression recognition, is fundamentally constrained by the difficulty of learning robust visual representations from limited and complex clinical data. To overcome this, we introduce Cross-Modal Guided Visual Synthesis (CMG-VS), a novel training framework that enriches the learning process from within by synthesizing new, task-relevant visual features. At its core, CMG-VS leverages the rich context of the audio and text modalities to guide a conditional generative model. This model learns the mapping from speech and language to visual expression, generating a diverse manifold of plausible visual behaviors that enriches the training distribution. Crucially, this synthesis is not a separate pre-processing step: through a task-guided joint optimization scheme, the generative process is dynamically steered by the downstream multimodal recognizer's performance. This closed-loop feedback ensures that the synthesized visual features are optimized to be maximally discriminative for the recognition task, rather than merely realistic. Comprehensive experiments on the widely used DAIC-WOZ and E-DAIC benchmark datasets show that CMG-VS significantly outperforms existing state-of-the-art methods across all standard regression and classification metrics. Ablation studies further confirm that the task-guided synthesis is the key driver of this performance gain, establishing it as an effective paradigm for robust multimodal representation learning.
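The abstract does not specify the exact losses, so the following is a minimal PyTorch sketch of how the task-guided joint optimization could be wired, assuming feature-level synthesis, an MSE reconstruction term, and a regression head (e.g., PHQ-8 score prediction, as is standard for DAIC-WOZ and E-DAIC). All module names, dimensions, and the trade-off weight `lambda_task` are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of CMG-VS-style task-guided joint optimization.
# A conditional generator synthesizes visual features from audio/text
# context; its gradient receives both a synthesis term and the
# downstream recognizer's task loss (the "closed-loop feedback").
import torch
import torch.nn as nn

class CondVisualGenerator(nn.Module):
    """Maps concatenated audio/text context (plus noise) to a visual feature."""
    def __init__(self, ctx_dim=256, noise_dim=64, vis_dim=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, vis_dim),
        )

    def forward(self, ctx):
        # Random noise induces a diverse set of plausible visual behaviors.
        z = torch.randn(ctx.size(0), self.noise_dim, device=ctx.device)
        return self.net(torch.cat([ctx, z], dim=-1))

class Recognizer(nn.Module):
    """Fuses audio/text context with (real or synthetic) visual features."""
    def __init__(self, ctx_dim=256, vis_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(ctx_dim + vis_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),  # e.g. PHQ-8 score regression
        )

    def forward(self, ctx, vis):
        return self.head(torch.cat([ctx, vis], dim=-1)).squeeze(-1)

gen, rec = CondVisualGenerator(), Recognizer()
opt = torch.optim.Adam(list(gen.parameters()) + list(rec.parameters()), lr=1e-4)
mse = nn.MSELoss()
lambda_task = 1.0  # assumed weighting between synthesis and task objectives

def train_step(ctx, vis_real, y):
    """One joint step: generator and recognizer are optimized together."""
    vis_fake = gen(ctx)
    # Synthesis term keeps generated features near the real visual manifold.
    loss_gen = mse(vis_fake, vis_real)
    # Task term on both real and synthetic features; its gradient flows
    # back into the generator, steering synthesis toward discriminative
    # features rather than merely realistic ones.
    loss_task = mse(rec(ctx, vis_real), y) + mse(rec(ctx, vis_fake), y)
    loss = loss_gen + lambda_task * loss_task
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with assumed dimensions: batch of 8, PHQ-8 targets in [0, 24].
ctx = torch.randn(8, 256)
vis = torch.randn(8, 128)
y = torch.rand(8) * 24
print(train_step(ctx, vis, y))
```

The key design point this sketch illustrates is that `loss_task` backpropagates through `vis_fake` into the generator, so synthesis is never decoupled from recognition; a standalone pre-trained generator would lack exactly this gradient path.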