SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
Abstract
Bridging the modality gap between infrared and visible imagery is critical for cross-modal understanding and for enriching multimodal benchmarks. However, existing approaches remain confined to one-to-one mappings and are typically evaluated in unidirectional or closed-set scenarios. To address this challenge, we present SynthRGB-T, a unified framework for diverse, bidirectional image translation. Specifically, we formulate image translation as a vision-language guided denoising diffusion process, enabling flexible conditioning and open-world generalization. To enhance semantic alignment, we introduce a Visual Grounding Pipeline (VGP) that exploits the world knowledge of foundation models for fine-grained translation guidance. During the diffusion process, we adopt a decoupling injection strategy to alleviate interference among multiple guidance signals. In addition, a Dual Conditional Cross-Attention (DCCA) module is designed to facilitate collaborative representation learning in latent space. SynthRGB-T is simple and versatile, capable of synthesizing diverse, high-fidelity data that substantially extends the multimodal resources available to the community. Comprehensive evaluations on multiple real-world benchmarks confirm that SynthRGB-T delivers superior performance and greater visual diversity than existing approaches. All code, models, and large-scale synthetic datasets will be released with the camera-ready version.
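To make the dual-conditioning idea concrete, the following is a minimal PyTorch sketch of a block in the spirit of the DCCA module: latent tokens attend separately to two conditioning streams (e.g., prompt embeddings and grounding-derived embeddings), and the two attended results are fused back into the latent. All dimensions, the CLIP-like embedding shapes, and the gated residual fusion rule are assumptions for illustration, not the paper's actual specification.

```python
import torch
import torch.nn as nn

class DualConditionalCrossAttention(nn.Module):
    """Illustrative sketch of a dual conditional cross-attention block.

    Latent tokens attend to two conditioning streams via separate
    cross-attention branches, whose outputs are fused through an
    assumed learned gate (hypothetical design choice).
    """

    def __init__(self, dim: int = 320, cond_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # One cross-attention branch per conditioning stream.
        self.attn_text = nn.MultiheadAttention(
            dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        self.attn_ground = nn.MultiheadAttention(
            dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True)
        # Learned scalar gate balancing the two branches (assumption).
        self.gate = nn.Parameter(torch.tensor(0.5))

    def forward(self, z, text_emb, ground_emb):
        # z: (B, N, dim) latent tokens; *_emb: (B, M, cond_dim) conditions.
        q = self.norm(z)
        a_text, _ = self.attn_text(q, text_emb, text_emb)
        a_ground, _ = self.attn_ground(q, ground_emb, ground_emb)
        g = torch.sigmoid(self.gate)
        # Residual fusion of the two guidance streams into the latent.
        return z + g * a_text + (1.0 - g) * a_ground


# Usage: fuse text and grounding guidance into diffusion latents.
z = torch.randn(2, 64, 320)        # latent tokens
text = torch.randn(2, 77, 768)     # e.g., CLIP-style text embeddings
ground = torch.randn(2, 16, 768)   # grounding-derived region embeddings
out = DualConditionalCrossAttention()(z, text, ground)
print(out.shape)  # torch.Size([2, 64, 320])
```

Keeping the two conditioning streams in separate attention branches, rather than concatenating their tokens, is one plausible reading of the decoupling injection strategy described above: each guidance source gets its own key/value projection, so the streams do not compete within a single attention map.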