Poster
Learning a Visual Lexicon from Diffusion Models
XuDong Wang · Xingyi Zhou · Alireza Fathi · Trevor Darrell · Cordelia Schmid
We present Visual Lexicon (ViLex), an image representation that encodes visual information in the text space while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex captures both rich semantic content and fine visual details, enabling high-quality image generation and visual scene understanding. Using a self-supervised learning pipeline, ViLex generates embeddings optimized for reconstructing the input image through a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity, semantic-level reconstruction. Because ViLex embeddings live in the text space, they can be used on their own as text tokens or combined with natural-language tokens for zero-shot multimodal image generation. ViLex is also compatible with downstream vision-language tasks such as visual question answering and referring expression segmentation, significantly enhancing performance. Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction than text-based embeddings, even when using a single token. ViLex also performs various DreamBooth tasks in a zero-shot manner without fine-tuning the T2I model, and serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks compared to a strong SigLIP baseline.
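To make the self-supervised objective in the abstract concrete, the sketch below shows the general shape of such a pipeline: a trainable vision encoder maps an image to token embeddings in the T2I model's text space, a frozen text-conditioned denoiser consumes those tokens as its only conditioning, and the sole training signal is the standard diffusion noise-prediction loss on the input image, so only the encoder is updated. This is a minimal illustration under assumed names and shapes (`VisionToTextTokens`, `FrozenT2IDenoiser`, the toy backbone, the simplified noise schedule, and the 16-token budget are all placeholders), not the authors' implementation.

```python
# Minimal sketch of a ViLex-style self-supervised training step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionToTextTokens(nn.Module):
    """Trainable encoder producing K embeddings in the T2I model's text-token space."""
    def __init__(self, img_dim=768, text_dim=768, num_tokens=16):
        super().__init__()
        # Toy stand-in for a ViT-style backbone (e.g., a SigLIP encoder) plus projection.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=16, stride=16),  # patchify
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, img_dim),
        )
        self.to_tokens = nn.Linear(img_dim, text_dim * num_tokens)
        self.num_tokens, self.text_dim = num_tokens, text_dim

    def forward(self, images):
        feats = self.backbone(images)                       # (B, img_dim)
        tokens = self.to_tokens(feats)                      # (B, K * text_dim)
        return tokens.view(-1, self.num_tokens, self.text_dim)


class FrozenT2IDenoiser(nn.Module):
    """Placeholder for a pretrained, frozen text-conditioned denoising network."""
    def __init__(self, latent_dim=4 * 32 * 32, text_dim=768):
        super().__init__()
        self.net = nn.Linear(latent_dim + text_dim, latent_dim)

    def forward(self, noisy_latents, timesteps, text_embeds):
        # A real UNet would use cross-attention and the timestep; this stand-in
        # just pools the conditioning tokens and ignores the timestep.
        cond = text_embeds.mean(dim=1)
        x = torch.cat([noisy_latents.flatten(1), cond], dim=-1)
        return self.net(x).view_as(noisy_latents)


def vilex_training_step(encoder, denoiser, images, latents, optimizer):
    """Predict the noise added to the image latents, conditioned only on the
    encoder's pseudo-text tokens; gradients update the encoder alone."""
    tokens = encoder(images)                                # ViLex tokens in text space
    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.size(0),), device=latents.device)
    # Simplified forward-diffusion mixing; a real pipeline uses the scheduler's alphas.
    alpha = (1.0 - t.float() / 1000).view(-1, 1, 1, 1)
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise
    pred = denoiser(noisy, t, tokens)
    loss = F.mse_loss(pred, noise)                          # standard noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    encoder = VisionToTextTokens()
    denoiser = FrozenT2IDenoiser()
    for p in denoiser.parameters():                         # T2I model stays frozen
        p.requires_grad_(False)
    opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
    imgs = torch.randn(2, 3, 256, 256)
    lats = torch.randn(2, 4, 32, 32)                        # would come from a frozen VAE
    print(vilex_training_step(encoder, denoiser, imgs, lats, opt))
```

Because the resulting tokens live in the same space as text embeddings, the same frozen T2I model can, at inference time, be conditioned on ViLex tokens alone, or on a mix of ViLex and natural-language tokens, which is what enables the zero-shot multimodal generation and DreamBooth-style use described above.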