Foundation Encoders Are All You Need for Preference-Aware Personalization
Abstract
Personalized image generation based on user behaviors reflects individual preferences with minimal user intervention. However, existing studies often suffer from inaccurate preference profiling, high resource costs, and model-specific designs, which jointly restrict creativity, diversity, and generality. To address these limitations, we propose FANG, a novel approach that enables personalization using only foundation encoders, without additional structures. FANG performs tailored profiling to capture user preferences, and reconstructs transformer-based encoders to integrate these preferences while preserving target fidelity. Experiments show that FANG achieves robust, high-quality personalization across various foundation text-to-image models and applications (e.g., CLIP retrieval, unCLIP, vision-language models), integrating seamlessly into diverse encoders without fine-tuning.