Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences
Abstract
Narrative image generation aims to create images featuring multiple distinct characters while capturing their interrelationships, posing significant challenges for current text-to-image diffusion models. As a result, general personalization methods often suffer from poor semantic alignment, identity blending, and aesthetic implausibility. These issues are inadequately captured by existing evaluation metrics such as CLIP, ArcFace, and conventional reward models, which fail to align with human perceptual preferences. To better align with human preferences, we first construct a fine-grained human preference dataset, NI-RLHF, by collecting both detailed human critiques and preference judgments across three core dimensions: prompt following, identity consistency, and visual quality. This comprehensive dataset enables the training of NIReward, a critique-based reward model capable of generating interpretable image evaluations. Building upon the interpretable reward signal from NIReward, we propose Adaptive Dominance-based Preference Optimization (ADPO), which balances learning across diverse preference dimensions while dynamically adapting to reward margins. Experimental results show that NIReward significantly outperforms existing evaluation metrics and reward models, and that ADPO yields substantial improvements across all three key preference dimensions. By introducing NIReward and ADPO, our work paves the way for generating narrative images aligned with actual human preferences.