Bias at the End of the Score
Abstract
Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems, where they are used during pretraining and finetuning of models, test-time optimization, and post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g., reward hacking or mode collapse during training), their robustness and fairness as scoring functions remain largely unknown. We conduct a large-scale audit of RMs' robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that, although originally developed as quality measures, RMs encode demographic biases that cause reward-guided optimization to sexualize images of women (especially darker-skinned women), reinforce gender and racial stereotypes, and collapse demographic diversity. These findings highlight the shortcomings of current RMs, challenging their reliability as quality metrics and underscoring the critical need for alternative data collection, training, and optimization procedures to establish more robust scoring.