White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation
Abstract
Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization remains challenging because learning-based models often overfit to the color response characteristics of the training camera, degrading performance on images captured by other cameras. We propose VLM-CC, a vision-language model (VLM)-guided framework that formulates color constancy as an iterative refinement process. Instead of estimating the illuminant directly from the raw input, VLM-CC performs iterative correction driven by VLM-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB; a lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. By reframing color constancy as a perceptual feedback problem and replacing direct RGB regression with VLM-guided evaluation, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets.
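A minimal sketch of the iterative correction loop described above, under several assumptions: `query_vlm_cast` stands in for the LoRA-tuned VLM evaluator and is assumed to return the dominant residual cast ('red', 'green', 'blue') or None; the white-balance and pseudo-sRGB helpers, step size, and convergence test are illustrative simplifications, not the paper's exact pipeline.

```python
import numpy as np

def apply_white_balance(raw, illum):
    # Von Kries-style correction: divide each channel by the estimated illuminant.
    return raw / illum.reshape(1, 1, 3)

def to_pseudo_srgb(img):
    # Simplified pseudo-sRGB rendering: normalize and apply a 1/2.2 gamma
    # (a stand-in for the camera-to-sRGB mapping, not the paper's exact transform).
    img = np.clip(img / img.max(), 0.0, 1.0)
    return img ** (1.0 / 2.2)

def vlm_cc(raw, query_vlm_cast, step=0.05, max_iters=10):
    """Iterative VLM-guided illuminant refinement (hypothetical sketch).

    query_vlm_cast(srgb) is assumed to return the dominant residual color
    cast as 'red', 'green', or 'blue', or None when the image looks neutral.
    """
    illum = np.array([1.0, 1.0, 1.0]) / 3.0      # start from a neutral estimate
    channel = {"red": 0, "green": 1, "blue": 2}

    for _ in range(max_iters):
        # White-balance with the current estimate, then render for VLM inspection.
        srgb = to_pseudo_srgb(apply_white_balance(raw, illum))
        cast = query_vlm_cast(srgb)              # VLM identifies the residual cast
        if cast is None:                         # no residual cast -> converged
            break
        # Push the illuminant estimate along the residual direction so the
        # next white-balance step suppresses that channel more strongly.
        illum[channel[cast]] *= (1.0 + step)
        illum /= illum.sum()                     # keep the estimate normalized

    return illum
```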