

Poster

The Photographer's Eye: Teaching Multimodal Large Language Models to See, Think and Critique Like Photographers

Daiqing Qi · Handong Zhao · Jing Shi · Simon Jenni · Yifei Fan · Franck Dernoncourt · Scott Cohen · Sheng Li


Abstract:

Photographer, curator, and former director of photography at the Museum of Modern Art (MoMA), John Szarkowski remarked in William Eggleston’s Guide, “While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky.” Szarkowski’s remark reveals a notable gap between general and aesthetic visual understanding: the former emphasizes identifying factual elements in an image (the sky), whereas the latter goes beyond object identification and treats the same region as an aesthetic component, an expanse of blue valued purely as a color block. This distinction between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) poses a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, a capability increasingly needed in real-world applications ranging from image recommendation and enhancement to generation. To advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and distinguished by its large scale, expertise, and diversity. Additionally, we propose a new model, PhotoEye, an MLLM featuring a language-guided multi-view vision fusion mechanism for understanding image aesthetics from multiple perspectives. Finally, we introduce PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. Our model demonstrates significant advantages over both open-source and commercial models on existing benchmarks and PhotoBench.
