

Poster

Synthetic Visual Genome

Jae Sung Park · Zixian Ma · Linjie Li · Chenhao Zheng · Cheng-Yu Hsieh · Ximing Lu · Khyathi Chandu · Quan Kong · Norimasa Kobori · Ali Farhadi · Yejin Choi · Ranjay Krishna


Abstract: Understanding and reasoning over visual relationships—spatial, functional, interactional, social, etc.—is considered a fundamental component of human cognition. Yet, despite major advances in visual comprehension by multimodal language models (MLMs), precise reasoning over relationships remains a challenge. We introduce Robin, an MLM instruction-tuned with densely annotated relationships that can construct high-quality dense scene graphs at scale. To train Robin, we curate SVG, a scene-graph-based instruction-tuning dataset containing 33K images and 855K relationships over 170K objects, built by completing the missing relations in existing scene graphs with GPT-4V and a carefully designed filtering process that combines rule-based and model-based techniques to ensure high quality. To generate richer and more accurate scene graphs at scale for any image, we introduce SG-EDIT, a self-distillation framework in which GPT-4o refines Robin's predicted scene graphs by removing unlikely relations and suggesting relevant new ones. Results show that our Robin-3B model, despite being trained on fewer than 3 million instances, outperforms similar-sized models trained on over 300 million instances on relationship-understanding benchmarks, and even surpasses larger models of up to 13B parameters. Notably, it achieves state-of-the-art performance on referring expression comprehension with a score of 88.2, surpassing the previous best of 87.4. Our results suggest that training on refined scene graph data is crucial for maintaining strong performance across diverse visual reasoning tasks.
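
Below is a minimal, illustrative Python sketch of the predict-then-edit loop the abstract describes: a model proposes a dense scene graph for an image, and a stronger model edits it by dropping unlikely relations and adding missing ones. All names here (the SceneGraph/Relation schema, propose_scene_graph, refine_with_llm) are hypothetical placeholders, not the authors' implementation or data format; the actual system relies on Robin for prediction and GPT-4o for refinement.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative scene-graph types; field names are assumptions, not the paper's schema.
@dataclass
class Relation:
    subject: str      # e.g. "person"
    predicate: str    # e.g. "holding"
    obj: str          # e.g. "umbrella"

@dataclass
class SceneGraph:
    image_id: str
    objects: List[str]
    relations: List[Relation] = field(default_factory=list)

def propose_scene_graph(image_id: str) -> SceneGraph:
    """Stand-in for the prediction step (Robin in the paper): returns a draft
    scene graph for one image. Here the output is hard-coded for illustration."""
    return SceneGraph(
        image_id=image_id,
        objects=["person", "umbrella", "bench"],
        relations=[Relation("person", "holding", "umbrella")],
    )

def refine_with_llm(graph: SceneGraph) -> SceneGraph:
    """Stand-in for the editing step (GPT-4o in the paper): remove unlikely
    relations and suggest relevant missing ones. Both steps are placeholders."""
    plausible = [r for r in graph.relations if r.predicate]       # placeholder removal filter
    suggested = [Relation("person", "sitting on", "bench")]       # placeholder suggested relation
    return SceneGraph(graph.image_id, graph.objects, plausible + suggested)

if __name__ == "__main__":
    draft = propose_scene_graph("img_0001")
    refined = refine_with_llm(draft)
    for r in refined.relations:
        print(f"({r.subject}, {r.predicate}, {r.obj})")
```

In this sketch the refined graph would then serve as improved training data, mirroring the self-distillation idea: the edited graphs feed back into training so the predictor improves without large amounts of new human annotation.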
