Abstract:
Understanding and reasoning over visual relationships—spatial, functional, interactional, social, etc.—is considered a fundamental component of human cognition. Yet, despite major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships remains a challenge. We introduce Robin: an MLM instruction-tuned with densely annotated relationships that is capable of constructing high-quality dense scene graphs at scale. To train Robin, we curate SVG, a scene-graph-based instruction-tuning dataset containing $33K$ images with $855K$ relationships for $170K$ objects, created by completing the missing relations in existing scene graphs using GPT-4V and a carefully designed filtering process that combines rule-based and model-based techniques to ensure high quality. To generate more accurate and richer scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework in which GPT-4o refines Robin's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. Results show that our Robin-3B model, despite being trained on fewer than $3$ million instances, outperforms similar-size models trained on over $300$ million instances on relationship understanding benchmarks, and even surpasses larger models of up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of $88.2$, surpassing the previous best of $87.4$. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.