

Poster

Generative Multimodal Models are In-Context Learners

Quan Sun · Yufeng Cui · Xiaosong Zhang · Fan Zhang · Qiying Yu · Yueze Wang · Yongming Rao · Jingjing Liu · Tiejun Huang · Xinlong Wang


Abstract:

Humans can easily solve multimodal tasks in context, given only a few demonstrations or simple instructions, something that current multimodal systems largely struggle to imitate. In this work, we demonstrate that effectively scaling up generative multimodal models significantly enhances their task-agnostic in-context learning capabilities. We introduce Emu2, a generative multimodal model with 37 billion parameters, which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in the few-shot setting, but can also be instruction-tuned to follow specific instructions for tasks such as visual question answering and object-grounded image generation. Emu2 even exhibits emergent abilities on tasks that require on-the-fly reasoning, such as visual prompting, which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve, and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.
