

Poster

Generative Multimodal Models are In-Context Learners

Quan Sun · Yufeng Cui · Xiaosong Zhang · Fan Zhang · Qiying Yu · Yueze Wang · Yongming Rao · Jingjing Liu · Tiejun Huang · Xinlong Wang


Abstract:

Humans can easily solve multimodal tasks in context, given only a few demonstrations or simple instructions, something that current multimodal systems largely struggle to imitate. In this work, we demonstrate that effectively scaling up generative multimodal models significantly enhances their task-agnostic in-context learning capabilities. We introduce Emu2, a generative multimodal model with 37 billion parameters, which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in the few-shot setting, but can also be instruction-tuned to follow specific instructions for tasks such as visual question answering and object-grounded image generation. Emu2 even exhibits emergent abilities on tasks that require on-the-fly reasoning, such as visual prompting, which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve, and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research.
