On Scaling Up a Multilingual Vision and Language Model
Xi Chen ⋅ Josip Djolonga ⋅ Piotr Padlewski ⋅ Basil Mustafa ⋅ Soravit Changpinyo ⋅ Jialin Wu ⋅ Carlos Riquelme Ruiz ⋅ Sebastian Goodman ⋅ Xiao Wang ⋅ Yi Tay ⋅ Siamak Shakeri ⋅ Mostafa Dehghani ⋅ Daniel Salz ⋅ Mario Lučić ⋅ Michael Tschannen ⋅ Arsha Nagrani ⋅ Hexiang Hu ⋅ Mandar Joshi ⋅ Bo Pang ⋅ Ceslee Montgomery ⋅ Paulina Pietrzyk ⋅ Marvin Ritter ⋅ AJ Piergiovanni ⋅ Matthias Minderer ⋅ Filip Pavetic ⋅ Austin Waters ⋅ Gang Li ⋅ Ibrahim Alabdulmohsin ⋅ Lucas Beyer ⋅ Julien Amelot ⋅ Kenton Lee ⋅ Andreas Steiner ⋅ Yang Li ⋅ Daniel Keysers ⋅ Anurag Arnab ⋅ Yuanzhong Xu ⋅ Keran Rong ⋅ Alexander Kolesnikov ⋅ Mojtaba Seyedhosseini ⋅ Anelia Angelova ⋅ Xiaohua Zhai ⋅ Neil Houlsby ⋅ Radu Soricut
2024 Poster
Abstract
We explore the limits of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding, few-shot (in-context) learning, object detection, video question answering, and video captioning. It advances the state of the art on most of the 20+ vision-and-language benchmarks considered. Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly present in the training mixture.