Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Jiasen Lu · Christopher Clark · Sangho Lee · Zichen Zhang · Savya Khosla · Ryan Marten · Derek Hoiem · Aniruddha Kembhavi

Award: Highlight
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following novel instructions. Unified-IO 2 can take text, images, audio, and/or video as input and can generate text, image, or audio outputs. It does so in a unified way by tokenizing these different inputs and outputs into a shared semantic space that is then processed by a single encoder-decoder transformer model. Unified-IO 2 is trained from scratch on a custom-built multimodal pre-training corpus and then learns an expansive set of skills through fine-tuning on over 120 datasets, including datasets for segmentation, object detection, image editing, audio localization, video tracking, embodied AI, and 3D detection. To facilitate instruction-following, we add prompts and other data augmentations to these tasks so that Unified-IO 2 can generalize these skills to new tasks zero-shot. Unified-IO 2 is the first model to be trained on such a diverse and wide-reaching set of skills and to unify three separate generation capabilities. Unified-IO 2 achieves state-of-the-art performance on the multi-task GRIT benchmark and strong results on 30 diverse datasets, including SEED-Bench image and video understanding, TIFA image generation, VQA 2.0, ScienceQA, VIMA robotic manipulation, VGG-Sound, and Kinetics-Sounds. It can also perform unseen tasks and generate free-form responses. We release our model and code to facilitate future work.
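As a rough illustration of the idea of mapping several modalities into one token sequence for a single sequence model, the toy sketch below shifts modality-local token ids into disjoint ranges of one shared vocabulary and concatenates them. This is not the released Unified-IO 2 code: the names `VOCAB_RANGES`, `to_shared_ids`, and `build_sequence`, the vocabulary sizes, and the use of disjoint id ranges are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each modality owns a disjoint id range inside one
# shared vocabulary. The sizes are invented for this example.
VOCAB_RANGES: Dict[str, range] = {
    "text": range(0, 1000),
    "image": range(1000, 2000),
    "audio": range(2000, 3000),
}


def to_shared_ids(modality: str, local_ids: List[int]) -> List[int]:
    """Shift modality-local token ids into the shared vocabulary."""
    id_range = VOCAB_RANGES[modality]
    for token_id in local_ids:
        if not 0 <= token_id < len(id_range):
            raise ValueError(f"{modality} token id {token_id} is out of range")
    return [id_range.start + token_id for token_id in local_ids]


def build_sequence(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Concatenate all segments into one sequence for a single model."""
    sequence: List[int] = []
    for modality, local_ids in segments:
        sequence.extend(to_shared_ids(modality, local_ids))
    return sequence


# Toy input: two text tokens, two image tokens, one audio token.
example = build_sequence([("text", [5, 7]), ("image", [0, 3]), ("audio", [1])])
print(example)
```

In this sketch, the sequence model would see one flat list of ids regardless of which modality each token came from. A real system would obtain the local ids from learned tokenizers or encoders rather than from hand-written lists.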

