Skip to yearly menu bar Skip to main content


InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen · Jiannan Wu · Wenhai Wang · Weijie Su · Guo Chen · Sen Xing · Zhong Muyan · Qing-Long Zhang · Xizhou Zhu · Lewei Lu · Bin Li · Ping Luo · Tong Lu · Yu Qiao · Jifeng Dai

Arch 4A-E Poster #412
[ ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT
Oral presentation: Orals 6C Multi-modal learning
Fri 21 Jun 1 p.m. PDT — 2:30 p.m. PDT


The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models.

Live content is unavailable. Log in and register to view live content