

Poster

FastVLM: Efficient Vision Encoding for Vision Language Models

Pavan Kumar Anasosalu Vasu · Fartash Faghri · Chun-Liang Li · Cem Koc · Nate True · Gokula Krishnan Santhanam · Albert Antony · James Gabriel · Peter Grasch · Oncel Tuzel · Hadi Pouransari


Abstract: Vision Language Models (VLMs) like LLaVA encode images into tokens aligned to the word embedding space of the LLM decoder. Scaling input image resolution is essential for improving performance, especially in text-rich image understanding tasks. However, popular visual encoders such as CLIP-pretrained ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. In this work, we introduce FastVLM, which achieves an optimized trade-off between resolution, latency, and accuracy by incorporating FastViTHD—a new hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time while processing high-resolution images. We provide a comprehensive efficiency analysis of the interplay between image resolution, vision latency, number of visual tokens, and LLM size. In the LLaVA-1.5 setup, we achieve a 3.2× improvement in overall time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. On text-rich evaluations like TextVQA and DocVQA, FastVLM obtains +8.4% and +12.5% better accuracy than ConvLLaVA at a similar operating point of 144 visual tokens. Compared to LLaVA-OneVision at the highest resolution (1152×1152), FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same LLM, but with 85× faster TTFT, 3× less vision instruction tuning data, and a vision encoder that is 3.4× smaller.
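The two optimization axes the abstract describes can be made concrete with a minimal latency model: TTFT is roughly the vision-encoder latency plus the LLM prefill time, which grows with the number of visual tokens. The sketch below is illustrative only; the function and all numeric values are hypothetical placeholders, not measurements from the paper.

```python
# Illustrative sketch (not from the paper): TTFT modeled as
# vision-encoder latency plus LLM prefill latency over visual tokens.
# All numbers are hypothetical placeholders for the trade-off discussed above.

def ttft_ms(vision_latency_ms: float,
            num_visual_tokens: int,
            prefill_ms_per_token: float) -> float:
    """TTFT = vision encoding time + LLM prefill time over visual tokens."""
    return vision_latency_ms + num_visual_tokens * prefill_ms_per_token

# A slower encoder emitting many tokens vs. a faster encoder emitting fewer:
baseline = ttft_ms(vision_latency_ms=200.0, num_visual_tokens=576,
                   prefill_ms_per_token=0.5)
fast = ttft_ms(vision_latency_ms=60.0, num_visual_tokens=144,
               prefill_ms_per_token=0.5)
```

Under this toy model, both a faster encoder and fewer output tokens reduce TTFT, which is why FastViTHD targets both axes at once.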
