Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
Abstract
Large-scale vision–language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical to deploy on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to the large size gap between them: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking teacher and reinforcing student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks out the non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the masked weights to gradually increase the teacher's capacity during training. This strategy allows the student to learn the teacher's richer representations in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of generated responses, and a distillation reward that quantifies how easily those responses transfer from teacher to student. Unlike online think–answer RL paradigms, which are computationally expensive and generate lengthy responses, our offline RL leverages responses pre-generated by the masked teachers. These provide rich yet efficient guidance, enabling the student to achieve strong performance without the think–answer process. Extensive experiments across diverse VLM benchmarks demonstrate that Masters outperforms existing compact VLMs, and on some benchmarks even surpasses large ones, while being far more efficient. Moreover, gradually increasing the teacher size during distillation (e.g., from 14B to 38B) yields smoother convergence and stronger generalization than one-shot distillation from the largest teacher (e.g., 38B), revealing a scalable path toward efficient and deployable VLMs.
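To make the two mechanisms in the abstract concrete, the sketch below gives one plausible reading in plain Python/PyTorch: magnitude-based masking of non-dominant teacher weights, a schedule that progressively restores the teacher, and a blend of the accuracy and distillation rewards. All names (mask_teacher, keep_ratio_schedule, combined_reward), the magnitude-based masking criterion, the linear schedule, and the mixing weight beta are illustrative assumptions, not the paper's actual implementation.

import torch

def mask_teacher(teacher_state, keep_ratio):
    # Assumed criterion: "non-dominant" = smallest-magnitude weights.
    # Keep the top keep_ratio fraction of each tensor by absolute value
    # and zero out the rest.
    masked = {}
    for name, w in teacher_state.items():
        k = max(1, int(keep_ratio * w.numel()))
        # k-th largest magnitude == (numel - k + 1)-th smallest
        threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
        masked[name] = torch.where(w.abs() >= threshold, w, torch.zeros_like(w))
    return masked

def keep_ratio_schedule(step, total_steps, start=0.3, end=1.0):
    # Progressive restoration: the teacher begins at reduced capacity
    # (start) and is restored to its full, unmasked form (end) by the
    # final step. A linear schedule is assumed here.
    return start + (end - start) * min(step / total_steps, 1.0)

def combined_reward(accuracy_r, distill_r, beta=0.5):
    # Offline-RL reward from the abstract: correctness of a pre-generated
    # response blended with how easily it transfers from teacher to
    # student. beta is a hypothetical mixing weight.
    return (1.0 - beta) * accuracy_r + beta * distill_r

Under this reading, each distillation step would draw pre-generated responses from mask_teacher(teacher_state, keep_ratio_schedule(step, total_steps)) and weight them by combined_reward when updating the student.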