Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
Jingchen Sun ⋅ Shaobo Han ⋅ Deep Patel ⋅ Wataru Kohno ⋅ Can Jin ⋅ Changyou Chen
Abstract
Knowledge distillation establishes a learning paradigm in which a student learns from both data supervision and teacher guidance. However, the optimal weighting between learning from data and learning from the teacher is hard to determine, since some samples carry noisy labels while on others the teacher itself is uncertain. This raises a pressing need to balance data and teacher supervision adaptively. We propose Beta-weighted Knowledge Distillation ($\beta$-KD), an adaptive, uncertainty-aware knowledge distillation framework that supports arbitrary distillation objectives under a unified Bayesian formulation. Specifically, we model teacher signals as a Gibbs prior over student activations and use amortized optimization to jointly infer the activations and the weighting parameters $\beta$, yielding a closed-form, uncertainty-aware weighting. Extensive experiments distilling a 1.7B-parameter student from MobileVLM-7B demonstrate that $\beta$-KD consistently outperforms existing methods under different loss-combination settings. Moreover, large-scale distillation and evaluations on six multimodal benchmarks further confirm the effectiveness of the proposed approach.
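To make the adaptive-weighting idea concrete, the PyTorch sketch below shows one way a per-sample weight $\beta$ could trade off the data (cross-entropy) loss against the distillation (KL) loss. The abstract does not state $\beta$-KD's closed-form update, so the entropy-based rule for $\beta$, the function name `beta_weighted_kd_loss`, and the temperature value are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch only: the entropy-based weighting rule below is a stand-in
# assumption; beta-KD's actual closed-form weighting is derived in the paper.
import torch
import torch.nn.functional as F


def beta_weighted_kd_loss(student_logits, teacher_logits, labels, temperature=2.0):
    """Combine data (CE) and teacher (KL) supervision with per-sample weights beta.

    Hypothetical weighting: beta is large when the teacher is confident
    (low predictive entropy) and small when the teacher is uncertain.
    """
    # Per-sample data loss (cross-entropy against ground-truth labels).
    ce = F.cross_entropy(student_logits, labels, reduction="none")

    # Per-sample distillation loss (KL between softened teacher and student).
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(s_log_probs, t_probs, reduction="none").sum(-1) * temperature ** 2

    # Placeholder uncertainty proxy: normalized teacher entropy in [0, 1].
    entropy = -(t_probs * t_probs.clamp_min(1e-12).log()).sum(-1)
    max_entropy = torch.log(torch.tensor(float(student_logits.size(-1))))
    beta = 1.0 - entropy / max_entropy  # confident teacher -> larger beta

    # Convex per-sample combination of the two supervision signals.
    return ((1.0 - beta) * ce + beta * kd).mean()


# Minimal usage example with random logits and labels.
s = torch.randn(4, 10)
t = torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
loss = beta_weighted_kd_loss(s, t, y)
```

The design choice illustrated here is only the general principle named in the abstract: each sample gets its own weight rather than a single global KD coefficient, so label-noisy samples lean on the teacher and teacher-uncertain samples lean on the data.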