On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Abstract
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision-Language Models by learning from new data without forgetting previously acquired knowledge. Mixture-of-Experts (MoE) architectures support this by adding new experts and expanding routers while keeping existing ones frozen. However, despite expert isolation, they still suffer from forgetting due to router drift, where old-task tokens are mistakenly attracted to newly added experts, leading to performance degradation, \ie, forgetting. We propose a dynamic MoE approach with drift-aware token assignment that regularizes router drift and mitigates forgetting. We analyze this failure mode and link it to how different token types are assigned during training. In particular, tokens whose assignments are ambiguous between old and new experts tend to cause problems, although some can still be benign or even beneficial. Motivated by this, our proposed LLaVA-DyMoE incrementally expands the MoE and learns with a two-fold regularization strategy that constrains token assignment and dispatching by representing token types through their routing scores, thereby reducing router drift. Our drift-aware token assignment provides conditional guidance for ambiguous tokens to preserve old routing patterns, complemented by a pair of synergistic routing losses that enforce separation and promote specialization of the new experts. Extensive experiments demonstrate that LLaVA-DyMoE outperforms baselines, achieving over a 7\% increase in average accuracy and a 12\% reduction in forgetting through the mitigation of router-drift-induced forgetting.
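To make the idea of drift-aware token assignment concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of how a conditional regularizer on ambiguous tokens could look; the names `router_logits`, `old_expert_ids`, `new_expert_ids`, and `ambiguity_margin` are hypothetical placeholders introduced only for this example.

```python
import torch
import torch.nn.functional as F

def drift_regularization(router_logits, old_expert_ids, new_expert_ids,
                         ambiguity_margin=0.1, weight=1.0):
    """Sketch of a drift-aware penalty on ambiguous old-task tokens.

    router_logits: (num_tokens, num_experts) raw router outputs.
    old_expert_ids / new_expert_ids: indices of frozen vs. newly added experts.
    """
    probs = F.softmax(router_logits, dim=-1)            # routing scores per token
    old_mass = probs[:, old_expert_ids].sum(dim=-1)     # mass on frozen old experts
    new_mass = probs[:, new_expert_ids].sum(dim=-1)     # mass on new experts

    # A token is treated as "ambiguous" when old and new experts compete closely.
    ambiguous = (old_mass - new_mass).abs() < ambiguity_margin

    # Conditional guidance: only for ambiguous tokens, discourage routing mass
    # from drifting toward the new experts so old routing patterns are preserved.
    drift = F.relu(new_mass - old_mass)
    return weight * (drift * ambiguous.float()).mean()
```

In this sketch, tokens that are clearly routed to either old or new experts are left untouched, reflecting the observation that only ambiguously assigned tokens need guidance; the separation and specialization losses mentioned above would be added on top of this term.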