Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
Abstract
Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture-of-Experts (MoE) offers an effective solution by activating only a small subset of expert networks for each input, achieving high scalability with limited computation. Despite its effectiveness, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives gradients only from the experts it selects in each forward pass, its learning signal is highly localized and carries little information about the broader expert space. This limited gradient feedback can drive the router toward suboptimal configurations, for example collapsing onto only a few experts when no auxiliary losses are used, and it has also been associated with fluctuating expert selections during training. These behaviors suggest that task-driven signals alone do not provide sufficient guidance for learning robust routing in sparse MoE. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR-MoE consistently improves both accuracy and routing consistency while maintaining stable training even under highly sparse configurations.
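To make the pseudo-supervision idea concrete, the following is a minimal sketch of one plausible form of teacher-guided routing: the teacher router's expert distribution serves as a soft target for the student router via a KL-divergence term. This is an illustrative assumption, not the paper's reference implementation; the function name, the `temperature` parameter, and the `tgr_weight` coefficient in the usage note are hypothetical.

```python
# Hedged sketch: teacher-guided routing supervision as a KL term between the
# teacher's and student's per-token expert distributions (assumed formulation).
import torch
import torch.nn.functional as F

def teacher_guided_routing_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over per-token routing distributions.

    student_logits, teacher_logits: [num_tokens, num_experts]
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Teacher probabilities are detached so gradients flow only into the
    # student router; "batchmean" averages the KL over tokens.
    return F.kl_div(student_log_probs, teacher_probs.detach(),
                    reduction="batchmean")

# Hypothetical usage alongside the task loss when training the sparse student:
# total_loss = task_loss + tgr_weight * teacher_guided_routing_loss(s_logits, t_logits)
```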