Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Abstract
Sparse upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from a pretrained dense checkpoint instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that embeds semantic structure into MoE initialization. The method clusters the dense model's input activations to identify latent subspaces, initializes each expert using a data-aware truncated SVD of the dense weights within its cluster, and initializes the router with the corresponding cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data structure. In addition, we introduce an Expert-Ensemble Self-Distillation loss that regularizes training by guiding uncertain routing with stable predictions from an ensemble teacher. Applied to CLIP ViT-B/16 and ViT-B/32 models, Cluster-aware Upcycling achieves consistent improvements over standard upcycling across zero-shot and few-shot benchmarks, and produces more diverse and disentangled expert representations.
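To make the initialization recipe concrete, the sketch below illustrates one plausible reading of the cluster-aware upcycling step for a single dense layer: input activations are clustered with k-means, each expert is seeded with a low-rank, cluster-specific copy of the dense weights obtained via a truncated SVD computed on that cluster's data, and the router is seeded with the cluster centroids. The function name, the exact data-aware factorization, and the shapes are assumptions for illustration; the paper's actual procedure may differ in these details.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_aware_upcycle(W, X, num_experts, rank):
    """Hypothetical sketch: build MoE expert and router initializations from a
    dense layer's weights W (d_out x d_in) and a sample of its input
    activations X (n x d_in)."""
    # 1. Cluster the input activations to expose latent subspaces in the data.
    km = KMeans(n_clusters=num_experts, n_init=10).fit(X)
    centroids = km.cluster_centers_          # (num_experts, d_in)

    expert_weights = []
    for e in range(num_experts):
        Xe = X[km.labels_ == e]              # activations assigned to cluster e
        # 2. Data-aware truncated SVD: factor the dense layer's outputs as
        #    seen through this cluster's activations (assumed formulation).
        Ye = Xe @ W.T                        # (n_e, d_out) dense outputs on cluster e
        r = min(rank, Ye.shape[0])           # guard against tiny clusters
        _, _, Vt = np.linalg.svd(Ye, full_matrices=False)
        # Project the dense weights onto the cluster's top-r output directions,
        # giving expert e a low-rank, cluster-specific starting point.
        P = Vt[:r].T @ Vt[:r]                # (d_out, d_out) projector
        expert_weights.append(P @ W)

    # 3. Router initialized with cluster centroids so initial routing logits
    #    align with the discovered activation clusters.
    router_weights = centroids               # (num_experts, d_in)
    return expert_weights, router_weights
```

Because every expert starts from a different projection of the same dense weights and the router already points at distinct regions of activation space, the symmetry of standard upcycling is broken from the first training step.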