Continual Distillation of Teachers from Different Domains
Abstract
Deep learning models continue to scale, with some models now requiring more storage than many datasets combined. We introduce a new paradigm: Continual Distillation (CD), in which a student learns sequentially from a stream of teacher models without retaining earlier teachers. CD faces two challenges: the teachers' training data is unavailable, and the teachers differ in their domains of expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire knowledge of domains not present in the training data but known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF): knowledge transferred from earlier teachers is lost after training on later ones. To better balance UKT against UKF, we propose Self External Data Distillation (SE2D), a method that preserves the student's logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code will be released upon acceptance.
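The sketch below illustrates one plausible reading of the SE2D idea described above, and is not taken from the paper: before distilling from each new teacher on external unlabeled data, the student snapshots itself and adds a term that keeps its new logits close to the snapshot's, so knowledge transferred from earlier teachers is not overwritten. All names (`se2d_step`, `alpha`, `temperature`) and the exact loss form are assumptions for illustration.

```python
# Hypothetical sketch of the SE2D idea: distill from the current teacher on
# external unlabeled data while anchoring to a frozen copy of the student
# taken before this teacher, to limit Unseen Knowledge Forgetting (UKF).
import copy
import torch
import torch.nn.functional as F


def kl_soft(student_logits, target_logits, temperature=2.0):
    """KL divergence between temperature-softened distributions."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    q = F.softmax(target_logits / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2


def se2d_step(student, teacher, snapshot, external_batch, optimizer,
              alpha=0.5, temperature=2.0):
    """One update on an external unlabeled batch.

    alpha trades off distillation from the new teacher (UKT) against
    preserving the student's own earlier logits (mitigating UKF).
    """
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(external_batch)    # current teacher's targets
        snapshot_logits = snapshot(external_batch)  # student's pre-teacher logits

    student_logits = student(external_batch)
    loss_new = kl_soft(student_logits, teacher_logits, temperature)    # acquire new knowledge
    loss_keep = kl_soft(student_logits, snapshot_logits, temperature)  # retain old knowledge
    loss = alpha * loss_new + (1 - alpha) * loss_keep

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def continual_distillation(student, teachers, external_loader, optimizer, epochs=1):
    """Distill sequentially from a stream of teachers, one at a time."""
    for teacher in teachers:
        teacher.eval()
        # Frozen snapshot of the student taken before the new teacher arrives.
        snapshot = copy.deepcopy(student).eval()
        for p in snapshot.parameters():
            p.requires_grad_(False)
        for _ in range(epochs):
            for batch in external_loader:
                se2d_step(student, teacher, snapshot, batch, optimizer)
```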