Chain-of-Models Pre-training: Rethinking Training Acceleration of Vision Foundation Models
Abstract
In this paper, we present Chain-of-Models Pre-training (CoM-PT), a novel training acceleration method for vision transformer models that incurs no performance loss. This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing the training of each model individually, CoM-PT accelerates the training pipeline at the level of the model family, and its efficiency scales with the family size. Specifically, CoM-PT arranges the model family into a pre-training sequence in ascending order of parameter size, which we call the model chain. In this chain, only the smallest model undergoes standard individual pre-training; every other model is trained efficiently through sequential inverse knowledge transfer from its smaller predecessor, jointly reusing knowledge in both the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance comparable to standard individual training at a significantly reduced training cost. Thanks to the properties of the model chain, we empirically observe two compelling phenomena: i) adding smaller models can even decrease the total training cost, and ii) adding medium-sized models incurs only a marginal additional cost. In light of this, CoM-PT is the first to unlock pre-training efficiency that scales favorably with family size, offering broad deployment flexibility across devices. We plan to open-source the code and encourage the community to extend it to more pre-training paradigms.
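To make the chain pipeline concrete, below is a minimal PyTorch sketch of what the training loop could look like. The function names (reuse_parameters, chain_pretrain, feat_fn), the block-wise weight copy used for parameter-space reuse, and the MSE feature-alignment term used for feature-space reuse are our illustrative assumptions; the abstract does not specify the actual parameter mapping or transfer objective, which may well differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def reuse_parameters(small: nn.Module, large: nn.Module) -> None:
    """Parameter-space reuse (hypothetical scheme): copy each tensor of the
    smaller predecessor into the top-left block of the matching tensor of the
    next model in the chain."""
    big_sd = large.state_dict()
    for name, w in small.state_dict().items():
        if name in big_sd and w.dim() == big_sd[name].dim():
            # Slice both tensors to their common shape before copying.
            idx = tuple(slice(0, min(a, b)) for a, b in zip(w.shape, big_sd[name].shape))
            big_sd[name][idx] = w[idx]
    large.load_state_dict(big_sd)


def chain_pretrain(models, loader, task_loss, feat_fn, epochs, alpha=0.5, lr=1e-3):
    """Train `models` (sorted by ascending parameter count) along the chain.
    Only models[0] is pre-trained from scratch; each successor inherits its
    predecessor's weights and is additionally guided by a feature-alignment
    term standing in for feature-space knowledge reuse."""
    prev = None
    for model in models:
        if prev is not None:
            reuse_parameters(prev, model)  # inverse transfer: small -> large
            prev.eval()
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                loss = task_loss(model(x), y)
                if prev is not None:
                    with torch.no_grad():
                        f_prev = feat_fn(prev, x)  # frozen predecessor features
                    f_cur = feat_fn(model, x)
                    # Align on the shared feature width (assumed projection).
                    d = min(f_prev.shape[-1], f_cur.shape[-1])
                    loss = loss + alpha * F.mse_loss(f_cur[..., :d], f_prev[..., :d])
                opt.zero_grad()
                loss.backward()
                opt.step()
        prev = model
    return models
```

The caller supplies the model family, data loader, pre-training loss, and a feature extractor; in this sketch the sequential structure alone illustrates why only the smallest model pays the full pre-training cost while each larger model starts from transferred knowledge.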