Flow Matching for Multimodal Distributions
Abstract
Visual foundation models play an increasingly important role in improving the training efficiency of flow-based generative models by inducing a structured latent space through alignment, distillation, adapters, and even replacement of visual encoders. While a structured latent space improves training efficiency by lowering the complexity of the target (latent) distribution, efficiency can be further boosted by a data-adaptive multimodal source (noise) distribution that globally shortens the distance to the target (latent) distribution, together with a mode-dependent coupling between source and target samples that moves probability mass locally. To this end, we propose an efficient source-and-coupling co-design algorithm termed Mixture-Modeling Flow Matching (MM-FM). Under a linear conditional flow objective and a multimodal target assumption, our theoretical results show that sampling trajectories are straighter and shorter, and that the learning problem has a smaller Lipschitz constant, relative to an isotropic Gaussian source with independent coupling. In experiments on ImageNet 256×256 with multimodal DINOv2-B latents, we observe superior convergence and state-of-the-art unconditional generation (FID = 2.74 with autoguidance) in only 80 epochs.
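The abstract's core idea — a multimodal source distribution co-designed with a mode-dependent coupling under the linear conditional flow objective — can be illustrated with a minimal toy sketch. The two-mode target, the per-mode source components, and all numerical values below are illustrative assumptions, not the paper's actual construction for DINOv2 latents:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy target: a two-mode latent distribution (a stand-in for
# multimodal visual-encoder latents; centers and scales are assumptions).
n = 1024
modes = rng.integers(0, 2, size=n)
centers = np.array([[-4.0, 0.0], [4.0, 0.0]])
x1 = centers[modes] + 0.5 * rng.normal(size=(n, 2))

# Data-adaptive multimodal source: one Gaussian component per target mode,
# centered on the empirical mode mean (a sketch of the co-design idea).
mu = np.stack([x1[modes == k].mean(axis=0) for k in range(2)])

# Mode-dependent coupling: each target sample is paired with a source
# sample drawn from its *own* mode's component, not independently.
x0 = mu[modes] + rng.normal(size=x1.shape)

# Linear conditional flow: x_t = (1 - t) x0 + t x1, velocity target x1 - x0.
t = rng.uniform(size=(n, 1))
xt = (1.0 - t) * x0 + t * x1
v_target = x1 - x0

# Coupled pairs travel shorter distances on average than independently
# permuted pairs, which is the intuition behind the straighter-trajectory
# and smaller-Lipschitz-constant claims.
coupled_len = np.linalg.norm(x1 - x0, axis=1).mean()
indep_len = np.linalg.norm(x1 - x0[rng.permutation(n)], axis=1).mean()
print(coupled_len < indep_len)  # → True
```

Independent coupling frequently pairs a source sample from one mode with a target sample from the other, forcing long cross-mode transport paths; the mode-dependent coupling keeps transport local to each mode.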