Generalizable Video Quality Assessment via Weak-to-Strong Learning
Abstract
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets and, despite substantial progress, still generalizes poorly to unseen video content. In this work, we explore \textbf{weak-to-strong (W2S) learning} as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy enables a strong student model not only to match its weak teacher on in-domain benchmarks but also to surpass it on out-of-distribution (OOD) benchmarks, revealing a \textbf{distinct weak-to-strong effect in VQA}. Building on this insight, we propose a novel framework that enhances W2S learning in two ways: (1) \textbf{integrating homogeneous and heterogeneous supervision signals} from diverse VQA teachers, including off-the-shelf VQA models and synthetic distortion simulators, via a learn-to-rank formulation, and (2) \textbf{iterative W2S training}, in which each strong student is recycled as the teacher for the next cycle, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results on both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to breaking the annotation barrier and achieving scalable generalization in video quality assessment.
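To make the abstract's training recipe concrete, the following is a minimal sketch of one weak-to-strong round driven by a learn-to-rank objective, followed by an iterative loop in which the trained student replaces the teacher. It is not the paper's implementation: the model `TinyVQAModel`, the functions `pairwise_rank_loss` and `w2s_round`, the tiny random clips, and all hyperparameters are hypothetical placeholders chosen only to keep the example self-contained and runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVQAModel(nn.Module):
    """Toy video-quality scorer: maps a (B, T, C, H, W) clip to one score per video.
    Stands in for either a frozen off-the-shelf VQA teacher or a trainable student."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.conv = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden, 1)

    def forward(self, video):                                  # video: (B, T, C, H, W)
        b, t, c, h, w = video.shape
        feats = F.relu(self.conv(video.view(b * t, c, h, w)))  # per-frame features
        pooled = feats.mean(dim=(2, 3)).view(b, t, -1).mean(dim=1)  # temporal pooling
        return self.head(pooled).squeeze(-1)                   # (B,) predicted quality


def pairwise_rank_loss(student_scores, teacher_scores, margin=0.1):
    """Learn-to-rank surrogate: the student must order every pair of videos in the
    batch the same way the teacher's pseudo-scores do (no absolute labels needed)."""
    s_i, s_j = student_scores.unsqueeze(1), student_scores.unsqueeze(0)
    t_i, t_j = teacher_scores.unsqueeze(1), teacher_scores.unsqueeze(0)
    target = torch.sign(t_i - t_j)                             # +1 / -1 / 0 per pair
    return F.margin_ranking_loss(
        s_i.expand_as(target), s_j.expand_as(target), target, margin=margin
    )


def w2s_round(teacher, student, sample_batch, steps=100, lr=1e-4):
    """One weak-to-strong round: distill the teacher's rankings on unlabeled clips
    into the student. A synthetic-distortion teacher could contribute extra pairs
    (pristine vs. degraded versions of the same clip); omitted here for brevity."""
    teacher.eval()
    optim = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        clips = sample_batch()                                 # unlabeled video batch
        with torch.no_grad():
            pseudo = teacher(clips)                            # weak pseudo-quality scores
        loss = pairwise_rank_loss(student(clips), pseudo)
        optim.zero_grad()
        loss.backward()
        optim.step()
    return student


if __name__ == "__main__":
    sample_batch = lambda: torch.rand(8, 4, 3, 32, 32)   # 8 tiny clips, 4 frames each
    teacher = TinyVQAModel()                              # placeholder for a weak teacher
    for cycle in range(2):                                # iterative W2S: recycle the student
        student = TinyVQAModel(hidden=32)                 # higher-capacity "strong" student
        student = w2s_round(teacher, student, sample_batch)
        teacher = student                                 # student becomes next round's teacher
```

Under these assumptions, the loop illustrates the two ideas named in the abstract: ranking-based supervision from teacher pseudo-scores rather than human labels, and iterative cycles in which each strong student is promoted to teacher for the next round.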