TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
Abstract
Learning temporal transformations, that is, how visual objects evolve across frames, is a fundamental challenge in video representation learning. Frame-to-frame dynamics involve complex, non-linear, and non-local changes that go far beyond conventional spatial augmentations. We propose TimeBridge, a self-supervised method that combines joint embedding for video representation with explicit learning of temporal transformations by reconstructing in-between frames from only the start and end frames. This formulation encourages the model to infer the temporal evolution bridging the two endpoints rather than merely encoding static frame representations. Unlike joint-embedding methods, which lack explicit transformation modelling, and future-prediction objectives, which rely on unconstrained extrapolation, TimeBridge learns concrete frame-to-frame dynamics by promoting temporal consistency. We realise this through cross-concatenated class tokens and lightweight decoders, which recombine features from the start and end frames to reconstruct the intermediate frames. TimeBridge achieves new state-of-the-art performance on multiple dense video prediction benchmarks, including 73.5 J&F on DAVIS 2017 video object segmentation and 47.5 mIoU on VIP part propagation.
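To make the core mechanism described above concrete, the following is a minimal sketch of the in-between prediction objective, assuming a ViT-style encoder whose forward pass returns a (class token, patch features) pair. The names `InBetweenDecoder` and `timebridge_step`, and all dimensions, are illustrative assumptions rather than the paper's reference implementation.

```python
# Hypothetical sketch of in-between frame prediction from cross-concatenated
# class tokens; names and shapes are illustrative, not the authors' code.
import torch
import torch.nn as nn

class InBetweenDecoder(nn.Module):
    """Lightweight decoder: reads the concatenated class tokens of the
    start and end frames and predicts an intermediate frame's patch features."""
    def __init__(self, dim=768, hidden=1024, num_patches=196):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),   # [cls_start ; cls_end] -> hidden
            nn.GELU(),
            nn.Linear(hidden, num_patches * dim),
        )
        self.num_patches = num_patches
        self.dim = dim

    def forward(self, cls_start, cls_end):
        # Cross-concatenate the two endpoint class tokens so the decoder
        # must recombine information from both frames.
        joint = torch.cat([cls_start, cls_end], dim=-1)   # (B, 2*dim)
        out = self.mlp(joint)                             # (B, P*dim)
        return out.view(-1, self.num_patches, self.dim)   # (B, P, dim)

def timebridge_step(encoder, decoder, frames):
    """One hypothetical training step. frames: (B, T, C, H, W). The decoder
    reconstructs a middle frame's features from the endpoint class tokens."""
    T = frames.shape[1]
    mid = T // 2
    cls_s, _ = encoder(frames[:, 0])          # class token of start frame
    cls_e, _ = encoder(frames[:, -1])         # class token of end frame
    with torch.no_grad():
        _, target = encoder(frames[:, mid])   # patch features of a middle frame
    pred = decoder(cls_s, cls_e)
    return nn.functional.mse_loss(pred, target)
```

Under these assumptions, the reconstruction target is produced by the encoder itself (held fixed via `no_grad`), so the decoder can only succeed by modelling the temporal transformation that bridges the two endpoints.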