Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields
Guangyang Wu ⋅ Youran Ding ⋅ Xinyu Che ⋅ Benyuan Sun ⋅ Yi Yang ⋅ Xiaohong Liu
Abstract
Tracking-any-point (TAP) answers query-conditioned correspondence but leaves the dense, all-pairs structure of a video implicit. We formulate All-Pairs Tracking (APT): given a video, predict dense displacement and visibility for every source-target frame pair, from which per-pixel trajectories can be read out. To this end, we propose PairFormer, a feed-forward transformer that addresses APT in a single pass. A spatio-temporal patch encoder computes temporally conditioned features for all frames. CorrBank constructs a learnable correlation memory for each frame pair to obtain pairwise motion tokens. A broadcast motion mixer aggregates information across space and time and refines these tokens with global context. A trajectory head then predicts full-resolution displacement for each pair, yielding a coherent all-pairs trajectory field. To support APT at scale, we develop PAIRender, a data platform that synthesizes photo-realistic dynamic scenes with dense annotations. From PAIRender we derive a training set ($\pi$-R10K) and a benchmark (APT-Bench) with an all-to-all evaluation protocol. Experiments show that PairFormer achieves strong performance on APT-Bench and competitive results on standard TAP benchmarks. Code and dataset will be released upon publication.
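To make the APT formulation concrete, the sketch below illustrates one plausible layout for the all-pairs output and how a per-pixel trajectory is read out from it. This is a minimal illustration, not the released code: the tensor layout, the shape values, and the `read_out_trajectory` helper are assumptions, following only the abstract's definition that the model predicts dense displacement and visibility for every source-target frame pair.

```python
import numpy as np

# Illustrative shapes: T frames at H x W resolution (values are arbitrary).
T, H, W = 8, 64, 64

# Assumed all-pairs trajectory field layout:
#   disp[s, t, y, x] = 2D displacement of pixel (x, y) from frame s to frame t
#   vis[s, t, y, x]  = whether that pixel is visible in frame t
disp = np.zeros((T, T, H, W, 2), dtype=np.float32)
vis = np.ones((T, T, H, W), dtype=bool)

def read_out_trajectory(disp, vis, s, x, y):
    """Read the trajectory of pixel (x, y) in source frame s across all frames.

    Returns positions of shape (T, 2) and a (T,) visibility mask.
    """
    base = np.array([x, y], dtype=np.float32)
    positions = base + disp[s, :, y, x]  # (T, 2): position in every target frame
    visible = vis[s, :, y, x]            # (T,): visibility in every target frame
    return positions, visible

# Example: trajectory of the pixel at (10, 20) in frame 0.
traj, mask = read_out_trajectory(disp, vis, s=0, x=10, y=20)
print(traj.shape, mask.shape)  # (8, 2) (8,)
```

Note that such a field has O(T^2) frame pairs, each carrying a full-resolution displacement and visibility map, which is why the all-to-all evaluation protocol and a feed-forward single-pass predictor matter at scale.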