Skip to yearly menu bar Skip to main content


Learning Transformation-Predictive Representations for Detection and Description of Local Features

Zihao Wang · Chunxu Wu · Yifei Yang · Zhen Li

West Building Exhibit Halls ABC 308


The task of key-points detection and description is to estimate the stable location and discriminative representation of local features, which is essential for image matching. However, either the rough hard positive or negative labels generated from one-to-one correspondences among images bring indistinguishable samples, called pseudo positives or negatives, which act as inconsistent supervisions while learning key-points used for matching. Such pseudo-labeled samples prevent deep neural networks from learning discriminative descriptions for accurate matching. To tackle this challenge, we propose to learn transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponded views of the same 3D point (landmark) by using none of the negative sample pairs (including true and pseudo negatives) and avoiding collapsing solutions. Then we design a learnable label prediction mechanism to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels extensively deal with the training bottleneck (derived from the label noise of pseudo positives) and make the model can be trained under a stronger augmentation paradigm. Our self-supervised method outperforms the state-of-the-art on the standard image matching benchmarks by noticeable margins and shows excellent generalization capability on multiple downstream tasks.

Chat is not available.