Convolutional Neural Networks Driven by Content Similarity
Abstract
Although convolutional neural networks (CNNs) have continued to evolve in recent years, Transformers have become increasingly popular in the field of computer vision. In this work, we open a new avenue for CNNs, enabling them to aggregate information based on content similarity—an ability analogous to the self-attention mechanism. We adopt a reverse perspective to transform the feature similarity between tokens into relative positional information: specifically, the higher the feature similarity between two tokens, the closer they are treated as being in position. This approach allows convolution operations to be indirectly transformed into an aggregation mode driven by content similarity. Experiments show that our proposed model, named Ego, achieves excellent performance across various tasks, underscoring the untapped potential of CNNs. Code and models will be made publicly available.
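The core idea can be illustrated with a minimal sketch: rank each token's neighbors by feature similarity so that more similar tokens occupy the "nearer" kernel taps, then apply a fixed convolution kernel over that similarity-induced ordering. All names and details below are hypothetical simplifications for illustration, not the paper's actual implementation.

```python
import numpy as np

def similarity_driven_aggregation(tokens, kernel):
    """Hypothetical sketch: treat higher cosine similarity as smaller
    relative distance, then convolve over the k most similar tokens."""
    n, d = tokens.shape
    k = len(kernel)
    # pairwise cosine similarity between all tokens
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    out = np.empty_like(tokens)
    for i in range(n):
        # similarity-to-position mapping: the most similar tokens are
        # assigned the nearest kernel taps, mimicking spatial locality
        order = np.argsort(-sim[i])[:k]
        out[i] = kernel @ tokens[order]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))          # 6 tokens, 4-dim features
w = np.array([0.5, 0.3, 0.2])            # 3-tap kernel, nearest-first
y = similarity_driven_aggregation(x, w)
print(y.shape)                            # (6, 4)
```

Because a token is always most similar to itself, the first kernel tap acts on the token's own features, loosely analogous to the center tap of an ordinary convolution.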