D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
Abstract
Accurately capturing and aggregating spatiotemporal information has become crucial for video object detection. Previous methods mainly perform feature aggregation in the spatiotemporal domain, treating all regions indiscriminately and thus overlooking both their relative importance and the frequency characteristics that capture periodic motion patterns. This limits their ability to model dynamic interactions and adapt to complex scene variations. In this paper, we propose a novel Dual-Domain Feature Aggregation Network (D2FANet) for video object detection, which, to the best of our knowledge, is the first work to introduce frequency-domain feature aggregation into the video object detection task. By collaboratively modeling spatiotemporal and frequency information, D2FANet enhances motion awareness and temporal consistency, thereby improving detection accuracy. First, we develop a frequency-domain feature aggregation module that decomposes frame features into high- and low-frequency components and reinforces object query representations by aggregating multi-scale frequency features. Second, we design a spatiotemporal-domain feature aggregation module that leverages an importance guidance mechanism to dynamically emphasize regions according to their importance and reinforces object query representations by guiding the aggregation of spatiotemporal features. Experiments on the ImageNet VID and EPIC-KITCHENS datasets demonstrate that D2FANet achieves state-of-the-art performance. The code will be made available.
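To make the high-/low-frequency decomposition mentioned in the abstract concrete, the following is a minimal PyTorch-style sketch of one plausible way to split per-frame features into frequency bands via a 2-D FFT and a radial mask. The function name `frequency_decompose` and the `cutoff_ratio` hyperparameter are our own illustrative assumptions, not the paper's actual implementation.

```python
import torch


def frequency_decompose(feat: torch.Tensor, cutoff_ratio: float = 0.25):
    """Split a (B, C, H, W) feature map into low- and high-frequency parts.

    Illustrative sketch only: `cutoff_ratio` is an assumed hyperparameter,
    not a value taken from the paper.
    """
    _, _, H, W = feat.shape
    # 2-D FFT over the spatial dims, shifted so the zero frequency
    # sits at the center of the spectrum.
    spec = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))

    # Circular low-pass mask around the spectrum center.
    yy = torch.arange(H, device=feat.device, dtype=torch.float32).view(-1, 1) - H // 2
    xx = torch.arange(W, device=feat.device, dtype=torch.float32).view(1, -1) - W // 2
    radius = cutoff_ratio * min(H, W) / 2
    low_mask = ((yy ** 2 + xx ** 2).sqrt() <= radius).float()

    def to_spatial(s):
        # Undo the shift, invert the FFT, and keep the real part.
        return torch.fft.ifft2(torch.fft.ifftshift(s, dim=(-2, -1)), norm="ortho").real

    low = to_spatial(spec * low_mask)           # smooth appearance / layout
    high = to_spatial(spec * (1.0 - low_mask))  # edges and fine motion detail
    return low, high


if __name__ == "__main__":
    frames = torch.randn(2, 256, 32, 32)  # dummy frame features
    low, high = frequency_decompose(frames)
    # The two bands partition the spectrum, so they sum back to the
    # original features up to FFT round-trip precision.
    print(torch.allclose(low + high, frames, atol=1e-5))
```

Because the FFT is linear and the two masks partition the spectrum, the bands reconstruct the input exactly, so any aggregation built on them (e.g., reinforcing object queries with multi-scale frequency features, as the abstract describes) loses no information at the decomposition step.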