When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
Abstract
Video object detection has made notable progress with the advent of transformers. While transformers excel at modeling long-range contextual dependencies, their quadratic complexity limits efficiency when processing long sequences. In contrast, Mamba offers greater efficiency in modeling long sequences but exhibits relatively limited contextual learning capability compared with transformers, and its application to video object detection remains unexplored. To harness the complementary strengths of transformers and Mamba, we propose a hybrid Transformer-Mamba network for video object detection (TMambaDet), a pioneering framework in this domain that combines the long-range modeling power of transformers with the efficient long-sequence processing of Mamba. Our TMambaDet is characterized by three core components: 1) a spatial adaptive deformable transformer encoder that models long-range dependencies within each frame, enabling intra-frame feature aggregation that substantially improves the spatial feature representations of objects; 2) a temporal cascaded bidirectional Mamba encoder that captures long-range dependencies across frames with linear complexity, enabling inter-frame feature aggregation that enhances the temporal feature representations of objects; 3) a Mamba entangled transformer decoder that fully explores the interactions between object queries and spatial-temporal features, enabling fine-grained query-feature alignment that enriches the instance-level representations of object queries. Experiments on the ImageNet VID and EPIC-KITCHENS-55 datasets show that TMambaDet achieves state-of-the-art results. Code will be released.
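To make the three-stage design concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes: intra-frame aggregation, then inter-frame aggregation, then query decoding. It is an illustrative approximation under stated assumptions, not the paper's implementation: the names (`SpatialEncoder`, `TemporalBiSSM`, `QueryDecoder`, `forward_pipeline`) are hypothetical, a vanilla transformer layer stands in for the spatial adaptive deformable encoder, a bidirectional GRU stands in for the cascaded bidirectional Mamba scans (both process the temporal axis in linear time), and a single cross-attention layer stands in for the Mamba entangled decoder.

```python
# Illustrative sketch only: shows the data flow of a hybrid
# spatial-encoder -> temporal-encoder -> query-decoder pipeline,
# with stand-in modules for the paper's actual components.
import torch
import torch.nn as nn


class SpatialEncoder(nn.Module):
    """Intra-frame aggregation (stand-in for the deformable transformer encoder)."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x):  # x: (B*T, HW, C) tokens of each frame
        return self.layer(x)


class TemporalBiSSM(nn.Module):
    """Inter-frame aggregation in linear time (stand-in for bidirectional Mamba)."""
    def __init__(self, d_model=256):
        super().__init__()
        # Bidirectional recurrence over the frame axis; output dim stays d_model.
        self.rnn = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)

    def forward(self, x):  # x: (B*HW, T, C) tokens of each spatial location over time
        out, _ = self.rnn(x)
        return out


class QueryDecoder(nn.Module):
    """Object queries cross-attend to fused spatio-temporal features."""
    def __init__(self, d_model=256, nhead=8, num_queries=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, mem):  # mem: (B, T*HW, C)
        q = self.queries.unsqueeze(0).expand(mem.size(0), -1, -1)
        out, _ = self.cross_attn(q, mem, mem)
        return out  # (B, num_queries, C) instance-level embeddings


def forward_pipeline(frames):
    """frames: (B, T, HW, C) flattened per-frame feature tokens (shape demo only)."""
    B, T, HW, C = frames.shape
    # 1) Spatial encoder: attend within each frame.
    spatial = SpatialEncoder(C)(frames.reshape(B * T, HW, C))
    # 2) Temporal encoder: scan across frames at each spatial location.
    t_in = spatial.reshape(B, T, HW, C).permute(0, 2, 1, 3).reshape(B * HW, T, C)
    temporal = TemporalBiSSM(C)(t_in)
    # 3) Decoder: queries interact with the fused spatio-temporal memory.
    mem = temporal.reshape(B, HW, T, C).permute(0, 2, 1, 3).reshape(B, T * HW, C)
    return QueryDecoder(C)(mem)


if __name__ == "__main__":
    feats = torch.randn(2, 4, 196, 256)   # 2 clips, 4 frames, 14x14 tokens, 256-d
    print(forward_pipeline(feats).shape)  # torch.Size([2, 100, 256])
```

The key design point the sketch preserves is the complexity split: attention is applied only within a frame (quadratic in HW, not in T), while the cross-frame axis is handled by a linear-time recurrent scan, so cost grows linearly with sequence length.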