

Oral Session

Oral Session 2C: Temporal Modeling and Action Recognition

Fri 13 Jun 11 a.m. PDT — 12:30 p.m. PDT

Fri 13 June 11:00 - 11:15 PDT

Award Candidate
Descriptor-In-Pixel : Point-Feature Tracking For Pixel Processor Arrays

Laurie Bose · Piotr Dudek · Jianing Chen

This paper presents a novel approach for joint point-feature detection and tracking, specifically designed for Pixel Processor Array (PPA) sensors. Instead of standard pixels, PPA sensors consist of thousands of "pixel-processors", enabling massively parallel computation on visual data at the point of light capture. Our approach performs all computation within these pixel-processors, meaning no raw image data need ever leave the sensor. Instead, sensor output can be reduced to merely the locations of tracked features and the descriptors of newly initialized features, minimizing data transfer between the sensor and external processing. To achieve this we store feature descriptors inside every pixel-processor, adjusting the layout of these descriptors every frame. The PPA's architecture enables us to compute the response of every stored descriptor in parallel. This "response map" is utilized for both detection and tracking of point-features across the pixel-processor array. The approach is very fast: our implementation on the SCAMP-7 PPA prototype runs at over 3000 FPS (frames per second), tracking point-features reliably even under violent motion. This is the first work performing point-feature detection and tracking entirely "in-pixel".
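
The abstract describes storing a feature descriptor in every pixel-processor and evaluating all stored descriptors in parallel to form a "response map" used for both detection and tracking. The actual method runs inside SCAMP-7 hardware; the sketch below is only an off-sensor illustration of the response-map idea, assuming binary, BRIEF-style pixel-comparison descriptors and Hamming-style matching (these choices are assumptions, not the paper's exact descriptor).

```python
import numpy as np

def binary_patch_descriptor(img, y, x, offsets):
    """Binarise pixel-pair comparisons around (y, x) into a 0/1 descriptor.
    `offsets` is a (D, 2, 2) array of BRIEF-style sampling-point pairs
    (an assumed descriptor, not the paper's)."""
    h, w = img.shape
    desc = np.zeros(len(offsets), dtype=np.uint8)
    for i, ((dy1, dx1), (dy2, dx2)) in enumerate(offsets):
        p1 = img[np.clip(y + dy1, 0, h - 1), np.clip(x + dx1, 0, w - 1)]
        p2 = img[np.clip(y + dy2, 0, h - 1), np.clip(x + dx2, 0, w - 1)]
        desc[i] = p1 > p2
    return desc

def response_map(img, stored_descriptors, offsets):
    """For every pixel holding a stored descriptor, score how well the current
    frame matches it (Hamming similarity). On a PPA this would run in parallel
    inside the pixel-processors; here it is a plain loop for illustration."""
    h, w = img.shape
    resp = np.zeros((h, w), dtype=np.float32)
    for (y, x), desc in stored_descriptors.items():
        current = binary_patch_descriptor(img, y, x, offsets)
        resp[y, x] = np.mean(current == desc)  # 1.0 = perfect match
    return resp

# Toy usage: one tracked feature whose descriptor was stored at pixel (16, 16).
rng = np.random.default_rng(0)
frame = rng.random((32, 32)).astype(np.float32)
offsets = rng.integers(-3, 4, size=(64, 2, 2))
stored = {(16, 16): binary_patch_descriptor(frame, 16, 16, offsets)}
print(response_map(frame, stored, offsets)[16, 16])  # ~1.0 on an unchanged frame
```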

Fri 13 June 11:15 - 11:30 PDT

Temporally Consistent Object-Centric Learning by Contrasting Slots

Anna Manasyan · Maximilian Seitzer · Filip Radovic · Georg Martius · Andrii Zadaianchuk

Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.
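
The object-level temporal contrastive loss is not spelled out in the abstract beyond contrasting slots across time; the following is a minimal InfoNCE-style sketch, assuming slots at frames t and t+1 are already matched by index (e.g., by the model's recurrent slot assignment) and that all other slots in the batch act as negatives.

```python
import torch
import torch.nn.functional as F

def slot_temporal_contrastive_loss(slots_t, slots_t1, temperature=0.1):
    """InfoNCE-style loss over object slots.

    slots_t, slots_t1: (B, K, D) slot representations at frames t and t+1,
    assumed matched along the K dimension. Positive pair = same slot across
    frames; negatives = all other slots in the batch.
    """
    B, K, D = slots_t.shape
    a = F.normalize(slots_t.reshape(B * K, D), dim=-1)
    b = F.normalize(slots_t1.reshape(B * K, D), dim=-1)
    logits = a @ b.t() / temperature          # (B*K, B*K) cosine similarities
    targets = torch.arange(B * K, device=a.device)
    # Symmetric cross-entropy: each slot should pick its own future/past self.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: nearly temporally consistent slots give a small loss.
slots_t = torch.randn(2, 5, 64)
slots_t1 = slots_t + 0.01 * torch.randn_like(slots_t)
print(slot_temporal_contrastive_loss(slots_t, slots_t1).item())
```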

Fri 13 June 11:30 - 11:45 PDT

Temporal Alignment-Free Video Matching for Few-shot Action Recognition

SuBeen Lee · WonJun Moon · Hyun Seok Seong · Jae-Pil Heo

Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While frame- and tuple-level alignment approaches have been promising, they rely heavily on pre-defined, length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, ensuring flexibility. Furthermore, TEAM is inherently efficient, using token-wise comparisons to measure similarity between videos, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes information common across novel classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM.
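
To make the token-wise comparison concrete, below is a minimal sketch of an alignment-free video similarity over fixed-size sets of pattern tokens. How TEAM actually aggregates token scores is not stated in the abstract; the max-then-mean aggregation here is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

def tokenwise_similarity(tokens_q, tokens_s):
    """Alignment-free video similarity.

    tokens_q, tokens_s: (N, D) fixed-size sets of pattern tokens for a query
    and a support video (N is independent of video length). Each query token
    is matched to its best support token and scores are averaged; no frame-
    or tuple-level temporal alignment is performed.
    """
    q = F.normalize(tokens_q, dim=-1)
    s = F.normalize(tokens_s, dim=-1)
    sim = q @ s.t()                   # (N, N) cosine similarities
    return sim.max(dim=1).values.mean()

# Toy few-shot usage: assign the query to the most similar support video
# (5-way 1-shot with random features, purely illustrative).
query = torch.randn(8, 256)
support = [torch.randn(8, 256) for _ in range(5)]
scores = torch.stack([tokenwise_similarity(query, s) for s in support])
print("predicted class:", scores.argmax().item())
```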

Fri 13 June 11:45 - 12:00 PDT

Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

Zhejun Zhang · Peter Karkus · Maximilian Igl · Wenhao Ding · Yuxiao Chen · Boris Ivanovic · Marco Pavone

Traffic simulation aims to learn a policy for traffic agents that, when unrolled in closed-loop, faithfully recovers the joint distribution of trajectories observed in the real world. Inspired by large language models, tokenized multi-agent policies have recently become the state-of-the-art in traffic simulation. However, they are typically trained through open-loop behavior cloning, and thus suffer from covariate shift when executed in closed-loop during simulation. In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. CAT-K fine-tuning only requires existing trajectory data, without reinforcement learning or generative adversarial imitation. Concretely, CAT-K fine-tuning enables a small 7M-parameter tokenized traffic simulation policy to outperform a 102M-parameter model from the same model family, achieving the top spot on the Waymo Sim Agent Challenge leaderboard at the time of submission. Our code will be made publicly available.
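
Below is a minimal sketch of one CAT-K rollout step, based on the abstract's description: among the policy's top-K token candidates, the one closest to the ground truth is executed, while supervision remains ordinary next-token prediction on the ground-truth token. The distance measure (here, Euclidean distance in a token-embedding space) and the interfaces are assumptions, not the paper's exact implementation.

```python
import torch

def cat_k_rollout_step(logits, gt_token, token_embeddings, k=5):
    """Pick the executed motion token for one agent at one rollout step.

    logits:           (V,) policy scores over the motion-token vocabulary
    gt_token:         index of the ground-truth next token
    token_embeddings: (V, D) embeddings used to measure closeness to GT
    Among the top-k tokens under the current policy, the one closest to the
    ground-truth token is executed, keeping the closed-loop rollout near the
    logged trajectory while still visiting policy-induced states.
    """
    topk = torch.topk(logits, k).indices                       # candidate tokens
    dists = torch.cdist(token_embeddings[topk],                 # (k, 1) distances to GT
                        token_embeddings[gt_token][None])
    executed = topk[dists.squeeze(1).argmin()]
    return executed

# Toy usage with a random "vocabulary" of 128 motion tokens.
V, D = 128, 16
logits = torch.randn(V)
emb = torch.randn(V, D)
gt = 42
step = cat_k_rollout_step(logits, gt, emb)
# Supervision stays plain cross-entropy against the ground-truth token.
loss = torch.nn.functional.cross_entropy(logits[None], torch.tensor([gt]))
print("executed token:", step.item(), "loss:", loss.item())
```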

Fri 13 June 12:00 - 12:15 PDT

Award Candidate
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition

Otto Brookes · Maksim Kukushkin · Majid Mirmehdi · Colleen Stephens · Paula Dieguez · Thurston Cleveland Hicks · Sorrel CZ Jones · Kevin C. Lee · Maureen S. McCarthy · Amelia C. Meier · NORMAND E. · Erin G. Wessling · Roman M. Wittig · Kevin Langergraber · Klaus Zuberbühler · Lukas Boesch · Thomas Schmid · Mimi Arandjelovic · Hjalmar S. Kühl · Tilo Burghardt

Computer vision analysis of camera trap video footage is essential for wildlife conservation, as captured behaviours offer some of the earliest indicators of changes in population health. Recently, several high-impact animal behaviour datasets and methods have been introduced to encourage their use; however, the role of behaviour-correlated background information and its significant effect on out-of-distribution generalisation remain unexplored. In response, we present the PanAf-FGBG dataset, featuring 20 hours of wild chimpanzee behaviours recorded at over 350 individual camera locations. Uniquely, it pairs every video containing a chimpanzee (referred to as a foreground video) with a corresponding background video (with no chimpanzee) from the same camera location. We present two views of the dataset: one with overlapping camera locations and one with disjoint locations. This setup enables, for the first time, direct evaluation under in-distribution and out-of-distribution conditions, and allows the impact of backgrounds on behaviour recognition models to be quantified. All clips come with rich behavioural annotations and metadata, including unique camera IDs and detailed textual scene descriptions. Additionally, we establish several baselines and present a highly effective latent-space normalisation technique that boosts out-of-distribution performance by +5.42% mAP for convolutional and +3.75% mAP for transformer-based models. Finally, we provide an in-depth analysis of the role of backgrounds in out-of-distribution behaviour recognition, including the so far unexplored impact of background durations (i.e., the count of background frames within foreground videos). The full dataset, baseline models, and weights will be available at 'anonymous'.

Fri 13 June 12:15 - 12:30 PDT

Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers

Yichen Xiao · Shuai Wang · Dehao Zhang · Wenjie Wei · Yimeng Shan · Xiaoli Liu · Yulin Jiang · Malu Zhang

Transformers significantly raise the performance limits across various tasks, spurring research into integrating them into spiking neural networks. However, a notable performance gap remains between existing spiking Transformers and their artificial neural network counterparts. Here, we first analyze the reason for this gap and identify that the dot product ineffectively calculates similarity between spiking Queries (Q) and Keys (K). To address this challenge, we introduce an innovative α-XNOR similarity calculation method tailored for spike trains. α-XNOR similarity redefines the correlation of non-spike pairs as a specific value α, effectively overcoming the limitations of dot-product similarity caused by numerous non-spiking events. Additionally, considering the sparse nature of spike trains, where spikes carry more information than non-spikes, the α-XNOR similarity correspondingly highlights the distinct importance of spikes over non-spikes. Extensive experiments demonstrate that our α-XNOR similarity significantly improves performance across different spiking Transformer architectures on various static and neuromorphic datasets. This is the first attempt to develop a spiking self-attention paradigm tailored for the binary characteristics of spike trains.
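
A hedged sketch of the α-XNOR similarity as described above: a spike-spike pair scores 1, a non-spike pair scores α rather than 0, and scores are summed over the feature dimension (scoring mismatched pairs as 0 follows plain XNOR and is an assumption here). How α is chosen or scheduled, and how the scores enter the rest of the spiking self-attention, are left to the paper.

```python
import torch

def alpha_xnor_similarity(Q, K, alpha=0.5):
    """α-XNOR similarity between binary spiking Queries and Keys.

    Q: (..., N, D) and K: (..., M, D) spike tensors with entries in {0, 1}.
    Per element: a spike-spike pair (1,1) scores 1, a non-spike pair (0,0)
    scores alpha, and a mismatch scores 0; scores are summed over the feature
    dimension. With alpha = 0 this reduces to the ordinary spiking dot
    product; alpha in (0, 1) credits agreement on silence while still
    weighting spikes more heavily, as the abstract describes.
    """
    match_spike = Q @ K.transpose(-2, -1)               # counts of (1,1) pairs
    match_silent = (1 - Q) @ (1 - K).transpose(-2, -1)  # counts of (0,0) pairs
    return match_spike + alpha * match_silent

# Toy usage: attention scores over sparse binary spike trains.
Q = (torch.rand(1, 4, 8) > 0.8).float()
K = (torch.rand(1, 4, 8) > 0.8).float()
print(alpha_xnor_similarity(Q, K, alpha=0.25))
```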