TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition
Fang Li ⋅ Shihao Zou ⋅ Weixin Si ⋅ Yang Gao ⋅ Shuai Li ⋅ Aimin Hao
Abstract
Understanding complex surgical scenes requires recognizing multiple interdependent entities, such as instruments, actions, and targets, and maintaining their relational consistency over time. Existing surgical triplet recognition methods struggle to jointly model intra-frame label dependencies and inter-frame temporal semantics in a unified manner. To address this limitation, we propose a unified framework that integrates spatial, relational, and temporal cues for robust surgical triplet recognition. Specifically, class-specific spatial priors are first extracted by a multi-scale encoder. These priors are then refined by a Label Correlation Modeling module with multi-scale class-activation-map-guided relational extraction (MS-CAMRE), enabling the model to capture both static co-occurrence and dynamic contextual dependencies among triplet components. Furthermore, a Bidirectional Temporal-Relational Fusion Attention (BTRFA) module harmonizes temporal and relational representations to achieve coherent temporal reasoning. We also introduce a new evaluation metric, the Triplet Consistency Error Rate (TCER), which quantifies a model's ability to preserve causal and semantic consistency across triplets. Extensive experiments on the CholecT45 and ProStaTD datasets show that our method achieves state-of-the-art (SOTA) performance, improving $AP_{IVT}$ by 5.1\% and 7.8\%, respectively. Moreover, our approach reduces TCER by over 36\% and 25\% in relative terms on the two datasets, respectively, underscoring the effectiveness of our framework's temporal-relational co-reasoning.
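The abstract does not spell out how TCER is computed. As a purely illustrative sketch, and not the paper's actual definition, a triplet consistency error rate over the evaluated frames $\mathcal{T}$ of a video could take the form
\[
\mathrm{TCER} \;=\; \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathbb{1}\!\left[\hat{y}_t \notin \mathcal{V}\right],
\]
where $\hat{y}_t$ denotes the predicted (instrument, verb, target) triplet at frame $t$ and $\mathcal{V}$ is a set of semantically admissible combinations, e.g. those observed in the training label space. The notation $\mathcal{T}$, $\hat{y}_t$, and $\mathcal{V}$ is ours, introduced here only for illustration.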