Self-Critical Distillation Network for Video-based Commonsense Captioning
Mengqi Yuan ⋅ Gengyun Jia ⋅ Bing-Kun Bao
Abstract
Video-based commonsense captioning aims to generate captions for video content while providing multiple commonsense descriptions about the underlying events. Existing approaches rely on constructing a "video $\rightarrow$ content caption $\rightarrow$ commonsense" reasoning chain, which generates visually ungrounded commonsense and neglects inter-category commonsense correlations. First, the reasoning chain induces excessive reliance on the content caption when generating commonsense, resulting in generic outputs with limited visual relevance. Second, the reasoning chain adopts multiple isolated decoders for commonsense generation, which fails to leverage the correlations between different categories of commonsense. To address these limitations, we introduce a novel self-critical distillation network (SCD-Net), which optimizes the reasoning chain by enhancing visual reasoning and establishing inter-category commonsense correlations. Specifically, on the one hand, we introduce self-critical learning and design a reward function that allows the model to refine its own outputs. This mechanism incentivizes the model to make full use of visual information, thereby improving its capacity for visual comprehension. On the other hand, we propose a joint reasoning distillation framework that fosters mutual inference among diverse commonsense categories. In this framework, we incorporate a cascaded decoder and a knowledge distillation strategy to facilitate inter-category commonsense knowledge transfer while keeping test-time evaluation fair. Our experiments on the large-scale Video-to-Commonsense dataset demonstrate that our approach performs favorably against state-of-the-art methods. The code will be released soon.
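For reference, self-critical learning of the kind described above typically follows the standard self-critical sequence training (SCST) formulation, in which a sampled sequence is rewarded relative to the model's own greedy decode. The sketch below uses generic symbols ($w^{s}$ for the sampled sequence, $\hat{w}$ for the greedy baseline, $r$ for the reward) and is only an illustrative assumption; the specific reward function used in SCD-Net is part of this work's design and is not reproduced here.

$$
\nabla_{\theta} \mathcal{L}(\theta) \approx -\left(r(w^{s}) - r(\hat{w})\right)\nabla_{\theta} \log p_{\theta}(w^{s})
$$

In this formulation, using the model's own greedy output as the baseline removes the need for a separately learned critic and pushes the model to produce samples that outscore its current test-time behavior.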