Asynchronous Temporal Modeling with a Two-Agent Framework for Streaming Dense Video Captioning
Abstract
Streaming dense video captioning requires real-time processing of continuous visual input while determining precisely when and what to caption. Current approaches focus primarily on designing complex external memory mechanisms, failing to leverage the inherent long-context capabilities of Large Multimodal Models (LMMs). Moreover, existing methods that employ threshold-based caption triggering suffer from a severe Threshold-Gated Discrepancy (TGD) problem: a training-inference mismatch, rooted in data imbalance, in which models predominantly predict silence tokens, so that the triggering threshold varies drastically across videos and has an extremely narrow effective range. We introduce Takusen, a two-agent framework for asynchronous temporal modeling that comprises a Small Multimodal Model (SMM) acting as an Oracle agent and an LMM acting as a Listener agent. The Oracle agent processes sparse video inputs at an accelerated rate to detect event boundaries, while the Listener agent processes dense inputs and generates accurate captions when prompted by the Oracle's signals. This architecture eliminates threshold dependencies by fundamentally changing how silence/generation decisions are made, thereby resolving the TGD problem. To improve robustness against unstable boundary predictions, we combine uniformly distributed fixed decoding points with the Oracle-predicted boundaries. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that Takusen achieves state-of-the-art performance with a simpler, more efficient design that balances temporal sensitivity with descriptive accuracy.
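To make the scheduling idea concrete, the following minimal Python sketch illustrates how uniformly spaced fixed decoding points could be merged with Oracle-predicted boundaries to trigger the Listener, replacing threshold gating. All names here (merge_decoding_points, stream_captions, listener_caption, the de-duplication tolerance) are hypothetical illustrations under the assumption that decoding points are expressed as timestamps in seconds; they are not the authors' implementation.

```python
from typing import Callable, List, Tuple


def merge_decoding_points(
    fixed_points: List[float],
    predicted_boundaries: List[float],
    tolerance: float = 1.0,
) -> List[float]:
    """Union of fixed uniform decoding points and Oracle-predicted
    boundaries, skipping predictions within `tolerance` seconds of an
    existing point (an assumed de-duplication rule)."""
    points = list(fixed_points)
    for t in predicted_boundaries:
        if all(abs(t - p) > tolerance for p in points):
            points.append(t)
    return sorted(points)


def stream_captions(
    video_duration: float,
    num_fixed_points: int,
    oracle_boundaries: List[float],
    listener_caption: Callable[[float], str],
) -> List[Tuple[float, str]]:
    """Invoke the Listener at every scheduled decoding point, so the
    silence/generation decision needs no learned threshold."""
    fixed = [
        video_duration * (i + 1) / num_fixed_points
        for i in range(num_fixed_points)
    ]
    schedule = merge_decoding_points(fixed, oracle_boundaries)
    return [(t, listener_caption(t)) for t in schedule]


if __name__ == "__main__":
    # Toy run: a 60 s video, 4 fixed points, two Oracle boundaries.
    captions = stream_captions(
        video_duration=60.0,
        num_fixed_points=4,
        oracle_boundaries=[12.5, 44.0],
        listener_caption=lambda t: f"caption at {t:.1f}s",  # stand-in for the LMM call
    )
    for t, c in captions:
        print(f"{t:6.1f}s  {c}")
```

In this sketch the fixed points act as a fallback grid: even if the Oracle misses or mislocates a boundary, the Listener is still decoded at regular intervals, which is one plausible reading of how the framework trades temporal sensitivity against robustness.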