DarkAct: An RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition
Abstract
Human action recognition (HAR) in low-light environments remains challenging due to degraded visibility, illumination variation, and loss of appearance cues. We introduce DarkAct, a large-scale, high-quality RGB–thermal video dataset purpose-built for multimodal action recognition under low illumination. DarkAct contains 12,778 paired RGB–thermal videos covering 27 human actions across diverse viewpoints and scenes, offering a novel and comprehensive benchmark for understanding human actions in dark environments. We conduct extensive experiments on DarkAct, systematically benchmarking unimodal HAR models, multimodal fusion frameworks, and vision-language foundation models. Their limited performance on DarkAct underscores the urgent need for more robust perception systems under adverse illumination. To address this, we propose DarkAct-Net, an RGB–thermal fusion framework that enhances human-centric representation and performs adaptive cross-modal fusion, enabling robust and fine-grained action recognition across diverse lighting conditions. The dataset and code will be publicly released.