Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination
Songyuan Yang ⋅ Guijian Tang ⋅ Kun Hu ⋅ Haotian Wang ⋅ Shixuan Liu ⋅ Wenjing Yang ⋅ Long Lan ⋅ Huibin Tan
Abstract
Hallucination limits the reliability of multimodal large language models (MLLMs), and it is particularly damaging in video, where errors manifest as distorted narratives rather than single-frame mistakes. We introduce a frame-first study of **Chimera Hallucination**: the model stitches together visual segments that exist in space and time but do not belong to the same event chain, producing a spuriously continuous story. We propose **CH-Risk**, a single-forward, reference-free risk estimate tailored to this failure mode. CH-Risk combines two complementary signals: SegCoverage@$\alpha$ ($\mathrm{SCR}@\alpha$) measures how many event segments are needed to cover most of the text-to-frame support, exposing long-range stitching; Alignment with Early Temporal Pathway (AETP) measures rank consistency between that support and the temporal pathway formed in early-to-middle layers, exposing stage mismatch. To turn risk into correction, we further propose **CH-Mitigation (CH-M)**, a train-free two-stage intervention. Segment-aligned Stage-Aligned Frame Routing (sSAFR) re-weights frames before the mid-layer softmax to route attention toward a small set of pathway-aligned segments; Residual Token Calibration (RTC) then stabilizes token usage within the selected segments. Extensive experiments across 9 benchmarks and 6 VideoLLMs show that CH-Risk predicts Chimera Hallucination and that CH-M consistently reduces hallucination and improves task accuracy with negligible overhead (sub-5\% latency, sub-2.5\% memory, $\approx$1\% FLOPs).
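To make the two risk signals concrete, below is a minimal Python sketch of how SegCoverage@$\alpha$ and AETP could be computed from a per-frame text-to-frame support vector, event-segment labels, and an early-layer pathway attention vector. The function names, the per-segment sum aggregation of support, and the use of Spearman rank correlation for AETP are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np
from scipy.stats import spearmanr


def scr_at_alpha(support, segment_ids, alpha=0.9):
    """SegCoverage@alpha (SCR@alpha), sketched: the minimal number of event
    segments whose aggregated text-to-frame support covers an alpha fraction
    of the total support mass. Needing many segments suggests the answer is
    stitched from far-apart events (long-range stitching risk)."""
    support = np.asarray(support, dtype=float)
    # Aggregate per-frame support into per-segment mass (sum is an assumption).
    seg_mass = {}
    for s, seg in zip(support, segment_ids):
        seg_mass[seg] = seg_mass.get(seg, 0.0) + s
    masses = np.sort(np.fromiter(seg_mass.values(), dtype=float))[::-1]
    # Smallest prefix of segments (largest first) reaching the alpha fraction.
    cum = np.cumsum(masses) / masses.sum()
    return int(np.searchsorted(cum, alpha) + 1)


def aetp(support, pathway):
    """Alignment with Early Temporal Pathway (AETP), sketched as Spearman
    rank correlation between per-frame support and the attention mass of the
    temporal pathway formed in early-to-middle layers. Low correlation
    signals a stage mismatch."""
    rho, _ = spearmanr(support, pathway)
    return float(rho)


if __name__ == "__main__":
    # Toy example: 6 frames grouped into 3 event segments, with support
    # concentrated on segment 0 and a pathway that ranks frames similarly.
    support = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
    segment_ids = [0, 0, 1, 1, 2, 2]
    pathway = [0.4, 0.35, 0.1, 0.08, 0.04, 0.03]
    print(scr_at_alpha(support, segment_ids, alpha=0.9))  # 2: few segments, low risk
    print(aetp(support, pathway))                         # near 1.0: well aligned
```

In this toy case the 0.9 support mass is covered by two segments and the pathway ranking agrees with the support ranking, so both signals indicate low Chimera risk; the paper combines them into a single CH-Risk estimate.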