Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization
Abstract
Temporal Forgery Localization (TFL) is crucial for enhancing the interpretability and accountability of deepfake forensics because it pinpoints the manipulated segments of a video. However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models cannot correct biases in their initial predictions and an emphasis on temporal modeling overlooks intra-modal semantic forgery cues, resulting in noise-sensitive localization; and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate the varying manipulation durations found in real-world scenarios. To address these challenges, we propose a unified framework based on structural–semantic perception and diffusion-guided refinement. Structural–semantic perception comprises two complementary components: (1) structural perception, which adaptively models manipulation durations across varying temporal spans using a dedicated scale weight allocation network, and (2) semantic perception, which analyzes the semantic consistency within each modality through intra-modal distillation. Together, these components first suppress low-quality forgery localization proposals, yielding a structurally and semantically reliable candidate set. A diffusion-based regression head then iteratively refines the candidates into precise and temporally coherent boundary trajectories. Extensive experiments on multiple TFL benchmarks demonstrate that our method achieves state-of-the-art performance.
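As a rough illustration of the structural-perception idea only, the sketch below fuses several fixed-scale temporal views of a per-frame score sequence using softmax weights. It is a minimal toy under stated assumptions, not the proposed network: the function names (`softmax`, `moving_average`, `fuse_scales`), the use of moving averages as "scales", and the hand-set logits are all hypothetical. In the framework described above, the weights would be produced by the learned scale weight allocation network rather than supplied by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moving_average(signal, window):
    """Centered moving average; the window is truncated at the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def fuse_scales(signal, windows, scale_logits):
    """Blend views of `signal` at several temporal scales.

    `scale_logits` stands in for the output of a learned allocation
    network (an assumption made for this sketch); here it is hand-set.
    Returns the fused sequence and the softmax weights that were used.
    """
    weights = softmax(scale_logits)
    views = [moving_average(signal, w) for w in windows]
    fused = [
        sum(w * view[i] for w, view in zip(weights, views))
        for i in range(len(signal))
    ]
    return fused, weights


# Toy per-frame forgery scores with a short manipulated span in the middle.
frame_scores = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
fused, weights = fuse_scales(frame_scores, windows=(1, 3, 5), scale_logits=[0.0, 0.0, 0.0])
```

Raising the logit of a small window favors short manipulations, while raising the logit of a large window favors long ones. This is the sense in which adaptive weighting can accommodate varying manipulation durations where a single fixed receptive field cannot.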