TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas
Abstract
Video highlight detection aims to identify the most engaging segments in long-form videos, supporting content editing and recommendation, especially for movies and TV dramas. However, existing methods are ill-suited to cinematic content because of its narrative complexity, and the scarcity of annotated data, together with the high cost of manual labeling, further hinders progress. To bridge this gap, we introduce TVHighlights, the first large-scale dataset tailored for video highlight detection in movies and TV dramas, comprising 1,721 carefully curated videos spanning diverse genres. Built on community-driven behaviors, it provides realistic and diverse annotations without human labeling. On top of TVHighlights, we propose LTV-HD, an LLM-guided, human-free collaborative training framework for video highlight detection in cinematic content. LTV-HD operates in two stages: (1) weakly supervised pre-training of a lightweight model on video-level labels, followed by (2) iterative refinement through collaboration between large language models (LLMs) and the lightweight model. The LLM generates noisy clip-level pseudo-labels, which the lightweight model learns from under a noise-robust training strategy; the model's high-confidence predictions are then fed back to guide the LLM in distilling genre-specific highlight patterns, forming a self-improving loop. Experiments show that LTV-HD achieves state-of-the-art performance on TVHighlights, validating its effectiveness in real-world, annotation-free scenarios.
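The two-stage collaboration described above can be sketched in simplified form. This is a hypothetical illustration, not the authors' implementation: every name (`llm_label`, `LightweightModel`, `collaborative_training`) and every numeric choice (thresholds, flip rates, the scalar clip features) is an assumption made for the sake of a runnable toy, with the LLM replaced by a noisy labeling stub and the noise-robust step reduced to a threshold update.

```python
import random

random.seed(0)  # make the noisy stub deterministic for this toy run

def llm_label(clips, patterns):
    """Stand-in for the LLM labeler: noisy clip-level pseudo-labels.
    `patterns` mimics distilled genre-specific cues; more feedback rounds
    lower the label-flip rate, imitating the self-improving loop."""
    flip_rate = max(0.05, 0.3 - 0.1 * len(patterns))
    labels = [1 if c > 0.5 else 0 for c in clips]
    return [l if random.random() > flip_rate else 1 - l for l in labels]

class LightweightModel:
    """Toy highlight scorer: one learned threshold over a scalar clip feature."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def predict(self, clips):
        # (label, confidence) pairs; confidence = margin from the boundary.
        return [(int(c > self.threshold), abs(c - self.threshold)) for c in clips]

    def fit(self, clips, labels):
        # Drastically simplified "noise-robust" update: place the threshold
        # midway between the average positive and average negative scores,
        # which dampens the effect of individual flipped labels.
        pos = [c for c, l in zip(clips, labels) if l == 1]
        neg = [c for c, l in zip(clips, labels) if l == 0]
        if pos and neg:
            self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def collaborative_training(clips, rounds=3, conf_tau=0.2):
    """One possible shape of the Stage-2 loop: LLM pseudo-labels -> model
    fit -> high-confidence positives fed back as 'patterns' for the LLM."""
    model, patterns = LightweightModel(), []
    for _ in range(rounds):
        pseudo = llm_label(clips, patterns)   # noisy clip-level pseudo-labels
        model.fit(clips, pseudo)              # noise-robust model update
        confident = [c for c, (lbl, conf) in zip(clips, model.predict(clips))
                     if lbl == 1 and conf > conf_tau]
        patterns.append(confident)            # feedback guiding the next round
    return model, patterns

clips = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]       # synthetic per-clip features
model, patterns = collaborative_training(clips)
```

In the real framework the lightweight model would be a trained video network and the pseudo-labels would come from prompting an actual LLM; the sketch only conveys the control flow of the loop.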