Building a Precise Video Language with Human–AI Oversight
Abstract
Video–language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial relationships, and camera dynamics, supported by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce a critique-based human–AI (CHAI) oversight framework, in which trained human experts provide corrective critiques that revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to focus on verification. In addition, the critiques and the preferences between pre- and post-captions provide rich supervision for fine-tuning, improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through standard SFT, offline RL (DPO), online RL (GSPO), and inference-time scaling. With modest expert supervision, the resulting system outperforms even closed-source models such as Gemini-2.5-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of over 400 words, achieving finer control over camera motion, angle, lens, perspective, and shot composition. Overall, our results show that precise specification and human–AI oversight are key to achieving professional-level video understanding and generation.