

DiVAS: Video and Audio Synchronization with Dynamic Frame Rates

Clara Maria Fernandez Labrador · Mertcan Akcay · Eitan Abecassis · Joan Massich · Christopher Schroers

Arch 4A-E Poster #260
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Synchronization issues between audio and video are among the most disturbing quality defects in film production and live broadcasting. Even a discrepancy as short as 45 milliseconds can degrade the viewer’s experience enough to warrant manual quality checks over entire movies. In this paper, we study the automatic discovery of such issues. Specifically, we focus on the alignment of lip movements with spoken words, targeting realistic production scenarios that can include background noise and music, intricate head poses, excessive makeup, or scenes with multiple individuals where the speaker is unknown. Our model’s robustness also extends to various media specifications, including different video frame rates and audio sample rates. To address these challenges, we present a model fully based on transformers that encodes face crops or full video frames and raw audio using timestamp information, identifies the speaker, and provides highly accurate synchronization predictions much faster than previous methods.
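The abstract attributes frame-rate robustness to encoding frames with timestamp information rather than frame indices. As a minimal sketch of that general idea (not the paper's actual architecture), a continuous-time sinusoidal positional encoding maps each frame's real-valued timestamp into a shared feature space, so clips at 24, 25, or 60 fps are represented on the same time axis. The function name, dimensions, and `max_period` below are illustrative assumptions.

```python
# Hypothetical sketch: continuous-time sinusoidal positional encoding,
# one way a transformer can ingest frames at arbitrary frame rates.
# All names and hyperparameters here are illustrative, not from the paper.
import numpy as np

def timestamp_encoding(timestamps, dim=64, max_period=10.0):
    """Map real-valued timestamps (in seconds) to sinusoidal features.

    Because the encoding depends on the timestamp itself rather than a
    frame index, streams sampled at different frame rates share one
    representation space.
    """
    t = np.asarray(timestamps, dtype=np.float64)[:, None]        # (N, 1)
    freqs = 1.0 / (max_period ** (np.arange(dim // 2) / (dim // 2)))
    angles = t * freqs[None, :]                                  # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Frames from a 24 fps clip and a 60 fps clip land on the same time axis;
# frame 0 of both maps to t = 0 and therefore to identical features.
enc_24 = timestamp_encoding(np.arange(5) / 24.0)
enc_60 = timestamp_encoding(np.arange(5) / 60.0)
```

With index-based positional encodings, frame 5 of a 24 fps clip and frame 5 of a 60 fps clip would receive the same position despite occurring at different times; timestamp-based encoding avoids that mismatch.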
