Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
Abstract
We introduce Perception Encoder-Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE~\citep{pe}, PE-AV extends its representations to audio and natively supports joint embeddings across the audio–video, audio–text, and video–text modality pairs. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval and set a new state of the art on standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio–video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data spans speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We train with ten pairwise contrastive objectives and show that scaling the number of cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. Our models and code will be made available.
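As a rough illustration of the training objective (a minimal sketch, not the paper's implementation): each pairwise objective can be realized as a symmetric InfoNCE loss over a batch of paired embeddings, with the total loss summed over all modality and caption-type pairs. The sketch below assumes PyTorch; the embedding names, dimensions, and the particular pair list are hypothetical placeholders.

\begin{verbatim}
# Minimal sketch: summing symmetric InfoNCE losses over several
# modality/caption-type pairs. All tensor names and the pair list
# are illustrative assumptions, not PE-AV's actual code.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Average the two retrieval directions (a -> b and b -> a).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical per-modality embeddings for one batch (B=8, dim=512).
B, D = 8, 512
emb = {
    "audio":      torch.randn(B, D),
    "video":      torch.randn(B, D),
    "audio_text": torch.randn(B, D),  # caption of the audio track
    "video_text": torch.randn(B, D),  # caption of the visual track
}

# Each tuple is one pairwise objective; the full recipe uses ten
# such pairs spanning modalities and caption types.
pairs = [("audio", "video"),
         ("audio", "audio_text"),
         ("video", "video_text")]
loss = sum(info_nce(emb[x], emb[y]) for x, y in pairs)
\end{verbatim}

Summing per-pair losses in this way lets each modality act as both query and target, which is one plausible reading of how adding cross-modality and caption-type pairs strengthens alignment.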