

AV-RIR: Audio-Visual Room Impulse Response Estimation

Anton Ratnarajah · Sreyan Ghosh · Sonal Kumar · Purva Chiniya · Dinesh Manocha

Arch 4A-E Poster #290
[ Project Page ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract: Accurate estimation of the Room Impulse Response (RIR), which captures an environment's acoustic properties, can aid in synthesizing speech as if it were spoken in that environment. We propose AV-RIR, a novel multi-modal multi-task learning approach that accurately estimates the RIR from a reverberant speech signal and visual cues of its corresponding environment. AV-RIR builds on a novel neural architecture that effectively captures environment geometry and material properties and solves speech dereverberation as an auxiliary task. We also propose Geo-Mat features, which augment visual cues with material information, and CRIP, which improves late reverberation components in the estimated RIR by 86% via image-to-RIR retrieval. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches, achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. It also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverberated speech from AV-RIR shows competitive performance with the state of the art on a variety of spoken language processing tasks and achieves a better $T_{60}$ error score on the real-world AVSpeech dataset. Code and qualitative examples of both synthesized reverberant speech and enhanced speech can be found in the supplementary.
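The abstract's premise is that an RIR fully characterizes a room's acoustics, so convolving a dry (anechoic) signal with the RIR synthesizes speech as if spoken in that room. A minimal sketch of this operation, using toy synthetic signals in place of real audio (the decaying-exponential "RIR" and the signal lengths are illustrative assumptions, not part of AV-RIR):

```python
import numpy as np

def apply_rir(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a (nominally anechoic) speech signal with a room impulse
    response to synthesize reverberant speech in that room."""
    return np.convolve(speech, rir)

# Toy stand-ins for real audio at a 16 kHz sampling rate.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)            # 1 s of "speech"
decay = np.exp(-np.linspace(0.0, 8.0, 4800))   # 300 ms exponential decay
rir = decay * rng.standard_normal(4800)        # noise-modulated tail as a crude RIR

reverberant = apply_rir(speech, rir)           # length: 16000 + 4800 - 1 samples
```

Estimating the RIR from `reverberant` alone (the inverse problem AV-RIR addresses) is ill-posed, which is why the paper brings in visual cues about geometry and materials.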
