Interactive Episodic Memory with User Feedback
Abstract
Human memory is often unreliable. We forget where we placed objects, overlook small details, and struggle to recall past events accurately. Episodic Memory with Natural Language Query (EM-NLQ) seeks to overcome these limitations by allowing users to search their past visual experiences, captured as egocentric videos, with natural language questions. While recent models address EM-NLQ challenges such as noisy input videos and efficiency, they overlook a key aspect of the task: interactivity. In real scenarios, users can refine their queries and provide feedback when a model's response is off-target, yet current EM-NLQ methods cannot incorporate or benefit from such feedback. To address this gap, we introduce the first \textit{interactive} EM-NLQ framework, built around a plug-and-play Feedback ALignment Module (FALM) that enables existing models to efficiently incorporate user feedback and refine their predictions. Additionally, we introduce the Episodic Memory with Questions and Feedback (EM-QnF) task, new datasets tailored to feedback-based interaction, and a lightweight training scheme that eliminates the need for expensive sequential optimization. Our approach, dubbed ReFocus, combines FALM with leading EM-NLQ methods to achieve state-of-the-art results on three challenging benchmarks and yields significant improvements in human feedback evaluations, bringing EM-NLQ closer to truly interactive and adaptive visual memory systems.