From Remember to Transfer: Interpretable Open-World Reasoning in MLLMs
Abstract
Multimodal agents, such as JARVIS-1, are rapidly advancing in open-world environments. Their core workflow typically follows a perception–reasoning–action–memory cycle. Existing studies primarily emphasize improving memory representations and storage formats, treating memory mainly as an information repository; however, distilling transferable knowledge from stored experiences remains an important yet underexplored challenge. In real-world settings, structures and patterns tend to recur. If an agent can capture and reuse these latent patterns, it can infer new actionable knowledge from prior experience, enabling more efficient and flexible task execution. To explore this capability, we propose Echo. Echo decomposes knowledge into five explicit dimensions of transferability: structure, attribute, process, function, and interaction. Building on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to retrieve relevant past experiences and generalize them to new tasks. Experiments show that, in a from-scratch learning setting, Echo achieves a 1.3×–1.7× speed-up on object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short interval. These results demonstrate that robust knowledge transfer, driven by effective use of contextual examples, is a promising direction for advancing open-world multimodal agents.