

Poster

Magma: A Foundation Model for Multimodal AI Agents

Jianwei Yang · Reuben Tan · Qianhui Wu · Ruijie Zheng · Baolin Peng · Yongyuan Liang · Yu Gu · Mu Cai · Seonghyeon Ye · Joel Jang · Yuquan Deng · Jianfeng Gao


Abstract:

This paper presents a new foundation model, called Magma, for multimodal AI agents in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains their VL understanding ability (verbal intelligence), but is also equipped with the ability to plan and act in the visual-spatial world (spatial intelligence) to complete agentic tasks ranging from UI navigation to robot manipulation. Magma is pre-trained on large amounts of heterogeneous VL datasets, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set of Marks (SoM) and the object movements (e.g., the trace of a robotic arm) in videos are labeled by Trace of Mark (ToM). Evaluation shows that SoM and ToM facilitate the acquisition of spatial intelligence from training data. Magma achieves new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets.
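To make the annotation scheme concrete, the sketch below shows one plausible way to represent the SoM and ToM labels described above: numbered marks over actionable image regions and per-frame traces of a marked object in a video. The class names, fields, and file paths are illustrative assumptions, not the paper's actual data schema.

```python
# Hedged sketch (not the paper's actual schema): a minimal representation of
# Set-of-Marks (SoM) and Trace-of-Mark (ToM) annotations as described in the
# abstract: numbered marks on actionable objects in an image, and the path a
# marked object follows across video frames.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Mark:
    """A numbered mark over an actionable region (e.g., a clickable GUI button)."""
    mark_id: int
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates
    label: str = ""                           # optional description, e.g., "Go button"


@dataclass
class SoMAnnotation:
    """Set of Marks for a single image: all actionable objects, each with an ID."""
    image_path: str
    marks: List[Mark] = field(default_factory=list)


@dataclass
class ToMAnnotation:
    """Trace of Mark for a video clip: the trajectory of one marked object
    (e.g., a robot gripper), stored as (frame_index, x, y) points."""
    video_path: str
    mark_id: int
    trace: List[Tuple[int, float, float]] = field(default_factory=list)


if __name__ == "__main__":
    # Illustrative usage: a GUI screenshot with two clickable marks, and a
    # robot video in which mark 1 (the gripper) moves to the right over time.
    som = SoMAnnotation(
        image_path="screenshot.png",
        marks=[
            Mark(mark_id=1, bbox=(40, 60, 120, 90), label="Search box"),
            Mark(mark_id=2, bbox=(130, 60, 180, 90), label="Go button"),
        ],
    )
    tom = ToMAnnotation(
        video_path="robot_clip.mp4",
        mark_id=1,
        trace=[(0, 55.0, 70.0), (10, 80.0, 71.5), (20, 110.0, 73.0)],
    )
    print(f"{len(som.marks)} marks in {som.image_path}")
    print(f"trace of mark {tom.mark_id} spans {len(tom.trace)} keyframes")
```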
