Poster
Visioner: Exploring Knowledge Learning from Raw Videos
Zhongwei Ren · Yunchao Wei · Xun Guo · Yao Zhao · Bingyi Kang · Jiashi Feng · Xiaojie Jin
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models such as large language models (LLMs). We develop Visioner, an autoregressive video generation model trained exclusively on raw video data, and test its knowledge-acquisition abilities in video-based Go and robotic control environments. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning extensive knowledge, and (2) the compactness of visual representations significantly enhances learning efficiency. To improve both the efficiency and efficacy of knowledge learning, we introduce the Latent Dynamics Model (LDM). Remarkably, with just 300 million parameters, Visioner reaches a 5-dan professional level on Video-GoBench without relying on the search algorithms or reward mechanisms typical of reinforcement learning. This study opens new avenues for knowledge acquisition from visual data; all code, data, and models will be open-sourced to support further research.
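The abstract does not give implementation details, but the overall recipe it describes — an autoregressive transformer over discrete video tokens, plus a Latent Dynamics Model that compresses multi-frame motion into a few compact codes — can be sketched concretely. Below is a minimal PyTorch sketch under assumed design choices: a frozen discrete video tokenizer (e.g. a VQ-VAE) supplying token IDs, the LDM modeled as learned queries cross-attending to future-frame features, and an auxiliary head regressing those latent codes. All class names, dimensions, and the loss combination are illustrative assumptions, not the authors' actual method.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Hypothetical LDM sketch: a small set of learned queries
    cross-attends to token features of upcoming frames and compresses
    their dynamics into a handful of latent vectors."""
    def __init__(self, d_model: int = 512, n_latents: int = 4, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, future_feats: torch.Tensor) -> torch.Tensor:
        # future_feats: (batch, n_future_tokens, d_model)
        q = self.queries.unsqueeze(0).expand(future_feats.size(0), -1, -1)
        latents, _ = self.cross_attn(q, future_feats, future_feats)
        return latents  # (batch, n_latents, d_model): compact dynamics codes

class AutoregressiveVideoModel(nn.Module):
    """Decoder-only transformer over discrete video tokens. Alongside
    the usual next-token loss, an assumed auxiliary head predicts the
    LDM's compact dynamics codes from the current hidden state."""
    def __init__(self, vocab_size: int = 8192, d_model: int = 512,
                 n_layers: int = 8, n_latents: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=n_layers)
        self.token_head = nn.Linear(d_model, vocab_size)
        self.latent_head = nn.Linear(d_model, n_latents * d_model)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len) IDs from a frozen video tokenizer
        x = self.embed(tokens)
        # causal mask: each position attends only to the past
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(x, mask=mask)
        logits = self.token_head(h)               # next-token prediction
        latent_pred = self.latent_head(h[:, -1])  # predicted future dynamics codes
        return logits, latent_pred

# Toy training step under the same assumptions: next-token cross-entropy
# plus MSE against LDM codes computed from (hypothetical) future frames.
model = AutoregressiveVideoModel()
ldm = LatentDynamicsModel()
clip = torch.randint(0, 8192, (2, 64))           # two tokenized clips
future = torch.randint(0, 8192, (2, 16))         # tokens of the next frames
logits, latent_pred = model(clip)
target_latents = ldm(model.embed(future))
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 8192), clip[:, 1:].reshape(-1)
) + nn.functional.mse_loss(latent_pred, target_latents.flatten(1))
```

The intuition behind this shape of model, as the abstract suggests, is that predicting a few compact latent codes for upcoming dynamics is far cheaper and more informative than predicting every raw pixel, which would explain the reported gains in learning efficiency.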