Poster
VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning
Haoran Xu · Peixi Peng · Guang Tan · Yiqian Chang · Luntong Li · Yonghong Tian
Vision-based Reinforcement Learning (VRL) attempts to establish associations between visual inputs and optimal actions through interactions with the environment. Given the high-dimensional and complex nature of visual data, it is essential to learn the policy on top of high-quality state representations. To this end, existing VRL methods rely primarily on interaction-collected data combined with self-supervised auxiliary tasks. However, two key challenges remain: limited data samples and a lack of task-relevant semantic constraints. To tackle these challenges, we propose DGC, a method that Distills Guidance from Visual Language Models (VLMs), alongside self-supervised learning, into a Compact VRL agent. Notably, we leverage the state representation capabilities of VLMs rather than their decision-making abilities. Within DGC, a novel prompting-reasoning pipeline is designed to convert historical observations and actions into usable supervision signals, enabling semantic understanding within the compact visual encoder. By leveraging these distilled semantic representations, the VRL agent achieves significant improvements in sample efficiency. Extensive experiments on the CARLA benchmark demonstrate state-of-the-art performance. The source code is available in the supplementary material.
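To make the distillation idea concrete, below is a minimal, hypothetical PyTorch sketch of one way such guidance could be attached to a compact visual encoder: features from a small encoder are projected into the embedding space of a frozen VLM and aligned with its outputs via an auxiliary loss, which would be optimized alongside the RL and self-supervised objectives. All names (CompactEncoder, DistillationHead), dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): distilling frozen VLM
# embeddings into a compact visual encoder via an auxiliary loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompactEncoder(nn.Module):
    """Small CNN encoder producing the state representation for the RL agent."""
    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(obs))

class DistillationHead(nn.Module):
    """Projects compact encoder features into the (frozen) VLM embedding space."""
    def __init__(self, feature_dim: int, vlm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, vlm_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

def distillation_loss(student_feat, vlm_emb, head):
    """Cosine alignment between projected student features and VLM targets."""
    pred = F.normalize(head(student_feat), dim=-1)
    target = F.normalize(vlm_emb.detach(), dim=-1)  # VLM targets stay frozen
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Usage sketch: obs is a batch of observations; vlm_emb stands in for the
# embeddings a VLM would produce for the corresponding prompts built from
# historical observations and actions (cached offline here).
encoder = CompactEncoder(feature_dim=256)
head = DistillationHead(feature_dim=256, vlm_dim=512)
obs = torch.randn(8, 3, 84, 84)
vlm_emb = torch.randn(8, 512)  # placeholder for cached VLM outputs
loss_distill = distillation_loss(encoder(obs), vlm_emb, head)
# total_loss = rl_loss + lambda_ssl * ssl_loss + lambda_distill * loss_distill
```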