3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Abstract
Large Language Models are increasingly integrated as the cognitive core of 3D embodied agents to enable complex environmental reasoning. However, these agents tend to inherit the critical flaw of hallucination, often failing to ground their responses in their 3D observations. While Visual Contrastive Decoding (VCD) is a powerful training-free method for mitigating hallucinations in 2D image-based models, it has not been adapted to complex 3D embodied environments. In this paper, we bridge this gap by introducing, to our knowledge, the first VCD framework for 3D embodied agents. Our method operates at inference time by generating a "negative" 3D context, not by blurring an image as in 2D VCD, but by applying novel distortions directly to a 3D scene graph, such as swapping object category labels or noising positional coordinates. We evaluate our approach on standard benchmarks and find that it consistently outperforms existing baselines. For example, in the random category of 3D-POPE, our 3D-VCD method reduces the yes-rate from 99.9\% to 75.1\% while simultaneously increasing precision from 50.0\% to 62.2\%. These results demonstrate that our training-free approach effectively curbs hallucination, yielding 3D agents that are significantly more reliable and grounded.
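The two-part mechanism the abstract describes, corrupting a 3D scene graph to form a negative context and then contrasting the model's logits under the clean and corrupted contexts, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scene-graph representation, function names, and parameters (`swap_prob`, `noise_std`, `alpha`) are all assumptions for exposition.

```python
import copy
import random

def distort_scene_graph(scene_graph, swap_prob=0.5, noise_std=0.5, seed=0):
    """Hypothetical negative-context builder: swaps category labels
    between random object pairs and adds Gaussian noise to positions."""
    rng = random.Random(seed)
    distorted = copy.deepcopy(scene_graph)
    # Pair up objects in a random order and swap their category labels.
    indices = list(range(len(distorted)))
    rng.shuffle(indices)
    for i, j in zip(indices[::2], indices[1::2]):
        if rng.random() < swap_prob:
            distorted[i]["category"], distorted[j]["category"] = (
                distorted[j]["category"], distorted[i]["category"])
    # Perturb each object's 3D coordinates with Gaussian noise.
    for obj in distorted:
        obj["position"] = [c + rng.gauss(0.0, noise_std)
                           for c in obj["position"]]
    return distorted

def contrastive_logits(logits_pos, logits_neg, alpha=1.0):
    """Standard VCD adjustment: amplify logits from the clean context
    and subtract logits obtained under the distorted context."""
    return [(1 + alpha) * lp - alpha * ln
            for lp, ln in zip(logits_pos, logits_neg)]

# Toy usage: two objects, then a contrastive step over a 2-token vocabulary.
scene = [{"category": "chair", "position": [1.0, 0.0, 2.0]},
         {"category": "table", "position": [0.5, 0.0, 1.5]}]
negative_scene = distort_scene_graph(scene)
adjusted = contrastive_logits([2.0, 1.0], [1.5, 1.2], alpha=1.0)
```

In this sketch, tokens whose likelihood is driven by the corrupted context (i.e., hallucination-prone tokens) are penalized, while tokens grounded in the true scene graph are amplified before sampling.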