

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Chaoya Jiang · Haiyang Xu · Mengfan Dong · Jiaxing Chen · Wei Ye · Ming Yan · Qinghao Ye · Ji Zhang · Fei Huang · Shikun Zhang

Arch 4A-E Poster #278
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation: hallucination, the tendency to generate erroneous or fabricated information. In this paper, we address hallucination in MLLMs from the novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) the representations of texts that contain and do not contain hallucinations are entangled, making them difficult to distinguish. These two observations inspire a simple yet effective method to mitigate hallucination. Specifically, we introduce contrastive learning into MLLMs and use hallucinated text as hard negative examples, naturally bringing the representations of non-hallucinatory text and visual samples closer while pushing away the representations of non-hallucinatory and hallucinatory text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMHal-Bench benchmark, our method obtains a 34.66% / 29.5% improvement over the baseline MiniGPT-4 / LLaVA.
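The contrastive objective described above can be sketched as an InfoNCE-style loss in which the visual representation is pulled toward the matching non-hallucinatory text representation and pushed away from hallucinated-text representations used as hard negatives. This is a minimal NumPy illustration of that idea, not the authors' implementation; the function name, shapes, and temperature value are assumptions for illustration.

```python
import numpy as np

def contrastive_loss_with_hard_negatives(visual, pos_text, neg_text, tau=0.07):
    """InfoNCE-style loss (illustrative sketch, not the paper's code).

    visual:   (d,) visual representation
    pos_text: (d,) non-hallucinatory text representation (the positive)
    neg_text: (k, d) hallucinated text representations (hard negatives)
    tau:      softmax temperature (0.07 is a common default, assumed here)
    """
    # L2-normalize so dot products are cosine similarities.
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v, p, n = norm(visual), norm(pos_text), norm(neg_text)
    pos_sim = (v @ p) / tau          # similarity to the positive pair
    neg_sim = (n @ v) / tau          # similarities to the hard negatives
    logits = np.concatenate([[pos_sim], neg_sim])
    # Cross-entropy with the positive at index 0:
    # minimizing this pulls visual/positive together, pushes negatives apart.
    return -pos_sim + np.log(np.sum(np.exp(logits)))
```

Minimizing this loss over image–text pairs tightens cross-modal alignment (finding 1) while explicitly disentangling hallucinatory from non-hallucinatory text representations (finding 2), since the hallucinated captions sit directly in the denominator as competing negatives.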
