CVPR Poster Text Augmented Correlation Transformer For Few-shot Classification & Segmentation

Poster

Text Augmented Correlation Transformer For Few-shot Classification & Segmentation

Srinivasa Rao Nandam · Sara Atito · Zhenhua Feng · Josef Kittler · Muhammad Awais

ExHall D Poster #412

[ Abstract ] [ Paper PDF ]

Sun 15 Jun 8:30 a.m. PDT — 10:30 a.m. PDT

Abstract: Foundation models like CLIP and ALIGN have transformed few-shot and zero-shot vision applications by fusing visual and textual data, yet the integrative few-shot classification and segmentation (FS-CS) task primarily leverages visual cues, overlooking the potential of textual support. In FS-CS scenarios, ambiguous object boundaries and overlapping classes often hinder model performance, as limited visual data struggles to fully capture high-level semantics. To bridge this gap, we present a novel multi-modal FS-CS framework that integrates textual cues into support data, facilitating enhanced semantic disambiguation and fine-grained segmentation. Our approach first investigates the unique contributions of exclusive text-based support, using only class labels to achieve FS-CS. This strategy alone achieves performance competitive with vision-only methods on FS-CS tasks, underscoring the power of textual cues in few-shot learning. Building on this, we introduce a dual-modal prediction mechanism that synthesizes insights from both textual and visual support sets, yielding robust multi-modal predictions. This integration significantly elevates FS-CS performance, with classification and segmentation improvements of +3.7/6.6\% (1-way 1-shot) and +8.0/6.5\% (2-way 1-shot) on COCO-

20^{i}

, and +2.2/3.8\% (1-way 1-shot) and +4.3/4.0\% (2-way 1-shot) on Pascal-

5^{i}

. Additionally, in weakly supervised FS-CS settings, our method surpasses visual-only benchmarks using textual support exclusively, further enhanced by our dual-modal predictions. By rethinking the role of text in FS-CS, our work establishes new benchmarks for multi-modal few-shot learning and demonstrates the efficacy of textual cues for improving model generalization and segmentation accuracy.

Live content is unavailable. Log in and register to view live content