

Poster

Reasoning to Attend: Try to Understand How <SEG> Token Works

Rui Qian · Xin Yin · Dejing Dou

ExHall D Poster #353
[ Project Page ] [ Paper PDF ]
Sun 15 Jun 8:30 a.m. PDT — 10:30 a.m. PDT

Abstract: Tasks empowered by current Large Multimodal Models (LMMs), such as visual grounding and segmentation, typically rely on a $\texttt{<SEG>}$ token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how it works when mapping the language vocabulary embedding into the corresponding vision codebook space. In this work, we first visualize the similarity maps, a.k.a. pseudo images, which are obtained by computing the dot product similarity between the $\texttt{<SEG>}$ token and the image token embeddings derived from the last hidden layer in both the LLaVA and SAM models. Intriguingly, we find that a striking consistency holds in terms of activation responses in the pseudo images, which reveals that what the $\texttt{<SEG>}$ token contributes is the semantic correspondence from image-text pairs. Specifically, the $\texttt{<SEG>}$ token, a placeholder expanded in the text vocabulary, extensively queries individual tokenized image patches to map the semantics of an object from text to the paired image while the Large Language Model (LLM) is being fine-tuned. Building upon these findings, we present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points borrowed from pseudo images. Remarkably, READ features an intuitive design, the Similarity as Points module (SasP), which can be seamlessly applied to existing $\texttt{<SEG>}$-like paradigms with negligible overhead in a plug-and-play fashion. Also, extensive experiments have been conducted on the highly challenging reasoning segmentation dataset and the widely used RefCOCO(+/g) referring segmentation datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, as observed in prior works (e.g., LISA), we further assess its generation ability on the FP-RefCOCO(+/g) dataset. All code and models will be publicly available.
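The abstract describes two computations: a dot-product similarity map ("pseudo image") between the $\texttt{<SEG>}$ token embedding and the image token embeddings from the last hidden layer, and a point-selection step in the spirit of the Similarity-as-Points (SasP) module. Below is a minimal sketch of those two ideas, not the authors' implementation; the tensor names, the min-max normalization, and the top-k selection rule are assumptions made for illustration.

```python
# Sketch (assumed, not from the paper's code): build a pseudo image from the
# <SEG> token and image token hidden states, then pick highly activated
# points that could serve as point prompts for a segmenter such as SAM.
import torch


def pseudo_image(seg_embed: torch.Tensor,     # (d,)    <SEG> token hidden state
                 image_embeds: torch.Tensor,  # (H*W, d) image token hidden states
                 grid_hw: tuple) -> torch.Tensor:
    """Dot-product similarity map over the patch grid, normalized to [0, 1]."""
    h, w = grid_hw
    sim = image_embeds @ seg_embed                              # (H*W,)
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)    # assumed normalization
    return sim.view(h, w)


def top_k_points(sim_map: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the k most activated patch locations as (row, col) indices."""
    h, w = sim_map.shape
    idx = sim_map.flatten().topk(k).indices
    return torch.stack((idx // w, idx % w), dim=-1)             # (k, 2)


# Usage with random tensors standing in for LLaVA hidden states.
d, h, w = 256, 24, 24
seg = torch.randn(d)
patches = torch.randn(h * w, d)
sim_map = pseudo_image(seg, patches, (h, w))
points = top_k_points(sim_map, k=5)   # candidate points to guide where to attend
```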
