

Poster

Reasoning to Attend: Try to Understand How [SEG] Token Works

Rui Qian · Xin Yin · Dejing Dou


Abstract: Current Large Multimodal Model (LMM)-empowered tasks such as visual grounding and segmentation typically rely on the [SEG] token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how this token works when mapping the language vocabulary embedding into the corresponding vision codebook space. In this work, we first visualize the similarity maps, a.k.a. pseudo images, obtained by computing the dot-product similarity between the [SEG] token and the image token embeddings derived from the last hidden layer of both the LLaVA and SAM models. Intriguingly, we find that a striking consistency holds in terms of activation responses in these pseudo images, which reveals that what the [SEG] token contributes is the semantic correspondence between image-text pairs. Specifically, the [SEG] token, a placeholder expanded in the text vocabulary, extensively queries individual tokenized image patches to map the semantics of an object from the text to the paired image while the Large Language Model (LLM) is being fine-tuned. Building on these findings, we present READ, which facilitates LMMs' resilient REAsoning capability of where to attenD under the guidance of highly activated points borrowed from the pseudo images. Remarkably, READ features an intuitive design, the Similarity-as-Points module (SasP), which can be seamlessly applied to existing [SEG]-like paradigms with negligible overhead in a plug-and-play fashion. Extensive experiments have been conducted on the highly challenging reasoning segmentation dataset and the widely used RefCOCO(+/g) referring segmentation datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, as observed in prior works (e.g., LISA), we further assess its generation ability on the FP-RefCOCO(+/g) dataset. All code and models will be publicly available.
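The central mechanism described above can be made concrete with a small sketch. The snippet below is an illustrative reconstruction, not the authors' released code: it assumes the [SEG] token embedding and the image patch embeddings have already been extracted from a model's last hidden layer, computes their dot-product similarity to form a pseudo image, and picks the most activated locations as candidate point prompts, in the spirit of the Similarity-as-Points idea. All tensor names, shapes, and the helper function are hypothetical.

import torch

def pseudo_image_and_points(seg_token, image_tokens, grid_hw, num_points=3):
    """seg_token: (d,) last-hidden-layer embedding of the [SEG] token.
    image_tokens: (N, d) last-hidden-layer embeddings of N image patch tokens.
    grid_hw: (H, W) patch grid with H * W == N.
    Returns the (H, W) similarity map ("pseudo image") and the (num_points, 2)
    most activated (row, col) locations, which could serve as point prompts.
    """
    h, w = grid_hw
    # Dot-product similarity between [SEG] and every image patch token.
    sim = image_tokens @ seg_token                       # (N,)
    pseudo_image = sim.reshape(h, w)                     # (H, W) pseudo image
    # Keep the top-k activations as candidate points of where to attend.
    topk = torch.topk(sim, k=num_points).indices
    points = torch.stack((topk // w, topk % w), dim=-1)  # (k, 2) row/col coords
    return pseudo_image, points

# Toy usage with random embeddings (hidden size and grid size are assumptions).
d, h, w = 256, 16, 16
seg = torch.randn(d)
patches = torch.randn(h * w, d)
pmap, pts = pseudo_image_and_points(seg, patches, (h, w))
print(pmap.shape, pts)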
