Poster
Reasoning to Attend: Try to Understand How [SEG] Token Works
Rui Qian · Xin Yin · Dejing Dou
Abstract:
Current Large Multimodal Models (LMMs) empowered for tasks such as visual grounding and segmentation typically rely on the [SEG] token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how it works when mapping language vocabulary embeddings into the corresponding vision codebook space. In this work, we first visualize the similarity maps, a.k.a. pseudo images, which are obtained by computing the dot-product similarity between the [SEG] token and the image token embeddings derived from the last hidden layer in both the LLaVA and SAM models. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the pseudo images, which reveals that what the [SEG] token contributes is the semantic correspondence from image-text pairs. Specifically, the [SEG] token, a placeholder expanded in the text vocabulary, extensively queries within individual tokenized image patches to map the semantics of an object from text to the paired image while the Large Language Models (LLMs) are being fine-tuned. Building upon the above findings, we present READ, which facilitates LMMs' resilient REAsoning capability of where to attenD under the guidance of highly activated points borrowed from pseudo images. Remarkably, READ features an intuitive design, the Similarity as Points module (SasP), which can be seamlessly applied to existing [SEG]-like paradigms with negligible overhead in a plug-and-play fashion. Also, extensive experiments have been conducted on the highly challenging reasoning segmentation dataset and the widely used RefCOCO(+/g) referring segmentation datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, as observed in prior works (e.g., LISA), we further assess its generation ability on the FP-RefCOCO(+/g) dataset. All code and models will be publicly available.
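To make the pseudo-image idea concrete, the sketch below shows how such a similarity map could be formed and how its most highly activated points could be read off. It is a minimal illustration with hypothetical tensor names (seg_embedding, image_token_embeddings) and random data, not the paper's actual implementation.

```python
import torch

def pseudo_image_from_seg_token(seg_embedding, image_token_embeddings, grid_hw):
    """
    Compute a similarity map ("pseudo image") via dot-product similarity
    between the [SEG] token embedding and the image token embeddings.

    seg_embedding:          (d,)   last-hidden-layer embedding of the [SEG] token
    image_token_embeddings: (N, d) last-hidden-layer embeddings of N image tokens
    grid_hw:                (H, W) patch grid with H * W == N
    """
    h, w = grid_hw
    # Dot product between [SEG] and every image token: shape (N,)
    similarity = image_token_embeddings @ seg_embedding
    # Reshape the flat scores into a 2-D pseudo image: shape (H, W)
    return similarity.view(h, w)

def top_activated_points(pseudo_image, k=5):
    """Return (row, col) coordinates of the k highest activations."""
    h, w = pseudo_image.shape
    flat_idx = pseudo_image.flatten().topk(k).indices
    return torch.stack((flat_idx // w, flat_idx % w), dim=1)

# Random tensors stand in for real LLaVA/SAM hidden states.
if __name__ == "__main__":
    d, h, w = 256, 24, 24
    seg = torch.randn(d)
    img_tokens = torch.randn(h * w, d)
    pmap = pseudo_image_from_seg_token(seg, img_tokens, (h, w))
    print(top_activated_points(pmap, k=3))
```

In this reading, the highly activated coordinates would serve as the point guidance that READ's Similarity as Points module supplies to the downstream segmentation model.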