Skip to yearly menu bar Skip to main content


Unifying Automatic and Interactive Matting with Pretrained ViTs

Zixuan Ye · Wenze Liu · He Guo · Yujia Liang · Chaoyi Hong · Hao Lu · Zhiguo Cao

Arch 4A-E Poster #140
[ ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


Automatic and interactive matting largely improve image matting by respectively alleviating the need for auxiliary input and enabling object selection. Due to different settings on whether prompts exist, they either suffer from weakness in instance completeness or region details. Also, when dealing with different scenarios, directly switching between the two matting models introduces inconvenience and higher workload. Therefore, we wonder whether we can alleviate the limitations of both settings while achieving unification to facilitate more convenient use. Our key idea is to offer saliency guidance for automatic mode to enable its attention to detailed regions, and also refine the instance completeness in interactive mode by replacing the binary mask guidance with a more probabilistic form. With different guidance for each mode, we can achieve unification through adaptable guidance, defined as saliency information in automatic mode and user cue for interactive one. It is instantiated as candidate feature in our method, an automatic switch for class token in pretrained ViTs and average feature of user prompts, controlled by the existence of user prompts. Then we use the candidate feature to generate a probabilistic similarity map as the guidance to alleviate the over-reliance on binary mask. Extensive experiments show that our method can adapt well to both automatic and interactive scenarios with more light-weight framework. Code available at

Live content is unavailable. Log in and register to view live content