Mask generation dynamically regulates weakly supervised video instance segmentation

Zifen HE; Lin XU; Yinhui ZHANG; Ying HUANG

doi:10.37188/OPE.20233119.2884

Abstract

Fully supervised video instance segmentation networks depend heavily on accurate mask annotations, which are costly in labor and time to produce; as a result, intelligent machines cannot adapt quickly to new scenes. To address this, a weakly supervised video instance segmentation (WSVIS) network with dynamically regulated mask generation is proposed. First, to overcome the loss of instance activation features caused by the abrupt reduction in channel dimension at the initial mask prediction layer, a multi-level feature fusion module predicts the initial instance features through a step-by-step feature reuse strategy and generates the initial mask by fusing relative position information. Second, a dynamic regulation mechanism establishes mask feature dependencies in the channel and spatial dimensions, strengthening the dynamic interaction between the initially predicted mask and instance-aware information. Finally, fine mask annotation is replaced by the binary color similarity of images together with a bounding box consistency loss, so that video instance segmentation is supervised with bounding box annotations only. Experimental results on the BoxSet and YT-VIS datasets show that the WSVIS network achieves segmentation accuracy and segmentation quality similar to those of the fully supervised network while meeting real-time inference requirements, providing theoretical support and an algorithmic basis for intelligent machines to adapt quickly to new scenes and achieve real-time environmental perception and understanding.
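
The abstract names a bounding box consistency loss and a color similarity term but gives no formulas. The following is a minimal sketch of how such box-only supervision is commonly formulated in box-supervised segmentation, not the authors' implementation: it assumes that the predicted mask and the box are compared through their max-projections onto the x and y axes with a Dice loss, and that color similarity between two pixels decays exponentially with their color distance. All function names and the `theta` parameter are hypothetical.

```python
import math


def project(mask, axis):
    """Max-project a 2D mask (list of rows) onto one axis.

    axis=0 yields one value per column (x-projection),
    axis=1 yields one value per row (y-projection).
    """
    if axis == 0:
        return [max(col) for col in zip(*mask)]
    return [max(row) for row in mask]


def dice_loss(pred, target, eps=1e-6):
    """Dice loss between two 1D projections (0 means perfect overlap)."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - 2.0 * inter / (denom + eps)


def box_to_mask(box, height, width):
    """Rasterize a box (x0, y0, x1, y1), inclusive pixel coordinates."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0 for x in range(width)]
        for y in range(height)
    ]


def box_consistency_loss(pred_mask, box):
    """Assumed form: Dice loss on the x- and y-projections of mask vs. box."""
    height, width = len(pred_mask), len(pred_mask[0])
    box_mask = box_to_mask(box, height, width)
    loss_x = dice_loss(project(pred_mask, 0), project(box_mask, 0))
    loss_y = dice_loss(project(pred_mask, 1), project(box_mask, 1))
    return loss_x + loss_y


def color_similarity(color_a, color_b, theta=2.0):
    """Assumed form: exp(-||a - b|| / theta); theta is a hypothetical scale."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(color_a, color_b)))
    return math.exp(-dist / theta)
```

Under these assumptions, a predicted mask that exactly fills its box incurs a loss close to zero, while an empty mask incurs the maximum loss. Pixel pairs with high color similarity would then be encouraged to share the same mask label, which is how color can stand in for fine mask annotation.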
