TY - JOUR
T1 - Video Object Segmentation without Temporal Information
AU - Maninis, Kevis-Kokitsi
AU - Caelles, Sergi
AU - Chen, Yuhua
AU - Pont-Tuset, Jordi
AU - Leal-Taixé, Laura
AU - Cremers, Daniel
AU - Van Gool, Luc
N1 - Publisher Copyright:
© 1979-2012 IEEE.
PY - 2019/6/1
Y1 - 2019/6/1
N2 - Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the results of these methods can deteriorate significantly. This paper explores the orthogonal approach of processing each frame independently, i.e., disregarding the temporal information. In particular, it tackles the task of semi-supervised video object segmentation: the separation of an object from the background in a video, given its mask in the first frame. We present Semantic One-Shot Video Object Segmentation (OSVOS-S), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one shot). We show that instance-level semantic information, when combined effectively, can dramatically improve the results of our previous method, OSVOS. We perform experiments on two recent single-object video segmentation databases, which show that OSVOS-S is both the fastest and most accurate method in the state of the art. Experiments on multi-object video segmentation show that OSVOS-S obtains competitive results.
AB - Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the results of these methods can deteriorate significantly. This paper explores the orthogonal approach of processing each frame independently, i.e., disregarding the temporal information. In particular, it tackles the task of semi-supervised video object segmentation: the separation of an object from the background in a video, given its mask in the first frame. We present Semantic One-Shot Video Object Segmentation (OSVOS-S), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one shot). We show that instance-level semantic information, when combined effectively, can dramatically improve the results of our previous method, OSVOS. We perform experiments on two recent single-object video segmentation databases, which show that OSVOS-S is both the fastest and most accurate method in the state of the art. Experiments on multi-object video segmentation show that OSVOS-S obtains competitive results.
KW - Video object segmentation
KW - convolutional neural networks
KW - instance segmentation
KW - semantic segmentation
UR - http://www.scopus.com/inward/record.url?scp=85047613775&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2018.2838670
DO - 10.1109/TPAMI.2018.2838670
M3 - Article
C2 - 29994298
AN - SCOPUS:85047613775
SN - 0162-8828
VL - 41
SP - 1515
EP - 1530
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 6
M1 - 8362936
ER -