VOST: Video Object Segmentation under Transformations



VOST is a semi-supervised video object segmentation (VOS) benchmark that focuses on complex object transformations. Unlike in existing datasets, objects in VOST are broken, torn, and molded into new shapes, dramatically changing their overall appearance. As our experiments demonstrate, this presents a major challenge for mainstream, appearance-centric VOS methods. The dataset consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex transformations and capture their full temporal extent. Below, we provide a few key statistics of the dataset.

Baseline Results

The video below shows the outputs of the AOT+ baseline from our paper on a few videos from the validation and test sets of VOST. AOT+ is an extension of AOT that improves its spatio-temporal modeling capacity. However, the model still relies largely on static appearance cues and struggles with complex transformations.


Pavel Tokmakov, Jie Li, Adrien Gaidon.
Breaking the “Object” in Video Object Segmentation.
CVPR 2023.


Email: support@vostdataset.org