Robust Video Scene Understanding:
Tracking and Video Segmentation

A Workshop and Challenge at CVPR 2021

Virtual - Friday 25th June, 2021 - Whole day

Robust Multi-Object Tracking and Segmentation Challenge

We are hosting the RobMOTS Challenge, which evaluates trackers' ability to work robustly in the real world across 8 different benchmarks.


RobMOTS is the ultimate tracking challenge, testing Multi-Object Tracking and Segmentation methods across 8 different benchmarks.


Differences to previous MOTS Challenges include:

 - A combination of 8 different benchmarks for evaluation.

 - Trackers must work 'robustly' (no per-benchmark parameters or hyperparameters).

 - Trackers must track objects from all 80 COCO classes across all benchmarks.

 - Evaluation using the HOTA metrics, ensuring a focus on improving trackers in a meaningful way.


Prizes: Over $3000 USD in prize money.


Submission deadline: June 14th, 2021. 


Results and methods will be presented at our RVSU CVPR'21 Workshop.


Detailed Information:



The field of video scene understanding, and particularly Multi-Object Tracking, has exploded in popularity in the last few years, both within the computer vision community, and beyond it, and progress is now being made faster than ever.


However, this research is often siloed into different sub-communities, each competing on its own benchmark. It is difficult to tell whether significant progress on one benchmark carries over to other benchmarks, or whether such progress is simply the result of overfitting to the dataset at hand.


This is where our workshop comes in. The organisers of eight of the most popular and diverse benchmarks for Multi-Object Tracking and Video Segmentation (MOTChallenge, KITTI, Waymo, YouTube-VIS, DAVIS, BDD100K, TAO and OVIS) have come together, with the goal of moving towards algorithms that work robustly in the real world across a variety of scenarios, and to bring together the disparate community toward this goal.


We believe that improving tracking accuracy and real-world robustness will lead to improvements in many applications, such as self-driving cars, assisted mobility, factory robotics, video editing, AR, and beyond.



The Task:


The RobMOTS (Robust Multi-Object Tracking and Segmentation) task is to produce a Multi-Object Tracking and Segmentation method that is robust and works well across all 8 benchmarks, without using ANY benchmark-specific parameters or hyperparameters.


Trackers are required to track objects belonging to all COCO object categories in all of the benchmarks. While some benchmarks only have a subset of these categories annotated (and others have many more categories), methods must still track all of the COCO categories in all benchmarks, because no per-benchmark knowledge may be used, including which categories are evaluated. The COCO categories were chosen because strong, easily accessible pre-trained detectors and segmenters exist for these classes, trained on the COCO dataset. This nudges the focus of participants toward improving the temporal association aspect of tracking rather than just detection and classification. Furthermore, it potentially allows trackers to take advantage of the COCO dataset as large-scale training data by generating synthetic video from the annotated images.


Following the larger trend in computer vision of moving away from coarse bounding-boxes and towards pixel-accurate segmentation masks, trackers will be required to submit tracking results as a set of non-overlapping segmentation masks in each image, the same as in previous benchmarks such as MOTS and UVOS. While 2 of the 8 test-set benchmarks (TAO and Waymo) only have ground-truth bounding-boxes, we still require methods to produce segmentation masks, again in the spirit that there should be no difference for a method between benchmarks. Since these bounding boxes are 'modal', they align with the visible segmentation masks and we can still evaluate segmentation mask tracking accuracy based on the alignment with these ground-truth boxes.
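As a rough illustration of the non-overlapping requirement, the sketch below checks whether any two masks in a single frame share a pixel. It is not part of the official tooling: the function name, the plain nested-list mask representation and the example data are assumptions made for this example only.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = List[List[int]]


def overlapping_pairs(masks: Dict[int, Mask]) -> List[Tuple[int, int]]:
    """Return pairs of track ids whose masks share at least one pixel."""
    # Convert each mask to the set of (row, col) pixels it covers.
    pixels = {
        track_id: {
            (r, c)
            for r, row in enumerate(mask)
            for c, value in enumerate(row)
            if value
        }
        for track_id, mask in masks.items()
    }
    # Compare every pair of tracks once and keep those with a shared pixel.
    return [
        (a, b)
        for a, b in combinations(sorted(pixels), 2)
        if pixels[a] & pixels[b]
    ]


# Made-up masks for one 2x3 frame, keyed by track id.
frame_masks = {
    1: [[1, 1, 0], [0, 0, 0]],
    2: [[0, 0, 1], [0, 0, 1]],
    3: [[0, 1, 0], [0, 1, 0]],  # shares pixel (0, 1) with track 1
}
print(overlapping_pairs(frame_masks))  # [(1, 3)]
```

A tracker would need to resolve any pair this reports, for example by assigning each contested pixel to a single track, before writing its results.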


We supply a set of detections, produced by running the strongest COCO detector and segmenter we can find. Methods are not required to use these (and we would love to see fully end-to-end trackers that don't use these initial detections). However, the provided detections allow many methods to be run on our robust benchmark with as little effort as possible, and let methods start from a strong and level playing field in terms of detection. We hope this also encourages tracking researchers to re-evaluate their previous methods on our robust benchmark.





 - MOTS-Challenge provides MOTS annotations for the MOT17 benchmark, part of the MOTChallenge, which has been the most commonly used MOT benchmark since 2015. It contains crowded scenes of people (the only class annotated), e.g. shopping malls and city squares.


 - KITTI-MOTS provides MOTS annotations for KITTI, the first benchmark for evaluating a large range of tasks for autonomous vehicles. It consists of videos taken from a car-mounted camera as it drives around Karlsruhe, Germany. There are annotations for cars and pedestrians.


 - The DAVIS-Unsupervised dataset is a part of the DAVIS Challenge, one of the first benchmarks for video segmentation, and contains high quality segmentation masks for carefully selected videos which contain objects, animals and people with considerable motion.


 - YouTube-VIS is a large-scale and diverse benchmark for tracking and video segmentation containing video segmentation annotations for forty classes of objects including vehicles, animals and household items.


 - The BDD100K MOTS dataset provides MOTS annotations for BDD100K, one of the largest driving video datasets, with high environmental, geographic and weather diversity.


 - Tracking Any Object (TAO) is a new benchmark which contains thousands of videos with tracking annotations for over 800 classes of objects. This is one of the largest and most diverse tracking datasets to date.


 - The Waymo Open Dataset is a recent large-scale benchmark for autonomous driving and consists of 1,150 scenes that each span about 20 seconds. The sequences were recorded in multiple areas of San Francisco, Mountain View, and Phoenix. 


 - OVIS (Occluded Video Instance Segmentation) is a recent video segmentation dataset, which focuses on tracking and segmenting objects in heavily occluded scenes. It contains 25 categories of objects for which occlusions commonly occur.


Submission Format:


Methods need to submit results to a single submission and evaluation server for all of the benchmarks, using a common unified data format. The submission format is similar to (but not the same as) the format used for KITTI-MOTS and MOTS-Challenge. For example, a submission is a .txt file in which each line corresponds to one segmentation mask and contains the mask in pycocotools.mask RLE format, the frame number, the track ID and the class.


More details of the submission format and an example submission can be found on the code site.
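To make the idea of one mask per line concrete, here is a minimal sketch of reading and writing such a line. The field order used here (frame, track ID, class ID, image height, image width, RLE string) and all names are assumptions for illustration; the authoritative layout is the one documented on the code site. The RLE string in the example is a placeholder, not a real encoded mask.

```python
from dataclasses import dataclass


@dataclass
class MaskRecord:
    """One line of a results file (hypothetical field layout)."""
    frame: int
    track_id: int
    class_id: int
    height: int
    width: int
    rle: str  # compressed RLE string, e.g. as produced by pycocotools.mask


def parse_line(line: str) -> MaskRecord:
    # Split into at most six fields so the RLE string is kept intact
    # as the final field.
    frame, track_id, class_id, height, width, rle = line.strip().split(maxsplit=5)
    return MaskRecord(
        int(frame), int(track_id), int(class_id), int(height), int(width), rle
    )


def format_line(rec: MaskRecord) -> str:
    # Write the fields back in the same (assumed) order, space separated.
    return (
        f"{rec.frame} {rec.track_id} {rec.class_id} "
        f"{rec.height} {rec.width} {rec.rle}"
    )


example = "12 7 1 375 1242 PLACEHOLDER_RLE"  # placeholder, not a real mask
record = parse_line(example)
print(record.track_id, record.class_id)  # 7 1
print(format_line(record) == example)    # True
```

Keeping the image height and width on each line is what lets an RLE string be decoded without opening the image; whether the official format does this should be checked against the example submission.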



Evaluation Metrics:


Trackers will be ranked using the novel HOTA metrics. This allows trackers to be evaluated in a way that hasn't previously been possible with other MOT metrics. For calculating HOTA we will use the TrackEval codebase.


For more information on the HOTA metrics, we highly recommend this blog post. Also check out the HOTA IJCV paper (open access), and the code.


For each of the 8 benchmarks, the HOTA scores are averaged over the classes present in two ways: a direct class average, and an average weighted by the number of detections in each class (this detection-weighted evaluation is done in a class-agnostic way). The final score for each dataset is the geometric mean of these two. This ensures a fair treatment of class imbalances within datasets. Averaging only over classes would weigh rare classes too highly, and tracking common objects well would not be rewarded. Averaging only over detections would weigh common objects too highly, and methods could ignore rarer objects which are also important to track.
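The per-benchmark combination described above can be sketched as follows. This is a simplified illustration, not the TrackEval implementation: the function name and example numbers are made up, and the detection-weighted term is computed here as a weighted average of per-class scores, whereas the challenge computes it through a class-agnostic evaluation.

```python
from math import sqrt
from typing import Dict


def benchmark_score(hota: Dict[str, float], num_dets: Dict[str, int]) -> float:
    """Geometric mean of class-averaged and detection-weighted HOTA."""
    # Direct class average: every class counts equally.
    class_avg = sum(hota.values()) / len(hota)
    # Detection-weighted average: classes count in proportion to their
    # number of detections (simplification of the class-agnostic evaluation).
    total_dets = sum(num_dets[c] for c in hota)
    det_avg = sum(hota[c] * num_dets[c] for c in hota) / total_dets
    return sqrt(class_avg * det_avg)


# Made-up per-class HOTA scores and detection counts for one benchmark.
hota = {"person": 0.60, "car": 0.40}
dets = {"person": 900, "car": 100}
score = benchmark_score(hota, dets)
print(round(score, 4))
```

In this example the class average is 0.50 and the detection-weighted average is 0.58, so the combined score falls between the two and neither the rare class nor the common class dominates.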


The final overall robust HOTA score will simply be the average score across all benchmarks. Equations are below for clarity.
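The scoring described in the text can be written out as below. This is a reconstruction from the prose rather than the original equations, and the symbols are our own notation: $C_b$ is the set of classes present in benchmark $b$, $\mathrm{HOTA}_c^{(b)}$ is the HOTA score for class $c$, and $n_c^{(b)}$ is the number of detections of class $c$. The second line expresses the detection weighting as a weighted class average; the challenge computes this term through a class-agnostic evaluation.

```latex
\begin{align}
\mathrm{HOTA}_{\mathrm{cls}}^{(b)} &= \frac{1}{|C_b|} \sum_{c \in C_b} \mathrm{HOTA}_c^{(b)} \\
\mathrm{HOTA}_{\mathrm{det}}^{(b)} &= \frac{\sum_{c \in C_b} n_c^{(b)} \, \mathrm{HOTA}_c^{(b)}}{\sum_{c \in C_b} n_c^{(b)}} \\
S_b &= \sqrt{\mathrm{HOTA}_{\mathrm{cls}}^{(b)} \cdot \mathrm{HOTA}_{\mathrm{det}}^{(b)}} \\
\mathrm{HOTA}_{\mathrm{robust}} &= \frac{1}{8} \sum_{b=1}^{8} S_b
\end{align}
```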

Note that, due to the design of the HOTA metrics, the use of COCO classes, and the strong provided detections, we expect (although we will have to wait for the results to confirm) that the best performing methods will be those that invest the most effort in improving tracking association. This is unlike previous challenges such as the MOTS Challenge at CVPR'20, where methods were most rewarded for improving detection (because of the MOTA metrics), and the TAO Challenge at ECCV'20, where methods were most rewarded for improving classification (because of the class-averaged Track mAP metrics and the large number of classes). Please read the HOTA blog post and paper for more details.


We will award prizes for 1st, 2nd and 3rd place overall.

There is more than $3000 USD prize money in total.

The prize money distribution is yet to be decided. There may also be additional smaller prizes for various things.

Paper Presentation at the Robust Video Scene Understanding (RVSU) CVPR'21 Workshop:

Shortly after the Challenge closes we will invite participants to submit a short abstract (400 words maximum) of their method.


Based on these abstracts and the results obtained, we will decide which teams are accepted in the Challenge Track of the workshop. Invited teams will be asked to provide a 4-page paper (including references and acknowledgements) describing their approach. Accepted papers will be self-published on the website of the challenge (not in the official proceedings, although they have the same value).


Those with accepted papers will also be asked to provide a 1-minute spotlight video that will be played live during the workshop, and a static poster to be presented during a live poster session. The winning and runner-up teams will be asked to give a 10-minute oral presentation.

Other considerations:

Each entry must be associated with a team and provide its affiliation.

The best entry of each team will be public on the leaderboard at all times.

We will only allow submissions in the "robust" setting, i.e. no per-benchmark parameters, hyperparameters or other knowledge may be used. Although we have no way to check this during the live challenge, we will do our best to detect violations a posteriori before the workshop.

We reserve the right to remove any of the entry methods if we suspect cheating or other misconduct.

Each of the datasets used in this challenge is published under its own license. Participants must comply with the license regulations for each of the datasets.

Important Dates 2021:

March 3rd:   Challenge information released.

April 15th:  Release: training/val/test images + train gt + provided detections + example tracker + eval code.

April 28th:  Launch: validation + test server.

June 15th:   Challenge submission deadline.

June 15th:   Abstract submission deadline.

June 16th:   Invitation to submit workshop papers sent out.

June 21st:   Final camera-ready paper, poster, and video submission deadline.

June 25th:   Workshop, including award ceremony and paper/poster/spotlight/oral presentations.


All deadlines are at the end of the day, Anywhere on Earth (AoE).

Data links and submission portal:

We release an official training, validation and test set for this RobMOTS challenge. Although we release a training set for convenience, methods are free to train on any data they please EXCEPT for the data in our validation and test sets.


We will provide a unified submission portal and evaluation server for evaluating over all of the benchmarks. The submission format is the same across all benchmarks. The validation submission portal allows unlimited submissions for tuning methods, and the test submission portal only allows a very limited number of submissions (to prevent overfitting to the test set). Current leaderboards for both are displayed live.



Main Organizer and Contact Person:



Jonathon Luiten (RWTH Aachen University)


Copyright © 2020 Jonathon Luiten