Learning Video Object Segmentation from Limited Labelled Data

  • Author / Creator
    Siam, Mennatullah
  • Video object/semantic segmentation has tremendous impact on many robotics applications. Videos of manipulation tasks or driving scenes are more relevant to these applications than static images. However, current video semantic segmentation work focuses on learning from large-scale datasets. Deep learning methods are highly data dependent and require large amounts of data to perform accurately. Manual annotation of large-scale video object/semantic segmentation benchmarks is labour-intensive and costly. The currently available benchmarks have been financed by large companies and universities in first-world countries. The massive computing needs and annotation costs create a barrier to using deep learning with large-scale labelled data in, e.g., developing countries. Thus, we focus on few-shot object segmentation, which studies how to learn the segmentation of novel classes from a few labelled samples. We then study its overlap with the video object segmentation task as a means to address the above problems. We present a thorough investigation of the challenges, assumptions and solutions shared by both tasks. Throughout the thesis contributions, we mainly focus on metric learning approaches, also termed learning to compare. We start with few-shot object segmentation and address two main issues. The first issue is the reliance of previous methods on two branches; we propose a single-branch method instead. Inspired by cosine classifiers, we propose a novel multi-resolution masked weight imprinting to generate the weights of the final segmentation layer for novel classes. The second issue is the use of a single vector representation to guide the segmentation of novel classes, which loses detailed information necessary for the segmentation task. We propose a co-attention mechanism with semantic conditioning to improve the interaction between the test (query set) and training (support set) data during few-shot inference.
The semantic conditioning also alleviates the need for pixel-level annotations of the few training samples, relying instead on image-level labels. We then transition to video-related tasks and formalize the task of video class agnostic segmentation, which benefits from the overlap of few-shot and video object segmentation. We propose two formulations of the problem, both of which segment objects in a class agnostic manner, and show applications in both autonomous driving and robot manipulation. The first formulation poses the problem as motion segmentation, where we propose the first deep learning approach to motion segmentation in the autonomous driving literature. We further provide the KITTI-MoSeg dataset with motion segmentation annotations. We then extend the work to incorporate motion instance labels along with an increased number of categories, to push the trained models to generalize to unknown moving objects. The second formulation, as an open-set segmentation problem, can handle both static and moving objects. We propose a novel contrastive learning approach with semantic and temporal guidance to improve the discrimination between known and unknown objects and to ensure temporal consistency. We further provide scenarios in the CARLA simulation environment to motivate the need for such a formulation. Finally, we propose a motion adaptation mechanism for video class agnostic segmentation to enable efficient inference.

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.