Towards Comprehensive Visual Understanding via Deep Neural Networks

Publication Type:
Thesis
Issue Date:
2025
Deep neural networks (DNNs) have made significant advances in visual scene understanding, demonstrating great potential for downstream applications such as autonomous driving, robotic navigation, and human-computer interaction. Despite these successes, generalization ability remains a major obstacle on the path to comprehensive visual understanding, particularly when dealing with i) diverse scenes and ii) diverse semantic structures within those scenes. Existing work typically requires extensive annotation for each scene (domain) and separates the understanding of semantic targets into distinct tasks, designing a meticulously tailored network and corresponding optimization for each. This poses challenges from two perspectives: i) generalizing from one domain to another, and ii) generalizing from one task to another. To adapt an existing model to various domains (challenge i), this thesis proposes a self-supervised learning framework that learns generalizable structural representations, and a multi-task learning framework that extracts transferable knowledge from multiple modalities. To enhance a model’s ability to process various semantic structures (challenge ii), this thesis introduces a holistic disentanglement and modeling of segmentation targets under a unified framework. Extensive experiments verify the effectiveness of the proposed methods on scene understanding tasks, including Unsupervised Domain Adaptation (UDA), Exemplar-guided Video Segmentation (EVS), Video Instance Segmentation (VIS), Video Semantic Segmentation (VSS), Video Panoptic Segmentation (VPS), and Human-Object Interaction Detection (HOI Detection).