Towards Comprehensive Visual Understanding via Deep Neural Networks

Publication Type:
Thesis
Issue Date:
2025
Deep neural networks (DNNs) have made significant advances in visual scene understanding, demonstrating great potential for downstream applications such as autonomous driving, robotic navigation, and human-computer interaction. Despite these successes, generalization ability remains a major obstacle on the path to comprehensive visual understanding, particularly when dealing with i) diverse scenes and ii) diverse semantic structures within those scenes. Existing work typically requires extensive annotation for each scene (domain) and separates the understanding of semantic targets into distinct tasks, designing a meticulously tailored network and corresponding optimization for each. This poses challenges from two perspectives: i) generalizing from one domain to another, and ii) generalizing from one task to another. To adapt an existing model to various domains (challenge i), this thesis proposes a self-supervised learning framework that learns generalizable structural representations, and a multi-task learning framework that extracts transferable knowledge from multiple modalities. To enhance a model’s ability to process various semantic structures (challenge ii), this thesis introduces a holistic disentanglement and modeling of segmentation targets under a unified framework. Extensive experiments verify the effectiveness of the proposed methods on scene understanding tasks, including Unsupervised Domain Adaptation (UDA), Exemplar-guided Video Segmentation (EVS), Video Instance Segmentation (VIS), Video Semantic Segmentation (VSS), Video Panoptic Segmentation (VPS), and Human-Object Interaction Detection (HOI Detection).