Evolving Video Analysis: From Object Perception to Holistic Understanding

Publication Type: Thesis
Issue Date: 2024
The domain of video content analysis has advanced rapidly with the proliferation of digital video and the growing capabilities of computer vision. Despite this progress, significant challenges remain in both video object perception and holistic video understanding, which are crucial for applications ranging from autonomous driving to interactive media. This thesis addresses these challenges by developing methodologies that improve the accuracy and efficiency of video analysis systems.

In the area of video object perception, this research tackles the problems of accurately detecting, categorizing, and referring to objects within video frames under varied conditions. Key contributions include the Hierarchical Video Relation Network (HVR-Net), which exploits inter-video proposal relations to improve object detection accuracy, and Progressive Frame-Proposal Mining (PFPM), which leverages sparse annotations to improve detection in a weakly supervised setting. In addition, the Hybrid Temporal-scale Multimodal Learning (HTML) framework refines the segmentation of objects described by text, bridging visual content and language input.

For holistic video understanding, this thesis introduces methodologies and datasets that improve the interpretation of complex video scenes and dynamics. The Dual-AI framework combines spatial and temporal information along two complementary paths, modeling both individual actions and group dynamics for more accurate recognition of complex group activities. Specialized methodologies for Portrait Mode Video recognition adapt video analysis techniques to the vertical video format common on social media, addressing its distinct challenges. Furthermore, the Shot2Story20K dataset establishes a new benchmark for multi-shot video understanding, supporting detailed narrative synthesis across sequential shots and enriching the storytelling potential of video content analysis.

In conclusion, this thesis contributes a suite of methodologies and datasets that strengthen both the foundations of video object perception and the broader capabilities of video understanding. These contributions address current limitations of video analysis technologies and lay the groundwork for integrating more sophisticated machine learning models into video content analysis systems. Through these efforts, the thesis makes video analysis more robust, adaptable, and context-aware, aligning it more closely with human-level perception and interpretation.
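To make the inter-video relation idea concrete, the sketch below refines object-proposal features by letting proposals drawn from several videos attend to one another. This is a minimal illustrative PyTorch sketch, not the published HVR-Net implementation; the module name, feature dimensions, and the use of a single multi-head self-attention layer with a residual connection are all assumptions.

    import torch
    import torch.nn as nn

    class InterVideoProposalRelation(nn.Module):
        # Toy relation module: proposals from several videos attend to
        # each other, so each proposal feature is refined by cross-video
        # context. Hypothetical sketch, not the HVR-Net release.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, proposals):
            # proposals: (num_videos, num_proposals, dim)
            v, p, d = proposals.shape
            flat = proposals.reshape(1, v * p, d)  # one joint proposal set
            refined, _ = self.attn(flat, flat, flat)
            refined = self.norm(flat + refined)    # residual + layer norm
            return refined.reshape(v, p, d)

    # e.g. three videos, five proposals each, 256-d features
    feats = torch.randn(3, 5, 256)
    print(InterVideoProposalRelation()(feats).shape)  # torch.Size([3, 5, 256])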
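The referring-segmentation contribution rests on conditioning visual features on a language description. The following minimal sketch fuses visual tokens with text tokens through one cross-attention layer in which the visual stream queries the words; the class name, dimensions, and single-layer design are illustrative assumptions and do not reproduce the HTML framework's hybrid temporal scales.

    class TextVideoFusion(nn.Module):
        # Cross-attention from visual tokens to text tokens: a minimal
        # stand-in for conditioning video features on a referring
        # expression. Hypothetical sketch, not the HTML implementation.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, vid, txt):
            # vid: (batch, num_visual_tokens, dim)
            # txt: (batch, num_words, dim)
            fused, _ = self.cross(vid, txt, txt)  # queries=visual, keys/values=text
            return self.norm(vid + fused)

    vid, txt = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
    print(TextVideoFusion()(vid, txt).shape)  # torch.Size([2, 196, 256])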
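The dual-path idea in the group-activity work can be pictured as two orderings of the same building blocks: attention across actors within a frame (spatial) and attention across frames for each actor (temporal). The sketch below runs spatial-then-temporal and temporal-then-spatial attention and averages the two paths; sharing the attention layers between paths and the averaging fusion are simplifying assumptions, not the Dual-AI release.

    class DualPathSTAttention(nn.Module):
        # Two orderings of spatial (across actors) and temporal (across
        # frames) self-attention, fused by averaging. Schematic sketch of
        # the dual-path idea under assumed layer sizes and fusion.
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

        def _attend(self, x, attn):
            out, _ = attn(x, x, x)
            return x + out  # residual self-attention block

        def forward(self, x):
            # x: (batch, frames, actors, dim)
            b, t, n, d = x.shape
            # path 1: spatial then temporal
            s = self._attend(x.reshape(b * t, n, d), self.spatial).reshape(b, t, n, d)
            s = self._attend(s.permute(0, 2, 1, 3).reshape(b * n, t, d), self.temporal)
            s = s.reshape(b, n, t, d).permute(0, 2, 1, 3)
            # path 2: temporal then spatial
            u = self._attend(x.permute(0, 2, 1, 3).reshape(b * n, t, d), self.temporal)
            u = u.reshape(b, n, t, d).permute(0, 2, 1, 3)
            u = self._attend(u.reshape(b * t, n, d), self.spatial).reshape(b, t, n, d)
            return 0.5 * (s + u)  # average the two path outputs

    # e.g. batch of 2 clips, 8 frames, 6 actors, 128-d actor features
    acts = torch.randn(2, 8, 6, 128)
    print(DualPathSTAttention()(acts).shape)  # torch.Size([2, 8, 6, 128])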