Evolving Video Analysis: From Object Perception to Holistic Understanding

Publication Type: Thesis
Issue Date: 2024
The domain of video content analysis has advanced rapidly with the proliferation of digital video and the growing capabilities of computer vision. Despite this progress, significant challenges remain in both video object perception and holistic video understanding, which are crucial for applications ranging from autonomous driving to interactive media. This thesis addresses these challenges by developing methodologies that improve the accuracy and efficiency of video analysis systems.

In the area of video object perception, this research tackles the problems of accurately detecting, categorizing, and referring to objects within video frames under varied conditions. Key contributions include the Hierarchical Video Relation Network (HVR-Net), which exploits inter-video proposal relations to improve object detection accuracy, and Progressive Frame-Proposal Mining (PFPM), which leverages sparse annotations to improve detection in a weakly supervised setting. In addition, the Hybrid Temporal-scale Multimodal Learning (HTML) framework refines the segmentation of objects described by text, bridging visual content and language input.

For holistic video understanding, this thesis introduces methodologies and datasets that improve the interpretation of complex video scenes and dynamics. The Dual-AI framework combines spatial and temporal information along two complementary paths, modeling both individual actions and group dynamics for more accurate recognition of complex group activities. Specialized methodologies for Portrait Mode Video recognition adapt video analysis techniques to the vertical video format common on social media, addressing its distinct challenges. Furthermore, the Shot2Story20K dataset establishes a new benchmark for multi-shot video understanding, supporting detailed narrative synthesis across sequential shots and enriching the storytelling potential of video content analysis.

In conclusion, this thesis contributes a suite of methodologies and datasets that strengthen both the foundations of video object perception and the broader capabilities of video understanding. These contributions address current limitations of video analysis technologies and lay the groundwork for integrating more sophisticated machine learning models into video content analysis systems. Through these efforts, the thesis makes video analysis more robust, adaptable, and context-aware, aligning it more closely with human-level perception and interpretation.
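To make the inter-video relation idea concrete, the sketch below refines object-proposal features by letting proposals drawn from several videos attend to one another. This is a minimal illustrative PyTorch sketch, not the published HVR-Net implementation; the module name, feature dimensions, and the use of a single multi-head self-attention layer with a residual connection are all assumptions.

    import torch
    import torch.nn as nn

    class InterVideoProposalRelation(nn.Module):
        # Toy relation module: proposals from several videos attend to
        # each other, so each proposal feature is refined by cross-video
        # context. Hypothetical sketch, not the HVR-Net release.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, proposals):
            # proposals: (num_videos, num_proposals, dim)
            v, p, d = proposals.shape
            flat = proposals.reshape(1, v * p, d)  # one joint proposal set
            refined, _ = self.attn(flat, flat, flat)
            refined = self.norm(flat + refined)    # residual + layer norm
            return refined.reshape(v, p, d)

    # e.g. three videos, five proposals each, 256-d features
    feats = torch.randn(3, 5, 256)
    print(InterVideoProposalRelation()(feats).shape)  # torch.Size([3, 5, 256])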
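The referring-segmentation contribution rests on conditioning visual features on a language description. The following minimal sketch fuses visual tokens with text tokens through one cross-attention layer in which the visual stream queries the words; the class name, dimensions, and single-layer design are illustrative assumptions and do not reproduce the HTML framework's hybrid temporal scales.

    class TextVideoFusion(nn.Module):
        # Cross-attention from visual tokens to text tokens: a minimal
        # stand-in for conditioning video features on a referring
        # expression. Hypothetical sketch, not the HTML implementation.
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, vid, txt):
            # vid: (batch, num_visual_tokens, dim)
            # txt: (batch, num_words, dim)
            fused, _ = self.cross(vid, txt, txt)  # queries=visual, keys/values=text
            return self.norm(vid + fused)

    vid, txt = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
    print(TextVideoFusion()(vid, txt).shape)  # torch.Size([2, 196, 256])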
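The dual-path idea in the group-activity work can be pictured as two orderings of the same building blocks: attention across actors within a frame (spatial) and attention across frames for each actor (temporal). The sketch below runs spatial-then-temporal and temporal-then-spatial attention and averages the two paths; sharing the attention layers between paths and the averaging fusion are simplifying assumptions, not the Dual-AI release.

    class DualPathSTAttention(nn.Module):
        # Two orderings of spatial (across actors) and temporal (across
        # frames) self-attention, fused by averaging. Schematic sketch of
        # the dual-path idea under assumed layer sizes and fusion.
        def __init__(self, dim=128, heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

        def _attend(self, x, attn):
            out, _ = attn(x, x, x)
            return x + out  # residual self-attention block

        def forward(self, x):
            # x: (batch, frames, actors, dim)
            b, t, n, d = x.shape
            # path 1: spatial then temporal
            s = self._attend(x.reshape(b * t, n, d), self.spatial).reshape(b, t, n, d)
            s = self._attend(s.permute(0, 2, 1, 3).reshape(b * n, t, d), self.temporal)
            s = s.reshape(b, n, t, d).permute(0, 2, 1, 3)
            # path 2: temporal then spatial
            u = self._attend(x.permute(0, 2, 1, 3).reshape(b * n, t, d), self.temporal)
            u = u.reshape(b, n, t, d).permute(0, 2, 1, 3)
            u = self._attend(u.reshape(b * t, n, d), self.spatial).reshape(b, t, n, d)
            return 0.5 * (s + u)  # average the two path outputs

    # e.g. batch of 2 clips, 8 frames, 6 actors, 128-d actor features
    acts = torch.randn(2, 8, 6, 128)
    print(DualPathSTAttention()(acts).shape)  # torch.Size([2, 8, 6, 128])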