ZSTAD: Zero-Shot Temporal Activity Detection

Zhang, L; Chang, X; Liu, J; Luo, M; Wang, S; Ge, Z; Hauptmann, A

Publisher:: IEEE
Publication Type:: Conference Proceeding
Citation:: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, 00, pp. 876-885
Issue Date:: 2020-01-01

Closed Access

	Filename	Description	Size	Format
	Zhang_ZSTAD_Zero-Shot_Temporal_Activity_Detection_CVPR_2020_paper.pdf	Published version	792.18 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, L
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Liu, J
dc.contributor.author	Luo, M
dc.contributor.author	Wang, S
dc.contributor.author	Ge, Z
dc.contributor.author	Hauptmann, A
dc.date	2020-06-14
dc.date.accessioned	2023-03-31T10:21:15Z
dc.date.available	2023-03-31T10:21:15Z
dc.date.issued	2020-01-01
dc.identifier.citation	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, 00, pp. 876-885
dc.identifier.issn	1063-6919
dc.identifier.uri	http://hdl.handle.net/10453/168982
dc.description.abstract	An integral part of video analysis and surveillance is temporal activity detection, which aims to simultaneously recognize and localize activities in long untrimmed videos. Currently, the most effective methods of temporal activity detection are based on deep learning, and they typically perform very well when trained on large-scale annotated video datasets. However, these methods are limited in real applications because videos of certain activity classes may be unavailable and data annotation is time-consuming. To address this challenging problem, we propose a novel task setting called zero-shot temporal activity detection (ZSTAD), in which activities never seen during training can still be detected. We design an end-to-end deep network based on R-C3D as the architecture for this solution. The proposed network is optimized with an innovative loss function that considers the embeddings of activity labels and their super-classes while learning the common semantics of seen and unseen activities. Experiments on both the THUMOS'14 and the Charades datasets show promising performance in detecting unseen activities.
dc.language	en
dc.publisher	IEEE
dc.relation	http://purl.org/au-research/grants/arc/DE190100626
dc.relation.ispartof	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.relation.ispartof	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
dc.relation.ispartofseries	IEEE Conference on Computer Vision and Pattern Recognition
dc.relation.isbasedon	10.1109/CVPR42600.2020.00096
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	ZSTAD: Zero-Shot Temporal Activity Detection
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	ELECTR NETWORK
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-31T10:21:14Z
pubs.finish-date	2020-06-19
pubs.publication-status	Published
pubs.start-date	2020-06-14
pubs.volume	00

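The abstract does not spell out how unseen activities are scored or how the loss is built. As a rough illustration of the general zero-shot idea it describes (classifying through label embeddings, with a super-class signal shared by seen and unseen classes), here is a minimal Python sketch. It assumes each temporal proposal's visual feature has already been projected into the same space as word embeddings of the activity labels and their super-classes. The cosine-similarity scoring, the weighted sum of two cross-entropy terms, the weight `alpha`, all function names, class names, and toy vectors are hypothetical stand-ins; this is not the ZSTAD network or its actual loss.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_proposal(feature: Vector, label_emb: Dict[str, Vector]) -> str:
    """Label a proposal with the class whose label embedding is most similar.

    Unseen classes can be scored here because only their label embedding is
    needed; no training videos of those classes are required.
    """
    return max(label_emb, key=lambda c: cosine(feature, label_emb[c]))


def softmax_xent(scores: Dict[str, float], target: str) -> float:
    """Cross-entropy of softmax(scores) w.r.t. the target key (numerically stable)."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return log_z - scores[target]


def hypothetical_zs_loss(
    feature: Vector,
    target: str,
    seen_emb: Dict[str, Vector],
    super_emb: Dict[str, Vector],
    super_of: Dict[str, str],
    alpha: float = 0.5,  # hypothetical weight of the super-class term
) -> float:
    """Hypothetical training loss: seen-class term plus weighted super-class term.

    Illustrative only; this is not the loss defined in the ZSTAD paper.
    """
    class_scores = {c: cosine(feature, e) for c, e in seen_emb.items()}
    super_scores = {s: cosine(feature, e) for s, e in super_emb.items()}
    return softmax_xent(class_scores, target) + alpha * softmax_xent(
        super_scores, super_of[target]
    )


# Toy 3-d "label embeddings" (hypothetical values, not real word vectors).
seen_emb = {"PoleVault": [1.0, 0.2, 0.0], "Diving": [0.0, 1.0, 0.2]}
unseen_emb = {"HighJump": [0.9, 0.0, 0.4]}
super_emb = {"track_and_field": [1.0, 0.0, 0.2], "water_sports": [0.0, 1.0, 0.0]}
super_of = {
    "PoleVault": "track_and_field",
    "Diving": "water_sports",
    "HighJump": "track_and_field",
}

# A proposal feature assumed to be already projected into the embedding space.
feature = [0.8, 0.05, 0.5]

print(classify_proposal(feature, seen_emb))                    # PoleVault
print(classify_proposal(feature, {**seen_emb, **unseen_emb}))  # HighJump
```

In this toy setup, training would only involve the seen classes (`PoleVault`, `Diving`); at test time the unseen class `HighJump` joins the candidate set purely through its label embedding, which is what allows a proposal to be assigned to a class never observed during training. Sharing a super-class (`track_and_field`) with a seen class is one plausible way such a loss could transfer semantics to unseen activities.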

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168982