Field | Value | Language
dc.contributor.author | Wang, X |
dc.contributor.author | Zhu, L |
dc.contributor.author | Wu, F |
dc.contributor.author | Yang, Y https://orcid.org/0000-0002-0512-880X |
dc.date.accessioned | 2024-03-15T04:14:56Z |
dc.date.available | 2024-03-15T04:14:56Z |
dc.date.issued | 2023-05 |
dc.identifier.citation | ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19, (3) |
dc.identifier.issn | 1551-6857 |
dc.identifier.issn | 1551-6865 |
dc.identifier.uri | http://hdl.handle.net/10453/176774 |
dc.description.abstract | <jats:p>It is crucial to sample a small portion of relevant frames for efficient video classification. Existing methods mainly develop hand-designed sampling strategies or learn sequential selection policies. However, two challenges remain. First, hand-designed sampling strategies are intrinsically non-adaptive to different video backbones. Second, sequential frame selection policies ignore temporal relations among all video frames, and the sequential selection process hinders the application of these video samplers in speed-critical systems. In this article, we propose a differentiable parallel video sampling network (PSN) to tackle these challenges. First, we optimize the video sampler with a differentiable surrogate loss, allowing the sampler to be learned dynamically in cooperation with the video classification model. Our sampler considers the feedback from all frames jointly, eliminating the learning difficulties of sequential decision making. The learning process is fully gradient-based, so the sampler is learned efficiently. Our video sampler can assess a set of frames swiftly and determine the importance of each frame in parallel. Second, we propose to model the inter-relation among contextual frames, which encourages the sampler to select frames based on a comprehensive inspection of the entire video. We observe that a simple context relation mining instantiation significantly improves classification performance. Experimental results on three standard video recognition benchmarks demonstrate the efficacy and efficiency of our framework.</jats:p> |
dc.language | English |
dc.publisher | ASSOC COMPUTING MACHINERY |
dc.relation.ispartof | ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS |
dc.relation.isbasedon | 10.1145/3569584 |
dc.rights | info:eu-repo/semantics/closedAccess |
dc.subject | 0803 Computer Software, 0805 Distributed Computing, 0806 Information Systems |
dc.subject.classification | Artificial Intelligence & Image Processing |
dc.subject.classification | 4603 Computer vision and multimedia computation |
dc.subject.classification | 4606 Distributed computing and systems software |
dc.subject.classification | 4607 Graphics, augmented reality and games |
dc.title | A Differentiable Parallel Sampler for Efficient Video Classification |
dc.type | Journal Article |
utslib.citation.volume | 19 |
utslib.for | 0803 Computer Software |
utslib.for | 0805 Distributed Computing |
utslib.for | 0806 Information Systems |
pubs.organisational-group | University of Technology Sydney |
pubs.organisational-group | University of Technology Sydney/Faculty of Engineering and Information Technology |
utslib.copyright.status | closed_access |
dc.date.updated | 2024-03-15T04:14:50Z |
pubs.issue | 3 |
pubs.publication-status | Published |
pubs.volume | 19 |
utslib.citation.issue | 3 |