Learning Audio-Visual Correlations From Variational Cross-Modal Generation

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 4300-4304
Issue Date:
2021-06-11
People can easily imagine the potential sound of an event while seeing it. This natural synchronization between audio and visual signals reveals their intrinsic correlations. To this end, we propose to learn audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied to multiple downstream tasks such as audio-visual cross-modal localization and retrieval. To tackle the problem, we introduce a novel Variational AutoEncoder (VAE) framework that consists of multiple encoders and a shared decoder (MS-VAE), with an additional Wasserstein distance constraint. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE effectively captures audio-visual correlations and can be readily applied to multiple audio-visual downstream tasks, achieving competitive performance even without any label information during training.
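The multi-encoder, shared-decoder structure with a Wasserstein constraint on the latents can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation: all layer shapes, the linear encoders, and the tanh decoder are made up, and the Wasserstein term uses the closed-form squared 2-Wasserstein distance between diagonal Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    # Modality-specific encoder: linear maps to Gaussian latent parameters.
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar, rng):
    # Standard VAE reparameterization trick: z = mu + sigma * eps.
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def decode(z, w_dec):
    # Shared decoder applied to latents from either modality.
    return np.tanh(z @ w_dec)

def w2_squared(mu1, logvar1, mu2, logvar2):
    # Closed-form squared 2-Wasserstein distance between diagonal Gaussians:
    # ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    s1, s2 = np.exp(0.5 * logvar1), np.exp(0.5 * logvar2)
    return np.sum((mu1 - mu2) ** 2, axis=-1) + np.sum((s1 - s2) ** 2, axis=-1)

# Toy dimensions for paired audio/visual features (all sizes are invented).
d_audio, d_visual, d_latent, d_out = 16, 32, 8, 16
audio = rng.standard_normal((4, d_audio))
visual = rng.standard_normal((4, d_visual))

# Separate encoder weights per modality, one shared decoder weight.
enc_a = (rng.standard_normal((d_audio, d_latent)),
         rng.standard_normal((d_audio, d_latent)))
enc_v = (rng.standard_normal((d_visual, d_latent)),
         rng.standard_normal((d_visual, d_latent)))
w_dec = rng.standard_normal((d_latent, d_out))

mu_a, lv_a = encode(audio, *enc_a)
mu_v, lv_v = encode(visual, *enc_v)
recon_a = decode(reparameterize(mu_a, lv_a, rng), w_dec)
recon_v = decode(reparameterize(mu_v, lv_v, rng), w_dec)

# Training would minimize this term to align the two latent distributions.
dist = w2_squared(mu_a, lv_a, mu_v, lv_v)
print(recon_a.shape, recon_v.shape, dist.shape)  # (4, 16) (4, 16) (4,)
```

In this sketch, minimizing the per-pair `dist` pulls the audio and visual latent Gaussians toward each other, which is the intuition behind constraining the shared latent space so that either modality can be decoded from it.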