Voice-Face Homogeneity Tells Deepfake

Cheng, H; Guo, Y; Wang, T; Li, Q; Chang, X; Nie, L

Publisher: ASSOC COMPUTING MACHINERY
Publication Type: Journal Article
Citation: ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 20, (3)
Issue Date: 2023-11-11

Closed Access

Filename	Description	Size	Format
Voice-Face Homogeneity Tells Deepfake.pdf	Accepted version	2.97 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Cheng, H
dc.contributor.author	Guo, Y
dc.contributor.author	Wang, T
dc.contributor.author	Li, Q
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Nie, L
dc.date.accessioned	2024-03-28T00:11:46Z
dc.date.available	2024-03-28T00:11:46Z
dc.date.issued	2023-11-11
dc.identifier.citation	ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 20, (3)
dc.identifier.issn	1551-6857
dc.identifier.issn	1551-6865
dc.identifier.uri	http://hdl.handle.net/10453/177304
dc.description.abstract	Detecting forged videos is highly desirable given the abuse of deepfakes. Existing detection approaches focus on specific artifacts in deepfake videos and fit well on certain data. However, advances in forgery techniques targeting these artifacts keep challenging the robustness of traditional deepfake detectors, and the development of these approaches has stalled as a result. In this article, we propose to perform deepfake detection from an unexplored voice-face matching view. Our approach is founded on two supporting points: first, there is a high degree of homogeneity between the voice and face of an individual (i.e., they are highly correlated), and second, deepfake videos often involve mismatched identities between the voice and face due to face-swapping techniques. To this end, we develop a voice-face matching method that measures the matching degree between these two modalities to identify deepfake videos. However, training on specific deepfake datasets makes the model overfit to certain traits of deepfake algorithms. We instead advocate a method that quickly adapts to unseen forgeries, using a pre-train-then-fine-tune paradigm. Specifically, we first pre-train the model on a generic audio-visual dataset and then fine-tune it on downstream deepfake data. We conduct extensive experiments on three widely used deepfake datasets: DFDC, FakeAVCeleb, and DeepfakeTIMIT. Our method obtains significant performance gains over other state-of-the-art competitors. For instance, it outperforms the baselines by nearly 2%, achieving an AUC of 86.11% on FakeAVCeleb. It is also worth noting that our method already achieves competitive results when fine-tuned on limited deepfake data.
dc.language	English
dc.publisher	ASSOC COMPUTING MACHINERY
dc.relation.ispartof	ACM Transactions on Multimedia Computing, Communications and Applications
dc.relation.isbasedon	10.1145/3625231
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0803 Computer Software, 0805 Distributed Computing, 0806 Information Systems
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4606 Distributed computing and systems software
dc.subject.classification	4607 Graphics, augmented reality and games
dc.title	Voice-Face Homogeneity Tells Deepfake
dc.type	Journal Article
utslib.citation.volume	20
utslib.for	0803 Computer Software
utslib.for	0805 Distributed Computing
utslib.for	0806 Information Systems
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2024-03-28T00:11:45Z
pubs.issue	3
pubs.publication-status	Published
pubs.volume	20
utslib.citation.issue	3
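
Illustrative sketch (not part of the repository record, and not the authors' implementation): the abstract describes scoring how well a voice matches a face and treating a poor match as a sign of forgery. The Python below shows one way that idea might look, assuming the voice and face have already been mapped to fixed-length embedding vectors by some pre-trained encoders. The function names, the use of cosine similarity as the matching score, the 0.5 threshold, and the toy vectors are all assumptions made for illustration; the record does not specify any of them. The pairwise AUC helper only illustrates how a metric like the reported AUC is computed from matching scores.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_likely_deepfake(
    voice_emb: Sequence[float],
    face_emb: Sequence[float],
    threshold: float = 0.5,  # assumed value; would be tuned on validation data
) -> bool:
    """Flag a clip when its voice and face embeddings match poorly."""
    return cosine_similarity(voice_emb, face_emb) < threshold


def pairwise_auc(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """AUC as the fraction of (real, fake) pairs where the real clip has the
    higher matching score; ties count as half a correct pair."""
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one fake score")
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


# Toy embeddings (hypothetical values, not taken from any dataset).
voice = [0.9, 0.1, 0.2]
matching_face = [0.8, 0.2, 0.1]      # same identity as the voice
mismatched_face = [-0.1, 0.9, -0.3]  # swapped-in identity

print(is_likely_deepfake(voice, matching_face))
print(is_likely_deepfake(voice, mismatched_face))
```

In a real system the embeddings would come from learned voice and face encoders, and the threshold would be chosen on held-out data rather than fixed in advance.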

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/177304