Voice-Face Homogeneity Tells Deepfake

Publisher:
ASSOC COMPUTING MACHINERY
Publication Type:
Journal Article
Citation:
ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 20, (3)
Issue Date:
2023-11-11
Filename Description Size
Voice-Face Homogeneity Tells Deepfake.pdfAccepted version2.97 MB
Adobe PDF
Full metadata record
Detecting forgery videos is highly desirable due to the abuse of deepfake. Existing detection approaches contribute to exploring the specific artifacts in deepfake videos and fit well on certain data. However, the growing technique on these artifacts keeps challenging the robustness of traditional deepfake detectors. As a result, the development of these approaches has reached a blockage. In this article, we propose to perform deepfake detection from an unexplored voice-face matching view. Our approach is founded on two supporting points: first, there is a high degree of homogeneity between the voice and face of an individual (i.e., they are highly correlated), and second, deepfake videos often involve mismatched identities between the voice and face due to face-swapping techniques. To this end, we develop a voice-face matching method that measures the matching degree between these two modalities to identify deepfake videos. Nevertheless, training on specific deepfake datasets makes the model overfit certain traits of deepfake algorithms. We instead advocate a method that quickly adapts to untapped forgery, with a pre-training then fine-tuning paradigm. Specifically, we first pre-train the model on a generic audio-visual dataset, followed by the fine-tuning on downstream deepfake data. We conduct extensive experiments over three widely exploited deepfake datasets: DFDC, FakeAVCeleb, and DeepfakeTIMIT. Our method obtains significant performance gains as compared to other state-of-the-art competitors. For instance, our method outperforms the baselines by nearly 2%, achieving an AUC of 86.11% on FakeAVCeleb. It is also worth noting that our method already achieves competitive results when fine-tuned on limited deepfake data.
Please use this identifier to cite or link to this item: