Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/157811
Title: | Deep Learning-Based Deepfake Voice Detection and Speaker Identification: A Study on Binary Classification, Multi-class Classification, and Cross-Model Detection |
Authors: | 洪瑞甫 Hung, Jui-Fu |
Contributors: | 廖文宏 Liao, Wen-Hung; 洪瑞甫 Hung, Jui-Fu |
Keywords: | Deep Learning; Audio Deepfake; Audio Deepfake Detection |
Date: | 2025 |
Issue Date: | 2025-07-01 15:06:15 (UTC+8) |
Abstract: | Voice cloning has wide applications in entertainment, customer service, and education, but poses significant fraud risks. This study investigates deepfake audio detection techniques from various perspectives.
We begin with a binary classification task using a ResNet model to distinguish between authentic and deepfake audio generated by GPT-SoVITS, OpenVoice, and XTTS_v2. Considering real-world fraud, we also evaluate the model under conditions with background noise.
Since deepfake voice fraud involves both deceiving listeners into mistaking fake audio for genuine speech and convincing them that the voice belongs to a target speaker, we also conduct a multi-class speaker identification task to examine whether the model can detect deepfakes while also identifying the speaker.
With the rapid evolution of deepfake technology, model generalization capability is crucial. We assess this by cross-testing with various deepfake sources and expanding the training data to improve the model’s robustness.
The experimental results indicate that models trained with deepfake audio from a single source perform consistently well when detecting deepfake audio from that same source, in both binary and multi-class classification tasks. However, performance becomes unstable when the deepfake audio sources in the training and testing datasets differ.
Triplet loss effectively improves performance on the binary classification task. For the multi-class task, a two-stage training strategy proves beneficial: the binary classification task is first trained with triplet loss, and the multi-class task is then trained primarily with cross-entropy loss, supplemented by center loss. Maintaining consistency between the two tasks with KL divergence significantly improves the model's ability to distinguish speaker identities and to differentiate between genuine and fake audio for a given speaker across varying deepfake audio sources. |
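The two-stage training strategy summarized in the abstract combines triplet loss, cross-entropy, center loss, and a KL-divergence consistency term. The following is a minimal PyTorch-style sketch of how such a combined objective could be assembled; the function names, tensor layouts, and loss weights (lambda_center, lambda_kl) are illustrative assumptions, not the implementation used in the thesis.

import torch
import torch.nn.functional as F

# Stage 1: pull genuine and deepfake embeddings apart with triplet loss.
triplet = torch.nn.TripletMarginLoss(margin=1.0)

def stage1_loss(anchor_emb, positive_emb, negative_emb):
    return triplet(anchor_emb, positive_emb, negative_emb)

# Auxiliary center loss: penalize the distance between each embedding
# and the center of its class (centers: [num_classes, emb_dim]).
def center_loss(embeddings, labels, centers):
    return ((embeddings - centers[labels]) ** 2).sum(dim=1).mean()

# Stage 2: cross-entropy as the primary loss, center loss as an auxiliary
# term, and a KL-divergence term that keeps the real/fake output of the
# multi-class model consistent with the stage-1 binary model.
def stage2_loss(speaker_logits, speaker_labels, embeddings, centers,
                binary_logits_stage1, binary_logits_stage2,
                lambda_center=0.01, lambda_kl=0.1):
    ce = F.cross_entropy(speaker_logits, speaker_labels)
    cl = center_loss(embeddings, speaker_labels, centers)
    kl = F.kl_div(F.log_softmax(binary_logits_stage2, dim=1),
                  F.softmax(binary_logits_stage1, dim=1),
                  reduction="batchmean")
    return ce + lambda_center * cl + lambda_kl * kl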
Description: | Master's thesis; National Chengchi University, Department of Computer Science; 111753131
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0111753131 |
Data Type: | thesis |
Appears in Collections: | [Department of Computer Science] Theses
Files in This Item:
File | Description | Size | Format
313101.pdf | | 12861Kb | Adobe PDF
All items in 政大典藏 are protected by copyright, with all rights reserved.