    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/132064


    Title: 跨語言遷移學習在惡意留言偵測上的應用
    Cross-lingual Transfer Learning for Toxic Comment Detection
    Authors: 陳冠宇
    Chen, Kuan-Yu
    Contributors: 蔡炎龍
    陳冠宇
    Chen, Kuan-Yu
    Keywords: Transformer
    XLM-R
    cross-lingual prediction
    toxic comments
    imbalanced data
    deep learning
    conversation safety
    Date: 2020
    Issue Date: 2020-10-05 15:16:14 (UTC+8)
    Abstract: The Transformer model opened a door in the field of natural language processing and pushed the field a large step forward by allowing models to better capture the relationships among words. Its architecture has also been extended into many language models, such as the cross-lingual models XLM and XLM-R, which have achieved strong results across a wide range of tasks. In this thesis, we show that data from a high-resource language can compensate for the scarcity of data in a low-resource language, taking the prediction of whether a comment is toxic as our example task. We use the English data released by the Jigsaw Multilingual Toxic Comment Classification competition and comments from the PTT Hate board as training sets, and ask the model to predict toxic comments in Chinese; the English dataset is far larger than the Chinese one. We compare three settings: fine-tuning the model on English data only, on Chinese data only, and on the two datasets combined. Because of its larger volume, the English-only setting gives the best single-language result, with an accuracy of 75.9%, while the mixed setting achieves the highest overall accuracy of 88.3%. In general, cross-lingual models can make up for the lack of low-resource language data and offer another way to address the low-resource problem.
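    The setup described in the abstract maps naturally onto an off-the-shelf cross-lingual encoder. Below is a minimal sketch of the three fine-tuning settings using the Hugging Face transformers library; the file names (jigsaw_en.csv, ptt_hate.csv), the text/label column schema, and the hyperparameters are illustrative assumptions, not details taken from the thesis.

    import pandas as pd
    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=2)  # binary: toxic vs. non-toxic

    class CommentDataset(Dataset):
        # Assumes a dataframe with "text" and "label" columns (hypothetical schema).
        def __init__(self, df):
            self.enc = tokenizer(list(df["text"]), truncation=True,
                                 padding="max_length", max_length=128)
            self.labels = list(df["label"])

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    # The three settings compared in the thesis: English only, Chinese only, mixed.
    english = pd.read_csv("jigsaw_en.csv")   # hypothetical file names
    chinese = pd.read_csv("ptt_hate.csv")
    mixed = pd.concat([english, chinese], ignore_index=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="xlmr-toxic",
                               num_train_epochs=2,
                               per_device_train_batch_size=16),
        train_dataset=CommentDataset(mixed),  # swap in english/chinese for the other runs
    )
    trainer.train()

    Swapping the training set among english, chinese, and mixed reproduces the three comparisons; evaluating each fine-tuned model on a held-out Chinese test set would then yield accuracy figures like those reported above.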
    Description: Master's thesis
    National Chengchi University
    Department of Applied Mathematics
    107751010
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0107751010
    Data Type: thesis
    DOI: 10.6814/NCCU202001728
    Appears in Collections: [Department of Applied Mathematics] Theses

    Files in This Item:

    File: 101001.pdf
    Size: 1346 KB
    Format: Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.

