    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/152567


    Title: 跨領域多教師知識蒸餾的強化學習方法探索
    Exploration of Reinforcement Learning Methods for Cross-Domain Multi-Teacher Knowledge Distillation
    Authors: 洪得比
    Hong, De-Bi
    Contributors: 謝佩璇
    Hsieh, Pei-Hsuan
    洪得比
    Hong, De-Bi
    Keywords: 自然語言處理
    知識蒸餾
    強化學習
    領域適應
    Natural Language Processing
    Knowledge Distillation
    Reinforcement Learning
    Domain Adaptation
    Date: 2024
    Issue Date: 2024-08-05 12:44:52 (UTC+8)
Abstract: With the growing popularity of social media, sentiment analysis techniques play an increasingly important role in capturing the dynamics of public opinion. However, although existing large sentiment analysis models perform very well, their huge parameter counts and computational costs raise efficiency and cost challenges, especially in resource-constrained settings; moreover, manually labeling the large datasets that are collected is time-consuming, which in turn makes training such models expensive. To address this problem, this study proposes a sentiment analysis model compression method based on cross-domain dynamic knowledge distillation.
First, this study proposes a dynamic teacher selection strategy. Whereas traditional knowledge distillation usually relies on a fixed teacher model, this study applies reinforcement learning to dynamically select the optimal combination of teacher models according to the student model's state representation, providing more effective knowledge guidance and further improving distillation efficiency. Second, building on knowledge distillation, it introduces a cross-domain perspective: teacher models are selected from multiple source domains, and the feature representations of the student and teacher models are matched through hidden layers and attention mechanisms, yielding a cross-domain knowledge distillation loss function that narrows the student model's performance gap on the target domain.
Experiments on several review datasets show that the proposed method significantly compresses the model while maintaining performance comparable to the large teacher models. For example, with BERT-base as the teacher, the compressed 6-layer and 3-layer BERT student models improve accuracy on binary sentiment classification by 0.2% to 1% over traditional KD, while greatly reducing the parameter count and computation time. The proposed cross-domain dynamic knowledge distillation method offers a new solution and technical approach for applying large sentiment analysis models.
With the growing popularity of social media, sentiment analysis techniques play a pivotal role in capturing the dynamics of public opinion. However, while large-scale sentiment analysis models exhibit excellent performance, their vast parameter sizes and computational costs pose efficiency and cost challenges, especially in resource-constrained environments.
This research proposes a sentiment analysis model compression method based on cross-domain dynamic knowledge distillation. Firstly, this research introduces a dynamic teacher selection strategy that utilizes reinforcement learning to dynamically choose the optimal combination of teacher models based on the student model's state representation, providing more effective knowledge guidance. Secondly, this research introduces the concept of cross-domain distillation by selecting teacher models from multiple source domains.
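The abstract does not spell out the selection algorithm, so the following is only a minimal sketch of how reinforcement-learning-based teacher selection could be realized, using a REINFORCE-style policy over a set of candidate teachers. TeacherSelector, the Bernoulli selection mask, the state dimension, and the toy reward are illustrative assumptions, not the thesis implementation.

import torch
import torch.nn as nn

class TeacherSelector(nn.Module):
    """Policy network: maps the student's state vector to per-teacher selection probabilities."""
    def __init__(self, state_dim: int, num_teachers: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_teachers),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Independent probability that each candidate teacher joins the distillation step.
        return torch.sigmoid(self.net(state))

def reinforce_step(selector, optimizer, state, reward_fn):
    # Sample a teacher subset, observe a scalar reward (e.g. the student's
    # dev-set gain after distilling from that subset), and update the policy.
    probs = selector(state)                       # shape: (num_teachers,)
    dist = torch.distributions.Bernoulli(probs)
    mask = dist.sample()                          # 1.0 = teacher is selected
    reward = reward_fn(mask)                      # supplied by the caller
    loss = -(dist.log_prob(mask).sum() * reward)  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return mask, reward

if __name__ == "__main__":
    torch.manual_seed(0)
    selector = TeacherSelector(state_dim=768, num_teachers=4)
    optimizer = torch.optim.Adam(selector.parameters(), lr=1e-3)
    # Toy reward standing in for dev-set accuracy: prefers teachers 0 and 2.
    toy_reward = lambda m: float(m[0] + m[2] - 0.5 * m.sum())
    for _ in range(200):
        state = torch.randn(768)  # placeholder for the student's state features
        reinforce_step(selector, optimizer, state, toy_reward)
    print(selector(torch.randn(768)).detach())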
    It proposes a cross-domain knowledge distillation loss function that employs hidden layers and attention mechanisms to align the feature representations of student and teacher models, reducing the performance gap of the student model in the target domain.
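As a rough illustration of what aligning hidden-layer and attention features between a student and a cross-domain teacher can look like, here is a minimal sketch in the spirit of layer-matching distillation. The linear projection, the layer mapping, and the shared head count are simplifying assumptions, not the thesis's exact loss function.

import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_alignment_loss(student_hiddens, teacher_hiddens,
                           student_attns, teacher_attns,
                           proj: nn.Linear, layer_map):
    # student_hiddens[i]: (batch, seq, d_student); teacher_hiddens[j]: (batch, seq, d_teacher)
    # student_attns[i] / teacher_attns[j]: (batch, heads, seq, seq)
    # layer_map: (student_layer, teacher_layer) pairs chosen for alignment
    loss = 0.0
    for s_idx, t_idx in layer_map:
        # Project student hidden states into the teacher's hidden size, then match with MSE.
        loss = loss + F.mse_loss(proj(student_hiddens[s_idx]), teacher_hiddens[t_idx])
        # Match the attention maps of the paired layers directly.
        loss = loss + F.mse_loss(student_attns[s_idx], teacher_attns[t_idx])
    return loss / len(layer_map)

if __name__ == "__main__":
    batch, seq, heads, d_student, d_teacher = 2, 16, 12, 384, 768
    proj = nn.Linear(d_student, d_teacher)
    s_h = [torch.randn(batch, seq, d_student) for _ in range(3)]
    t_h = [torch.randn(batch, seq, d_teacher) for _ in range(6)]
    s_a = [torch.rand(batch, heads, seq, seq) for _ in range(3)]
    t_a = [torch.rand(batch, heads, seq, seq) for _ in range(6)]
    # Map the 3 student layers onto alternating teacher layers.
    print(feature_alignment_loss(s_h, t_h, s_a, t_a, proj, [(0, 1), (1, 3), (2, 5)]))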
Experiments on multiple review datasets demonstrate that the proposed method maintains performance comparable to large teacher models while significantly compressing the model size. For example, when using BERT-base as the teacher model, the accuracy of the compressed 6-layer and 3-layer BERT student models on the binary sentiment classification task improves by 0.2% to 1% compared to traditional KD, while the parameter count and computation time are greatly reduced. This research thus offers a novel solution and technique for the application of large-scale sentiment analysis models.
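For context, the "traditional KD" baseline mentioned in these results usually refers to Hinton-style soft-target distillation. The snippet below is a standard sketch of that baseline loss, with illustrative temperature and weighting values, shown only to clarify what the reported 0.2% to 1% accuracy gain is measured against.

import torch
import torch.nn.functional as F

def traditional_kd_loss(student_logits, teacher_logits, labels,
                        temperature: float = 2.0, alpha: float = 0.5):
    # Soft-target term: KL divergence between temperature-softened distributions,
    # scaled by T^2 as is conventional; alpha balances it against hard-label CE.
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

if __name__ == "__main__":
    student_logits = torch.randn(8, 2)   # binary sentiment logits from the compressed student
    teacher_logits = torch.randn(8, 2)   # logits from a BERT-base-sized teacher
    labels = torch.randint(0, 2, (8,))
    print(traditional_kd_loss(student_logits, teacher_logits, labels))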
Description: Master's thesis
National Chengchi University
Department of Computer Science
111753117
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0111753117
    Data Type: thesis
Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

File Size Format
311701.pdf 1390 KB Adobe PDF



