政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/159321

English | 正體中文 | 简体中文 | Post-Print筆數 : 27 | 全文筆數/總筆數 : 118786/149850 (79%)
造訪人次 : 81852406 線上人數 : 473

RC Version 6.0 © Powered By DSPACE, MIT. Enhanced by NTU Library IR team.

搜尋範圍

查詢小技巧：

您可在西文檢索詞彙前後加上"雙引號"，以獲取較精準的檢索結果

若欲以作者姓名搜尋，建議至進階搜尋限定作者欄位，可獲得較完整資料

進階搜尋

主頁 ‧ 登入 ‧ 上傳 ‧ 說明 ‧ 關於政大典藏 ‧ 管理

到手機版

政大機構典藏 > 理學院 > 應用數學系 > 學位論文 > Item 140.119/159321

請使用永久網址來引用或連結此文件: https://nccur.lib.nccu.edu.tw/handle/140.119/159321

題名:	基於階層式聚類的文本檢索樹於增強式生成系統之應用：以台灣法規為例 Tree-Based Text Retrieval via Hierarchical Clustering in RAG Frameworks：Application on Taiwanese Regulations
作者:	余嘉恆 Yu, Chia-Heng
貢獻者:	蔡炎龍 Tsai, Yeng-Lung 余嘉恆 Yu, Chia-Heng
關鍵詞:	檢索增強生成（RAG）階層式聚類向量語意檢索人工智慧語意層次結構文本檢索 Retrieval-Augmented Generation (RAG) Hierarchical Clustering Text Retrieval Semantic Vector Search AI Semantic Hierarchy
日期:	2025
上傳時間:	2025-09-01 16:31:04 (UTC+8)
摘要:	本篇文章將探討Retrieval-Augmented Generation (RAG)技術並將其應用於法規的檢索與生成。我們蒐集了台灣法規作為資料集，以pre-trained model將文字轉換為向量後，透過自行設計的演算法進行檢索。檢索方式採用餘弦距離作為Hierarchical clustering的度量，將相關的文本向量進行聚類，同時將使用者的query轉為向量後以Breadth-first search(BFS)找出tree中與query向量最接近的node，並將該node的子樹中的所有leaf node回傳作為檢索結果。生成方面，我們詢問了相關領域專家是如何逐步分解題目情境與梳理脈絡，將專家們的意見融合prompt當中，採用了CoT技術來引導模型從檢索文件中生成更完整且具結構性的回應。在經過專家評分與假設檢定實驗後，我們確認了相對於原始採用Sematic search的RAG系統，我們設計的系統有助於提高檢索結果的精準度，增加生成器回答時的正確率。 This study explores the application of Retrieval-Augmented Generation (RAG) techniques to the retrieval and generation of legal statutes. We compiled a dataset consisting of Taiwanese legal regulations and utilized a pre-trained model to encode the texts into vector representations. Retrieval is performed using a custom-designed algorithm that applies cosine distance as the similarity metric for hierarchical clustering of the document embeddings. User queries are also converted into vectors, and a Breadth-First Search (BFS) is employed to identify the node in the cluster tree most similar to the query vector. All leaf nodes within the corresponding subtree are returned as the retrieval results. For the generation stage, we cooperate with domain experts and enquiry how legal professionals interpret and decompose complex legal scenarios. Their guidance was embedded into the prompt design using Chain-of-Thought (CoT) techniques to steer the model in generating more complete and well-structured responses. Through expert evaluation and hypothesis testing, we demonstrate that our enhanced system improves retrieval precision, outperforming baseline RAG implementations based on standard semantic search.
參考文獻:	[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S.Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. [3] Tomas Mikolov, Kai Chen, G.s Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR, 2013, 01 2013. [4] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1188–1196, Bejing, China, 22–24 Jun 2014. PMLR. [5] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2), January 2025. [6] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. [7] Kishore Papineni. Why inverse document frequency? In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1–8, 2001. [8] Peter F Brown, Vincent J Della Pietra, Peter V deSouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992. [9] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13:55–75, 08 2018. [10] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. [11] Saeed Damadi, Golnaz Moharrer, Mostafa Cham, and Jinglai Shen. The backpropagation algorithm for a math student. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 01–09, 2023. [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. [13] Jimmy Ba, Jamie Kiros, and Geoffrey Hinton. Layer normalization, 2016. arXiv preprint arXiv:1607.06450. [14] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010. [15] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics. [16] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, page 6491–6501, New York, NY, USA, 2024. Association for Computing Machinery. [17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. [18] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. [19] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. arXiv preprint arXiv:2302.13971. [20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. [21] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library, 2024. arXiv preprint arXiv:2401.08281. [22] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report, 2024. arXiv preprint arXiv:2402.05672. [23] OpenAI. Gpt-4o-mini models, 2025. Available at: https://platform.openai.com/docs/models/gpt-4o-mini (accessed: 2025-04-12).
描述:	碩士國立政治大學應用數學系 112751004
資料來源:	http://thesis.lib.nccu.edu.tw/record/#G0112751004
資料類型:	thesis
顯示於類別:	[應用數學系] 學位論文

文件中的檔案:

檔案	大小	格式	瀏覽次數
100401.pdf	1286Kb	Adobe PDF	0	檢視/開啟

在政大典藏中所有的資料項目都受到原著作權保護.

社群 sharing

著作權政策宣告 Copyright Announcement

1.本網站之數位內容為國立政治大學所收錄之機構典藏，無償提供學術研究與公眾教育等公益性使用，惟仍請適度，合理使用本網站之內容，以尊重著作權人之權益。商業上之利用，則請先取得著作權人之授權。
The digital content of this website is part of National Chengchi University Institutional Repository. It provides free access to academic research and public education for non-commercial use. Please utilize it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2.本網站之製作，已盡力防止侵害著作權人之權益，如仍發現本網站之數位內容有侵害著作權人權益情事者，請權利人通知本網站維護人員(nccur@nccu.edu.tw)，維護人員將立即採取移除該數位著作等補救措施。
NCCU Institutional Repository is made to protect the interests of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff(nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.

DSpace Software Copyright © 2002-2004 MIT & Hewlett-Packard / Enhanced by NTU Library IR team Copyright © - 回饋