政大機構典藏 - National Chengchi University Institutional Repository (NCCUR): Item 140.119/150166
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/150166

    Title: 綜合分群技術與 BERT 模型於文件推薦的探索
    An Exploration of Integrating Clustering and BERT Models for Document Recommendation
    Authors: 陳筠
    Chen, Yun
    Contributors: 劉昭麟
    Liu, Chao-Lin
    Chen, Yun
    Keywords: 深度學習
    Deep learning
    document embeddings
    semi-supervised clustering
    Date: 2024
    Issue Date: 2024-03-01 13:41:20 (UTC+8)
    Abstract: Picking out content of interest from a large body of data usually takes considerable human effort, either to browse the documents or to label them for classification. Clustering, which groups similar documents together, is a faster and cheaper alternative. To find similar documents more efficiently for document recommendation, this study vectorizes documents with fine-tuned BERT models and clusters the resulting embeddings with K-means. It also experiments with “seed clustering”, that is, K-means started from designated initial centroids, with the aim of clustering unlabeled data effectively from only a few clues.
    The results show that clustering on fine-tuned BERT embeddings far outperforms clustering on embeddings from BERT without fine-tuning and clustering on TF-IDF vectors. However, K-means on BERT embeddings proves highly stable: repeated runs produce nearly identical clusters. This stability carries over to seed clustering, so the seed clustering methods tested in this study improve the clustering only marginally. Future work could therefore keep fine-tuned BERT embeddings as the basis while trying other clustering and seed clustering methods.
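
The seed clustering idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes plain Lloyd-style K-means written in pure Python, toy 2-D vectors standing in for BERT document embeddings, and hypothetical helper names (`kmeans`, `seed_centroids`, `rand_index`). The initial centroids are the averages of a few hand-picked seed documents per cluster, and a pair-counting Rand index compares the seeded partition with one obtained from a random start.

```python
import math
import random
from itertools import combinations


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm started from the given initial centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centroids


def seed_centroids(points, seeds):
    """Average the hand-picked seed documents of each intended cluster."""
    return [
        [sum(xs) / len(idxs) for xs in zip(*(points[i] for i in idxs))]
        for idxs in seeds
    ]


def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy 2-D vectors standing in for document embeddings (hypothetical data):
# three well-separated topics with 20 documents each.
rng = random.Random(0)
topics = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    [cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)]
    for cx, cy in topics
    for _ in range(20)
]

# Two seed documents per cluster: the "few clues" a user would supply.
seeds = [[0, 1], [20, 21], [40, 41]]
seeded_labels, _ = kmeans(points, seed_centroids(points, seeds))

# Baseline: ordinary K-means started from three randomly chosen documents.
random_labels, _ = kmeans(points, rng.sample(points, 3))

print("agreement between seeded and random start:",
      rand_index(seeded_labels, random_labels))
```

When the embeddings already separate the topics cleanly, as in this toy data, a random start tends to reach the same partition as a seeded one, which is one way to read the abstract's observation that seeding added little. In scikit-learn, passing an array of centroids as the `init` argument of `KMeans` (with `n_init=1`) plays the same role as `init_centroids` here.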
    Description: Master's thesis
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0109753140
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File: 314001.pdf
    Size: 8776 Kb
    Format: Adobe PDF

    All items in 政大典藏 are protected by copyright, with all rights reserved.

    Copyright Announcement
    The digital content of this website is part of the National Chengchi University Institutional Repository. It provides free access to academic research and public education for non-commercial use. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

    The NCCU Institutional Repository is designed to protect the interests of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.