National Chengchi University Institutional Repository (NCCUR): Item 140.119/103991
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/103991


    Title: 適用於中文史料文本之標記式主題模型分析方法研究
    An Enhanced Topic Model Based on Labeled LDA for Chinese Historical Corpora
    Authors: 陳奕安
    Contributors: 蔡銘峰
    陳奕安
    Keywords: Topic model
    Labeled topic model
    Latent Dirichlet allocation
    Date: 2016
    Issue Date: 2016-11-14 16:15:00 (UTC+8)
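    Abstract: This thesis proposes a topic analysis method suited to Chinese historical texts. It builds on the Labeled Latent Dirichlet Allocation (LLDA) algorithm so that words related to specific topics can be found from manually labeled Chinese texts. In the proposed algorithm, we add topic seed-word information to strengthen the clustering results of LDA, so that the clustered words are more closely related to their topics. In recent years, with the spread of the Internet, the rapid development of information retrieval, and the growth of digital archives, more and more physical books have been edited into digital editions and annotated with metadata. Once these valuable historical texts are available, how to apply text mining techniques to them becomes an important research question. In particular, identifying document topics in large collections of historical texts interests many scholars, and the LDA topic model is a classic method in the text mining field. In this study we find that traditional LDA has some problems in describing the clustered topics, including the high randomness of topic categories and the low readability of individual topics, which make subsequent interpretation very difficult. We therefore adopt Labeled LDA, a labeled topic model derived from LDA, and restrict the topic categories that can be generated in order to reduce this randomness. We additionally take into account the length of Chinese words and user-defined related seed words, so that the clustered topic words are more relevant to their topics and easier to describe. In the experiments, we extract topic words with the improved algorithm and have them manually labeled; the labels then serve as ground truth for computing information retrieval evaluation measures such as Mean Average Precision (MAP). The results confirm that the clusters obtained by considering long words and by considering seed words are both better than those produced by the traditional topic model. In addition, we compare the final results with words weighted by TF-IDF, and the experimental results show the differences between the two.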
    This paper proposes an enhanced topic model based on Labeled Latent Dirichlet Allocation (LLDA) for Chinese historical corpora to discover words related to specific topics. To improve on traditional LDA and to increase the readability of its clustered words, we incorporate seed-word information and Chinese word length into the algorithm. In this study, we find that traditional LDA has some problems with topic descriptions after clustering. We therefore apply the Labeled LDA algorithm, which is derived from traditional LDA, together with the proposed improvements that consider word length and related seed words. In our experiments, Mean Average Precision (MAP) is used to evaluate the results against topic words labeled manually by history experts. The experimental results show that the proposed method, which considers both Chinese word length and seed words, performs better than the traditional LDA method. In addition, we compare the proposed results with the TF-IDF weighting scheme, and the proposed method also outperforms the TF-IDF method significantly.
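
    The MAP evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's evaluation code: the function names, the toy topics and words, and the choice to normalize by the number of relevant words are assumptions made here for illustration only. Each topic's ranked word list is scored against the set of words a labeler marked as relevant, and the per-topic average precision values are then averaged.

```python
from typing import Dict, List, Set


def average_precision(ranked_words: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked word list against the relevant set.

    Assumption: normalize by the number of relevant words, so a relevant
    word missing from the ranking counts as a miss.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, word in enumerate(ranked_words, start=1):
        if word in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-topic average precision values."""
    if not rankings:
        return 0.0
    scores = [
        average_precision(words, judgments.get(topic, set()))
        for topic, words in rankings.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two topics, each with a ranked word list
# and a manually labeled set of relevant words.
rankings = {
    "topic_a": ["w1", "w2", "w3", "w4"],
    "topic_b": ["w5", "w6", "w7"],
}
judgments = {
    "topic_a": {"w1", "w3"},
    "topic_b": {"w6"},
}

map_score = mean_average_precision(rankings, judgments)
print(round(map_score, 4))  # 0.6667
```

    In this toy case topic_a scores (1/1 + 2/3) / 2 and topic_b scores 1/2, so the mean is 2/3. A higher MAP means the words judged relevant sit nearer the top of each topic's list.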
    Description: Master's
    National Chengchi University
    Department of Computer Science
    102753031
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0102753031
    Data Type: thesis
    Appears in Collections:[Department of Computer Science ] Theses

    Files in This Item:

    File        Size     Format
    303101.pdf  11045Kb  Adobe PDF  21606


    All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.

