National Chengchi University Institutional Repository (NCCUR): Item 140.119/89066

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/89066

Title:	中文文本探勘工具：主題分析、詞組關聯強度、相關句擷取 Tools for Chinese Text Mining: Topic Analysis, Association Strengths of Collocations, Extraction of Relevant Statements
Authors:	林書佑 Lin, Shu Yu
Contributors:	劉昭麟 Liu, Chao Lin; 林書佑 Lin, Shu Yu
Keywords:	文本探勘; 主題分析; 詞組關聯強度; 相關句擷取; Text Mining; Topic Analysis; Association Strengths of Collocations; Extraction of Relevant Statements
Date:	2016
Issue Date:	2016-05-02 13:55:23 (UTC+8)
Abstract:	As data are digitized rapidly and in large volumes, every field relies increasingly on information mining and analysis techniques. In the digital humanities, the topic has drawn growing attention since the International Conference of Digital Archives and Digital Humanities began in 2009. The main aim is to combine digitized artifacts with information analysis and visualization, so that interpretation at different levels builds a more complete picture of cultural materials. In this study we build a set of tools for analyzing Chinese corpora. The tools compute the association strengths of key terms in a corpus using latent semantic analysis, pointwise mutual information, Pearson's chi-squared test, typed-dependency distance, word2vec, and Gibbs sampling for latent Dirichlet allocation; apply clustering to identify possible topics; and finally extract the sentences relevant to the clustering results to help humanities scholars analyze and interpret the texts. By offering several perspectives on a corpus, the tools aim to make corpus-based research more efficient. We used People's Daily (《人民日報》), New Youth (《新青年》), United Daily News (《聯合報》), and China Times (《中國時報》) as corpora for experiments and testing. The results of analyzing New Youth with the tools were provided to humanities scholars as reference information and supporting evidence for their interpretation, and a paper was published at the 2015 International Conference of Digital Archives and Digital Humanities. We currently evaluate the tools' effectiveness on a variety of Chinese corpora, and in the future we will release the tools publicly so that more scholars can use them and spend less time on corpus analysis.
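Two of the association measures named in the abstract, pointwise mutual information and Pearson's chi-squared test, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `association_scores`, the use of sentence-level co-occurrence as the counting unit, and the assumption that the text is already segmented into words are all choices made here for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def association_scores(sentences):
    """Score word pairs by PMI and Pearson's chi-squared.

    `sentences` is a list of token lists (already word-segmented).
    Co-occurrence is counted per sentence, which is an assumption of this
    sketch. Returns {(w1, w2): (pmi, chi2)} with each pair sorted.
    """
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing both words
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), o11 in pair_df.items():
        # 2x2 contingency table over sentences
        o12 = word_df[a] - o11        # a present, b absent
        o21 = word_df[b] - o11        # b present, a absent
        o22 = n - o11 - o12 - o21     # neither present
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        pmi = math.log2(o11 * n / (word_df[a] * word_df[b]))
        # Pearson's chi-squared statistic for a 2x2 table
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        scores[(a, b)] = (pmi, chi2)
    return scores


if __name__ == "__main__":
    # Toy, hand-segmented corpus (hypothetical data).
    corpus = [["自由", "民主"], ["自由", "民主"], ["科學"], ["科學"]]
    for pair, (pmi, chi2) in association_scores(corpus).items():
        print(pair, round(pmi, 3), round(chi2, 3))
```

Pairs with high scores are candidates for strongly associated collocations. A real corpus would also need a minimum-frequency cutoff, since PMI tends to overrate rare pairs.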
Description:	Master's thesis; National Chengchi University; Department of Computer Science; 102753020
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0102753020
Data Type:	thesis
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Size	Format
302001.pdf	3880Kb	Adobe PDF

All items in the NCCU Institutional Repository (政大典藏) are protected by copyright, with all rights reserved.
