政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/125515


Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/125515

Title:	The Application of Keywords Extraction (關鍵詞偵測方法的比較與應用; literally "Comparison and Application of Keyword Detection Methods")
Authors:	許承恩 Hsu, Cheng-En
Contributors:	余清祥 Yue, Ching-Syang; 鄭文惠 Cheng, Wen-Huei; 許承恩 Hsu, Cheng-En
Keywords:	Text mining (文字探勘); Keyword extraction (關鍵字擷取); Digital humanities (數位人文); Machine learning (機器學習); Term Frequency Inverse Document Frequency (詞頻與文本頻率)
Date:	2019
Issue Date:	2019-09-05 15:41:42 (UTC+8)
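Abstract:	(Translated from the Chinese abstract) In recent years, large-scale digitization of texts has made text mining a popular research field. More and more studies use quantitative techniques to uncover the meaning of texts, offering semantic readings from angles that differ from expert opinion. After a text has been structurized, statistical and machine learning models are built according to the need at hand, such as keyword extraction, discovery of latent topics, sentiment analysis, and public opinion analysis. Keyword extraction can be used to interpret an author's ideas, improve reading efficiency, and capture writing style as well as changes in the historical context in which an article was published. This study also takes keyword identification as its goal. In addition to proposing an unsupervised statistical method, it uses Chinese texts to evaluate the new method against several common keyword detection methods, including TF-IDF (Term Frequency Inverse Document Frequency), which is popular on the web, logistic regression from statistical analysis, and common machine learning models. The empirical analysis uses two vernacular Chinese sources, People's Daily (《人民日報》) and New Youth magazine (《新青年》): the People's Daily material consists of 514 human-rights-related reports from 1971 to 1989, and the New Youth material consists of Volume 7 (1919) and Volume 8 (1920). Each text is roughly 400,000 to 600,000 characters long. Humanities scholars first marked the keywords of each text, which are treated as the ground truth; the three kinds of methods above are then applied to select candidate keywords, and their differences from expert opinion and their accuracy are compared. We also compare manually and automatically selected keywords and explore the possibility of combining the strengths of both approaches.
(Author's English abstract) Text mining has become one of the popular research areas since IBM proposed the term Big Data in 2010. Since then, many texts have been digitized, and more scholars have devoted themselves to developing quantitative tools that give texts semantic meaning without the help of human experts. This greatly increases the efficiency of reading a huge amount of text, provided that the texts are properly structurized. The structurization of texts includes quite a few steps, such as keyword extraction and sentiment analysis. Keyword extraction is critical: keywords can be used to summarize an article and to compare two authors' writing styles. The goal of this study is to propose a new unsupervised method for extracting keywords and to compare it with some frequently used methods, including term frequency inverse document frequency (TF-IDF), logistic regression, and machine learning models. In the empirical analysis, we considered three modern Chinese texts, one from People's Daily (514 articles from 1971-1989) and two from New Youth Magazine (volumes 7 and 8, 1919-1920). Each text contains approximately 400,000 to 600,000 words. We asked historical scholars to pick keywords from these three texts and treated them as the true keywords. Then we applied different keyword extraction methods to these texts and compared their results. We found that the proposed method has the best performance among all unsupervised methods and is competitive with the supervised methods.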
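The abstract names TF-IDF as one of the baseline methods but does not spell out a formula. As a rough illustration only, the sketch below scores terms with one common textbook variant (relative term frequency times the natural log of inverse document frequency) and ranks them as keyword candidates. It is not the thesis's proposed method, and the weighting variant, the function names `tf_idf` and `top_keywords`, and the toy corpus are all assumptions made for this example; real Chinese text would first need word segmentation, which is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists (already segmented into words).
    Score = (count / document length) * ln(N / document frequency),
    an assumed textbook variant, not necessarily the one in the thesis.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=3):
    """Rank the terms of one document by descending TF-IDF score."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


# Toy, made-up corpus (not data from the thesis).
docs = [
    ["human", "rights", "report", "rights"],
    ["youth", "magazine", "report"],
    ["youth", "literature", "magazine"],
]
scores = tf_idf(docs)
print(top_keywords(scores[0], k=2))
```

In this toy corpus, "rights" ranks first for the first document because it is frequent there and absent elsewhere, while a term that occurs in every document would score zero under this variant.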
Description:	Master's thesis; National Chengchi University; Department of Statistics; 106354020
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0106354020
Data Type:	thesis
DOI:	10.6814/NCCU201900953
Appears in Collections:	[Department of Statistics] Theses

Files in This Item:

File	Size	Format
402001.pdf	2877Kb	Adobe PDF

All items in 政大典藏 are protected by copyright, with all rights reserved.
