政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/99804
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/99804


    Title: 應用序列標記技術於地方志的實體名詞辨識
    Named Entity Recognition in Difangzhi Using Sequential Labeling Techniques
    Authors: 黃致凱
    Huang, Chih Kai
    Contributors: 劉昭麟
    Liu, Chao Lin
    黃致凱
    Huang, Chih Kai
    Keywords: 文字探勘
    實體名詞辨識
    機器學習
    數位人文
    Text Mining
    Named Entity Recognition
    Machine Learning
    Digital Humanity
    Date: 2016
    Issue Date: 2016-08-09 11:24:27 (UTC+8)
    Abstract: 地方志是中國過去由官方編輯的地方記事的資料,其內容包含廣泛,包含人物傳記、地理環境、任官紀錄等等,從中包含了很多現在還沒被整理出的人、事、物,由於地方志文本使用的詞彙、語法架構與現今的中文有相當大的差異,且文本中大多數沒有標點符號,所以面對的是沒有經過斷詞、斷句、斷段落的序列文字資料,所以並不適用現有的自然語言處理工具來做處理分析。因此,本研究針對地方志類型的資料去建立對應的實體名詞辨識模型,以序列標記方式標記出人名與地名的資訊,以及加入官職、入仕、年號以及日期等標記資訊,透過標記資料去從中找出更多中國古代人物的資料。
    本研究透過監督式學習的方式去做機器學習來產生序列標記模型,首先從過去整理好的地方志中的人物資訊,抽取人、地名的資訊,並配合已知的名詞表來標記過去曾處理過的地方志語料,即使透過人工整理,過去所整理的資料還是有不正確的地方,這裡先經由前處理對資料都進一步的整理,然後標記時會產生歧義性的問題,我們提出了三種方法來進行標記,來解決歧義問題,並透過條件隨機場作為序列標記模型,同時配合名詞表、規則去做預先標記。透過實驗,去對未處理過的地方志語料做實體名詞辨識,辨識人名準確率皆可達到80%以上,另外再地名辨識部分可達到86%,能有如此好的辨識效果主因在於整理好的地方志語料與實驗語料之間敘述及記錄方式相似度是相當高的。運用標記的結果,試著用簡易的方法來做連結人名與地名資訊的實驗,找出語料中的人地名關聯資料,取樣作人工驗證,取樣結果說明我們的方法能有效的連結特定語法下的人名與地名;為了在未來的研究中,能夠做更深入的研究,嘗試從文本中切割出人物條目,運用地方志已知的特性,配合有限狀態機模型來判斷是否為條目開頭,雖能找出部分開頭,但會有許多遺漏狀況。
    在未來的研究中,試著加入更多類型的標記,並做更完善的標記設計,讓辨識效果能有更多的提升,同時為了抓出更精確的人物資訊,除了嘗試段落切割、斷句之外,將試著做地方志的語法分析,確實的抓出語法結構來做人物與其他實體名詞的連結,自動化去整理出更完善的人物資訊。
    Difangzhi are local gazetteers compiled by local governments in China. Their content is rich and wide-ranging, covering biographical information, geographical information, records of official appointments, and more, and much of this information has not yet been extracted or organized. Because the vocabulary and grammar of the Difangzhi corpus differ substantially from modern Chinese, and most of the text has no punctuation, existing natural language processing tools cannot be applied to it directly. To extract biographical information, we build a model that recognizes named entities in the Difangzhi corpus and use noun lists to assist our annotation method.
    In this study, we use supervised learning to build a sequence labeling model. We first generate the training data. A manually verified list of personal information and several noun lists give us reliable information for annotating words in the Difangzhi corpus. These lists still contain some noise, however, so we clean them in a preprocessing step. Ambiguity then arises when we annotate the corpus, and we propose three annotation methods that resolve it. We use the annotated corpus to generate training data and build conditional random field (CRF) models. In our experiments, the models produced by the three annotation methods predict a label for each character in the test Difangzhi corpus. From the labeled result, we extract person names and place names for evaluation. The precision of person name recognition is over 80%, and the precision of place name recognition is about 86%. The models perform well mainly because the training corpus and the test corpus are quite similar in how they narrate and record information. We then use the labeled result to look for associations between person names and place names: we link them with a simple method and evaluate a manually checked sample of the links. The sample shows that we can link person names and place names correctly under certain syntactic patterns. To support deeper analysis, we also attempt to split the Difangzhi corpus into biographical entries, using a finite state machine model to recognize the beginning of each entry. The method finds some entry beginnings but still misses many.
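The annotation and evaluation steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the noun lists are hypothetical, the BIO label scheme is an assumption, and longest-match is used only as a stand-in for the three disambiguation methods, which this record does not describe. The character labels produced this way are the kind of training data a CRF toolkit would consume.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical noun lists; the thesis's real lists are not given in this record.
NOUN_LISTS: Dict[str, Set[str]] = {
    "PER": {"王安", "王安石"},
    "LOC": {"臨川"},
}


def annotate(text: str, noun_lists: Dict[str, Set[str]]) -> List[str]:
    """Assign one BIO label per character, preferring the longest match.

    Longest-match is an assumed disambiguation rule. If two lists contain
    the same string, the list that comes first in the dict wins.
    """
    labels = ["O"] * len(text)
    i = 0
    while i < len(text):
        best = None  # (length, tag) of the longest name starting at i
        for tag, names in noun_lists.items():
            for name in names:
                if text.startswith(name, i) and (best is None or len(name) > best[0]):
                    best = (len(name), tag)
        if best is None:
            i += 1
            continue
        length, tag = best
        labels[i] = "B-" + tag
        for j in range(i + 1, i + length):
            labels[j] = "I-" + tag
        i += length
    return labels


def extract(text: str, labels: List[str], tag: str) -> List[Tuple[int, str]]:
    """Collect (start offset, surface string) for every span labeled with tag."""
    spans: List[Tuple[int, str]] = []
    start = None
    # The trailing "O" is a sentinel that closes a span ending at the last character.
    for i, label in enumerate(labels + ["O"]):
        if start is not None and label != "I-" + tag:
            spans.append((start, text[start:i]))
            start = None
        if label == "B-" + tag:
            start = i
    return spans


def precision(predicted: List[Tuple[int, str]], gold: List[Tuple[int, str]]) -> float:
    """Fraction of predicted spans that also appear in the gold spans."""
    if not predicted:
        return 0.0
    return len(set(predicted) & set(gold)) / len(set(predicted))


text = "王安石臨川人"
labels = annotate(text, NOUN_LISTS)
persons = extract(text, labels, "PER")
places = extract(text, labels, "LOC")
```

Here "王安" and "王安石" both match at the first character, and the longer name is kept. Precision is computed over exact spans, so a partially correct name counts as an error.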
    In future work, we will add more label types to the annotation of the Difangzhi corpus and refine our disambiguation methods to improve recognition. To obtain more information about the people in the corpus, we will try to split paragraphs and sentences more precisely. We will also analyze the grammar of the corpus to find useful patterns that link person names to other entities, such as place names and official titles, so that information about the people who appear in the corpus can be compiled automatically.
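The abstract mentions a finite state machine that recognizes the beginning of a biographical entry but gives no details. The sketch below shows the general idea under a purely hypothetical pattern: an entry is assumed to open with a person name, followed directly by a place name and the character "人" (as in "X, a native of Y"). The states, the pattern, and the BIO labels are all assumptions made for illustration, not the thesis's actual model.

```python
from typing import List


def find_entry_starts(text: str, labels: List[str]) -> List[int]:
    """Return offsets where the assumed pattern PER + LOC + "人" begins.

    States: "idle" (nothing matched), "name" (inside a person name),
    "place" (inside the place name that follows the person name).
    """
    starts: List[int] = []
    state = "idle"
    candidate = -1  # offset of the person name currently being matched
    for i, (char, label) in enumerate(zip(text, labels)):
        if label == "B-PER":
            # A new person name always restarts the match.
            state, candidate = "name", i
        elif state == "name" and label == "I-PER":
            pass  # still inside the person name
        elif state == "name" and label == "B-LOC":
            state = "place"
        elif state == "place" and label == "I-LOC":
            pass  # still inside the place name
        elif state == "place" and label == "O" and char == "人":
            starts.append(candidate)  # accept: the pattern is complete
            state = "idle"
        else:
            state = "idle"  # any other input abandons the match
    return starts


sample_text = "王安石臨川人也蘇軾眉山人"
sample_labels = [
    "B-PER", "I-PER", "I-PER", "B-LOC", "I-LOC", "O", "O",
    "B-PER", "I-PER", "B-LOC", "I-LOC", "O",
]
entry_starts = find_entry_starts(sample_text, sample_labels)
```

A rigid pattern like this one finds only entries that follow the expected wording, which is consistent with the abstract's observation that many entry beginnings are missed.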
    Reference: [1] 中國歷代人物傳記資料庫,http://projects.iq.harvard.edu/chinesecbdb/home [last visited 2016/7/26] 。
    [2] 地方志介紹,http://baike.baidu.com/view/143397.htm [last visited 2016/6/17]。
    [3] 杜協昌,半自動詞彙擷取:簡化的詞夾子方法以及JavaScript元件開發及應用,第六屆數位典藏與數位人文國際研討會論文集,391-418,2015。
    [4] 金觀濤、邱偉雲、劉昭麟,「共現」詞頻分析及其運用-以「華人」觀念起源為例,第三屆數位典藏與數位人文國際研討會論文集,199-223,2011。
    [5] 異體字介紹,https://zh.wikipedia.org/wiki/異體字 [last visited 2016/6/18]。
    [6] 異體字整理表,http://www.china-language.gov.cn/wenziguifan [last visited 2016/6/18]。
    [7] 張尚斌,詞夾子演算法在專有名詞辨識上的應用──以歷史文件為例,國立台灣大學,碩士論文,2006。
    [8] 陳叔倬、李其原、C. Isett、S. Morgan,18世紀中國常民的身高分布、營養、與福利-初步分析報告,第三屆數位典藏與數位人文國際研討會論文集,83-93,2011。
    [9] 彭維謙、劉士綱、杜協昌、翁稷安、項潔,自動擷取中文典籍中人名之嘗試 ──以PMI斷詞於《資治通鑑》的應用為例,數位人文研究與技藝,國立台灣大學出版中心,139-163,2012。
    [10] 劉吉軒、柯雲娥、張惠真、譚修雯、黃瑞期、甯格致,以文本分析呈現臺灣海外史料政治思想輪廓,第三屆數位典藏與數位人文國際研討會論文集,169-198,2011。
    [11] K. Black, Sampling and Sampling Distributions, Business Statistics for Contemporary Decision Making, 216-241, Wiley, 2009.

    [12] K.-J. Chen and S.-H. Liu, Word Identification for Mandarin Chinese Sentences, Proceedings of International Conference on Computational Linguistics, 101-107, 1992.
    [13] I. S. Dhillon and D. S. Modha, Concept Decompositions for Large Sparse Text Data Using Clustering, Machine Learning, 42(1-2), 143-175, 2001.
    [14] R. Grossma, G. Seni, J. Elder, N. Agawal and H. Liu, Model Complexity, Model Selection and Regularization, Ensemble Methods in Data Mining, Improving Accuracy Through Combining Predictions, 21-38, Morgan and Claypool, 2010.
    [15] R. Grishman and B. Sundheim, Sixth Message Understanding Conference: A Brief History, Proceedings of the 16th Conference on Computational linguistics, 466-471, 1996.
    [16] J. Lafferty, A. McCallum and F. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of the 18th International Conference on Machine Learning, 282-289, 2001.
    [17] C.-L. Liu, G.-T. Jin, Q.-F. Liu, W.-Y. Chiu and Y.-S. Yu, Some Chances and Challenges in Applying Language Technologies to Historical Studies in Chinese, Journal of Computational Linguistics and Chinese Language Processing, 16(2), 27‒46, 2011.
    [18] A. K. McCallum, MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu, 2002.
    [19] W.-H. Pang, S.-P. Chen and H. Cheng, Extracting Posting Data from Chinese Local Monographs, Proceedings of International Conference of Digital Archives and Digital Humanities, 94-116, 2012.
    [20] C. Sutton and A. McCallum, An Introduction to Conditional Random Fields for Relational Learning, Introduction to Statistical Relational Learning, 93-127, MIT Press, 2006.
    [21] X.-G. Wang and M. Inaba, Structures and Evolution of Digital Humanities: An Empirical Research based on Correspondence Analysis and Co-word Analysis, Proceedings of International Conference of Digital Archives and Digital Humanities, 1-16, 2009.
    [22] Y.-H. Wu, J. Zhao, B. Xu and H. Yu, Chinese Named Entity Recognition Based on Multiple Features, Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 427–434, 2005.
    [23] H.-P. Zhang and Q. Liu, Model of Chinese Words Rough Segmentation Based on N-Shortest Paths Method, Journal of Chinese Information Processing, 1-7, 2002.
    [24] H.-P. Zhang, Q. Liu and H.-K. Yu, Chinese Named Entity Recognition Using Role Model, Journal of Computational Linguistics and Chinese Language Processing, 8(2), 29-60, 2003.
    [25] Y. Zhai, Z. Rasheed and M. Shah, Conversation Detection in Feature Films Using Finite State Machines, Proceedings of 17th International Conference on Pattern Recognition, 458-461, 2004.
    Description: Master's thesis
    National Chengchi University
    Department of Computer Science
    102753029
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0102753029
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File          Size      Format
    302901.pdf    2266Kb    Adobe PDF


    All items in 政大典藏 are protected by copyright, with all rights reserved.

