    Please use this identifier to cite or link to this item: http://nccur.lib.nccu.edu.tw/handle/140.119/131479


    Title: 維度縮減於文本風格之應用研究
    A Study of Data Reduction on Text Mining
    Authors: 林志軒
    Lin, Chih-Hsuan
    Contributors: 余清祥
    鄭文惠

    Yue, Ching-Syang
    Cheng, Wen-Huei

    林志軒
    Lin, Chih-Hsuan
    Keywords: Text Mining
    Writing Style
    Data Reduction
    Chi-Square Test
    Cross-Validation
    Date: 2020
    Issue Date: 2020-09-02 11:43:26 (UTC+8)
    Abstract: Writing style is a common topic in text analysis. Whether in personal writing, academic journals, or newspapers and magazines, most texts have a distinctive style of their own, and differences can often be seen in word choice and arrangement. Quantitative analysis of writing style typically relies on classification models that judge which author an article comes from. Because such models usually take in a large number of variables, computation time becomes excessive; some studies therefore suggest applying data reduction methods such as principal component analysis, but these usually cannot give a concrete interpretation of the differences between texts. This study takes the classification of writing style as its goal, screening relevant variables with methods such as the chi-square test and comparing them with linear and nonlinear data reduction methods, in the hope of balancing classification accuracy and substantive interpretation.
    The texts used in this study are all in modern vernacular Chinese, including Taiwanese and Chinese newspapers: the headline news of Apple Daily, Liberty Times, and China Times from 2012 to 2019; the front-page news of People's Daily from 1971-1975 and 1989-1993; and Volumes 7 and 11 of New Youth (1919 and 1926). Each text is first segmented with jieba; variables are then selected with methods such as a frequency-ratio indicator and the chi-square test, compared against variables chosen by linear and nonlinear dimension reduction, and plugged into statistical and machine learning models, with classification accuracy compared via cross-validation. The analysis finds that the proposed chi-square screening method is more stable and yields higher classification accuracy, and that ensemble models such as XGBoost perform best. Moreover, judging text style from the selected words, the vocabulary of Apple Daily, Liberty Times, and China Times leans toward social issues, party politics, and cross-strait relations, respectively; People's Daily leans toward revolutionary topics in the 1970s and economic reform in the 1990s; and Volumes 7 and 11 of New Youth lean toward ideological reform and capitalism, respectively.
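    The chi-square screening described above can be sketched roughly as follows. This is a minimal illustration, not the thesis code: for each candidate word, a 2x2 contingency table is formed over two sources, and words with large chi-square statistics are kept. The word counts below are invented for demonstration.

    ```python
    # Toy sketch of chi-square screening for style-bearing words.
    # For each candidate word, build a 2x2 table over two sources:
    # rows = source A / source B, columns = articles containing /
    # not containing the word.

    def chi_square_2x2(a, b, c, d):
        """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
        n = a + b + c + d
        num = n * (a * d - b * c) ** 2
        den = (a + b) * (c + d) * (a + c) * (b + d)
        return num / den if den else 0.0

    # Invented presence/absence counts for 100 articles from each of two
    # newspapers: (in A, not in A, in B, not in B).
    counts = {
        "政黨": (60, 40, 20, 80),  # much more frequent in paper A
        "天氣": (30, 70, 28, 72),  # roughly equally common in both
    }
    scores = {w: chi_square_2x2(*t) for w, t in counts.items()}
    # Keep words exceeding the 5% critical value (3.84 for 1 df).
    selected = [w for w, s in sorted(scores.items(), key=lambda kv: -kv[1])
                if s > 3.84]
    ```

    In this toy table only the politically loaded word passes the threshold; in the thesis the same idea is applied across the full segmented vocabulary before model fitting.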
    Writing style is a popular research topic in text mining, and experts can often judge the author of an article by checking the use of certain words. In addition to choosing proper words, statistical and machine learning models are also important in the study of writing style. In practice, many variables (e.g., words or phrases) are usually plugged into the models, costing a lot of computation time, so data reduction methods are recommended to speed up the analysis. However, it is difficult to give a reasonable interpretation to the variables after data reduction. In this study, we propose two methods for selecting variables that take into account both the accuracy and the interpretability of classification models.
    The texts used in this study all belong to modern Chinese writing, including the headlines of Apple Daily, Liberty Times, and China Times (2012-2019), articles of People’s Daily (1971-1975 and 1989-1993), and Volumes 7 and 11 of New Youth Magazine (1919 and 1926). We first apply jieba to all articles for word segmentation, followed by the variable selection methods (the proposed methods and linear/nonlinear dimension reduction methods), and finally plug the chosen variables into statistical and machine learning models. The model comparison is based on F1 measures obtained via cross-validation. We found that the proposed variable selection methods and the ensemble methods generally have the best classification performance. As for the interpretation of the selected variables, Apple Daily, Liberty Times, and China Times focused on issues related to social affairs, politics, and cross-strait relations, respectively; People’s Daily emphasized topics related to revolution in the 1970s and economic reform in the 1990s; and New Youth Magazine focused on issues related to ideological reform and capitalism in Volumes 7 and 11, respectively.
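    The abstract states that models are compared by F1 measures under cross-validation. A minimal sketch of the macro-averaged F1 computation is shown below; the labels are invented for illustration, and the thesis' actual folds and models are not reproduced.

    ```python
    # Macro-averaged F1: compute precision/recall/F1 per class from the
    # predictions, then average the per-class F1 scores unweighted.

    def macro_f1(y_true, y_pred):
        labels = sorted(set(y_true) | set(y_pred))
        f1s = []
        for c in labels:
            tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
            fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
            fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
            prec = tp / (tp + fp) if tp + fp else 0.0
            rec = tp / (tp + fn) if tp + fn else 0.0
            f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        return sum(f1s) / len(f1s)

    # Hypothetical held-out fold: true vs. predicted newspaper labels.
    y_true = ["Apple", "Liberty", "China", "Apple", "China", "Liberty"]
    y_pred = ["Apple", "Liberty", "Apple", "Apple", "China", "China"]
    score = macro_f1(y_true, y_pred)
    ```

    In a full cross-validation this score would be computed on each held-out fold and averaged per model, which is how the variable selection methods and classifiers (e.g., XGBoost) are ranked.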
    Reference: I. Chinese-Language References
    1.李竹君(2016)。「再思考新聞價值—以蘋果日報與中時集團的即時新聞為例」, Master's thesis, Graduate Institute of Journalism, National Taiwan University.
    2.宋長熾(2004)。「兩岸報紙對「2003年美伊戰爭」議題報導之研究-以《中國時報》、《聯合報》、《自由時報》、《人民日報》為例」, Master's thesis, Graduate Institute of Journalism, 政治作戰學校 (Political Warfare College).
    3.余清祥、葉昱廷(2020)。「以文字探勘技術分析臺灣四大報文字風格」,《數位典藏與數位人文》, Vol. 6.
    4.陳美瑜(2013)。「中文文本作者辨識研究:以社群網站--臉書為例」, Master's thesis, Department of English, National Taiwan Normal University.
    5.黃于珊(2017)。「文字探勘在總體經濟上之應用-以美國聯準會會議紀錄為例」, Master's thesis, Department of Money and Banking, National Chengchi University.
    6.黃培軒(2017)。「關鍵詞與階層式詞彙文本分群之應用」, Master's thesis, Department of Statistics, National Chengchi University.
    7.鄭開元(2018)。「基於詞頻、位置及類別關係的特徵選擇方法」, Master's thesis, Department of Information Management, Ming Chuan University.

    II. English-Language References
    1.Bishop, C.M. (2006). Pattern Recognition and Machine Learning, Springer, New York.
    2.Boyce, G., Curran, J. and Wingate, P. (Eds.) (1978). Newspaper History from the 17th Century to the Present Day, Acton Society Press Group.
    3.Chuan, H., Zhe, D., Ruifan, L. and Yixin Z. (2008). Dimensionality Reduction for Text Using LLE, Beijing, China.
    4.Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge, Cambridge University Press.
    5.Archer, J. and Jockers, M.L. (2016). The Bestseller Code, New York: St. Martin’s Press.
    6.Jolliffe, I.T. (2002). Principal Component Analysis, 2nd edition, Springer, New York.
    7.Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X. and Chen, E. (2015). “Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective,” Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, AAAI Press: 3650-3656.
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    107354025
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0107354025
    Data Type: thesis
    DOI: 10.6814/NCCU202001336
    Appears in Collections: [Department of Statistics] Theses

    Files in This Item:

    402501.pdf (5,986 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.

