National Chengchi University Institutional Repository (NCCUR): Item 140.119/125516


    Please use this permanent URL to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/125516


    Title: 以文字探勘技術分析台灣四大報文字風格
    A Case Study of Text Mining on Taiwan’s Newspapers
    Author: 葉昱廷 (Ye, Yu-Ting)
    Contributors: 余清祥 (Yue, Ching-Syang)
    鄭文惠 (Cheng, Wen-Huei)
    葉昱廷 (Ye, Yu-Ting)
    Keywords: Writing Style
    Similarity Index
    Taiwan’s Newspaper
    Exploratory Data Analysis
    Social Network
    Date: 2019
    Uploaded: 2019-09-05 15:41:55 (UTC+8)
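    Abstract: Much like an author's writing style, newspapers often differ noticeably in how they report the news even when the topic is the same, because of differences in angle, word choice, and narrative structure; a reader can often tell which outlet an article comes from. This thesis studies newspaper reporting: using text-mining statistical methods such as similarity indices and multivariate analysis, and considering only word frequencies rather than word meanings, it compares the writing styles of Taiwan's four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times) over the period 2012 to 2018. To avoid interference from differences in subject matter, the analysis is based on each newspaper's daily front-page stories. Because of restrictions on data downloads, headline titles cover all four newspapers, but body text is compared only for Apple Daily and Liberty Times.
    Exploratory data analysis and the Jaccard and Yue indices are used to measure similarity and to assess word-usage style across the four newspapers' front-page headlines; the analysis shows that the four newspapers do differ in the words used in their headlines. For front-page headline titles, similarity indices of word usage are first computed for the four newspapers and then visualized as clusters with t-SNE and Generalized Association Plots (GAP). The Jaccard and Yue indices give different views of the grouping: the former tends to place newspapers from the same period in the same group, whereas the latter divides the four newspapers into three groups. The analysis of front-page body text is based on word vectors; the words used by Liberty Times and Apple Daily map to five or six topic areas, with Liberty Times leaning toward political issues and Apple Daily toward social news.
    High-frequency words that Liberty Times and Apple Daily used more than 50 times from 2012 to 2017, and whose usage differed by a factor of two or more between the two papers, serve as classification variables for predicting whether a 2018 front-page article comes from Liberty Times or Apple Daily; machine learning models (e.g., SVM) reach a prediction accuracy of 95.35%. The analysis also finds that Liberty Times favors vocabulary related to political issues, whereas Apple Daily favors the vocabulary of social news, so statistical analysis can indeed distinguish the writing styles of the two newspapers' front-page articles.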
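The word-selection rule described above (words used more than 50 times, with at least a twofold difference between the two papers) can be sketched as follows. This is only an illustration under stated assumptions: the record does not say whether the thresholds apply to raw counts or to relative frequencies, nor how the text was tokenized, so the sketch uses raw counts on pre-tokenized word lists. The function name `select_marker_words`, the labels "A" and "B", and the toy tokens are hypothetical.

```python
from collections import Counter


def select_marker_words(tokens_a, tokens_b, min_count=50, min_ratio=2.0):
    """Pick words that are frequent in one paper and much rarer in the other.

    Assumption (not confirmed by the record): thresholds apply to raw counts.
    Returns a dict mapping each selected word to the paper that uses it more.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    selected = {}
    for word in set(count_a) | set(count_b):
        a, b = count_a[word], count_b[word]  # missing words count as 0
        high, low = max(a, b), min(a, b)
        # Keep the word if it is used often enough and the gap is large enough.
        if high > min_count and high >= min_ratio * low:
            selected[word] = "A" if a > b else "B"
    return selected


# Toy, made-up token lists standing in for two newspapers' training text.
paper_a = ["election"] * 60 + ["police"] * 10 + ["today"] * 55
paper_b = ["police"] * 70 + ["election"] * 20 + ["today"] * 50

markers = select_marker_words(paper_a, paper_b)
print(sorted(markers.items()))
```

In this toy data, "today" is frequent in both papers but fails the twofold-difference test, so it is not selected; words of that kind carry little information about which paper an article came from.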
    Like an author’s writing style, every newspaper has its own viewpoint and narrative methods, and its articles can often be identified simply by reading them. In this study, our goal is to explore the news reporting styles of Taiwan’s four major newspapers (Apple Daily, Liberty Times, United Daily News, and China Times) and compare their differences. We chose headline news for analysis to limit the influence of nuisance factors, such as differences in political positions and target audience. The newspaper headlines considered are between 2012 and 2017. Headline titles could be downloaded for all four newspapers, but headline content is available only for Apple Daily and Liberty Times.
    We first applied methods of Exploratory Data Analysis (EDA), together with similarity measures such as the Jaccard and Yue indices, to word frequencies and word types to evaluate the similarities among the four newspapers. We also considered multivariate tools, including t-SNE (t-distributed Stochastic Neighbor Embedding), GAP (Generalized Association Plots), cluster analysis, and neural networks. We plugged the similarity indices into these multivariate tools to visualize the differences among the newspapers and to classify observations into groups.
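As a rough illustration of the two similarity measures named above, the sketch below computes a Jaccard index on word types and a proportion-based index on word frequencies. It assumes that the "Yue index" refers to the Yue-Clayton similarity measure based on species proportions (Yue and Clayton, 2005), with words playing the role of species; the record itself does not give the formula, so the definition used here, theta = sum(p_i q_i) / (sum(p_i^2) + sum(q_i^2) - sum(p_i q_i)), is an assumption. The function names and the toy token lists are hypothetical.

```python
from collections import Counter


def jaccard_index(tokens_a, tokens_b):
    """Share of word types that the two texts have in common."""
    types_a, types_b = set(tokens_a), set(tokens_b)
    union = types_a | types_b
    return len(types_a & types_b) / len(union) if union else 0.0


def yue_clayton_index(tokens_a, tokens_b):
    """Proportion-based similarity (assumed Yue-Clayton form, see lead-in)."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    vocab = set(count_a) | set(count_b)
    # Relative frequency of every word in each text (0 if the word is absent).
    p = {w: count_a[w] / total_a for w in vocab}
    q = {w: count_b[w] / total_b for w in vocab}
    cross = sum(p[w] * q[w] for w in vocab)
    sq_a = sum(v * v for v in p.values())
    sq_b = sum(v * v for v in q.values())
    return cross / (sq_a + sq_b - cross)


# Toy, made-up headline tokens for two newspapers.
text_a = ["tax", "tax", "vote", "rain"]
text_b = ["tax", "vote", "vote", "crime"]

jaccard_ab = jaccard_index(text_a, text_b)
yue_ab = yue_clayton_index(text_a, text_b)
print(jaccard_ab, yue_ab)
```

The Jaccard index looks only at which words appear, while the proportion-based index also weighs how often each word is used, which is one plausible reason the two measures can lead to different groupings.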
    For both headline titles and contents, the results show significant differences in word usage among the four newspapers. However, the grouping results for titles and contents based on similarity indices are quite different. For the headline titles, the Jaccard indices grouped titles by time, whereas the Yue indices grouped titles by media outlet (i.e., 3 groups). For the headline contents, the words used in Apple Daily and Liberty Times can be classified into five or six classes of topics, with Liberty Times emphasizing political terms and Apple Daily focusing on social affairs and crime problems. We also applied machine learning methods to distinguish headline articles of Apple Daily and Liberty Times via cross-validation, treating the 2012-2017 data as the training set and the 2018 data as the testing set. A Support Vector Machine (SVM) achieved 95.35% prediction accuracy with 3,316 variables.
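The evaluation design described above (train on earlier years, test on 2018, report accuracy) can be sketched as below. To keep the example dependency-free, it uses a simple word-vote classifier as a stand-in; this is not the SVM reported in the abstract, and it says nothing about how the thesis built its 3,316 variables. The article records, field names, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def split_by_year(articles, test_year):
    """Hold out one year for testing and train on all earlier years."""
    train = [a for a in articles if a["year"] < test_year]
    test = [a for a in articles if a["year"] == test_year]
    return train, test


def train_word_votes(train):
    """Stand-in model: each word votes for the paper that used it most."""
    counts = defaultdict(Counter)
    for article in train:
        for word in article["tokens"]:
            counts[word][article["paper"]] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def predict(votes, tokens, default):
    """Predict the paper with the most word votes; fall back to default."""
    tally = Counter(votes[w] for w in tokens if w in votes)
    return tally.most_common(1)[0][0] if tally else default


def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy, made-up articles; "A" and "B" stand for two newspapers.
articles = [
    {"year": 2016, "paper": "A", "tokens": ["election", "party", "vote"]},
    {"year": 2017, "paper": "A", "tokens": ["party", "budget"]},
    {"year": 2016, "paper": "B", "tokens": ["police", "crash", "fire"]},
    {"year": 2017, "paper": "B", "tokens": ["police", "fire"]},
    {"year": 2018, "paper": "A", "tokens": ["vote", "budget", "fire"]},
    {"year": 2018, "paper": "B", "tokens": ["crash", "police", "party"]},
]

train, test = split_by_year(articles, test_year=2018)
votes = train_word_votes(train)
predicted = [predict(votes, a["tokens"], default="A") for a in test]
score = accuracy([a["paper"] for a in test], predicted)
print(predicted, score)
```

Splitting by year rather than at random keeps every test article strictly later than the training articles, so the score reflects how well word usage learned from past front pages carries over to a new year.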
    Description: Master's
    National Chengchi University
    Department of Statistics
    106354021
    Source: http://thesis.lib.nccu.edu.tw/record/#G0106354021
    Type: thesis
    DOI: 10.6814/NCCU201900992
    Appears in Collections: [Department of Statistics] Theses

    Files in this item:

    File          Size       Format      Views
    402101.pdf    10425Kb    Adobe PDF   20


    All items in NCCUR are protected by original copyright.

