National Chengchi University Institutional Repository (NCCUR): Item 140.119/100571

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/100571

Title:	基於主題模型之社群媒體內容分析探索 Exploring Topic Models for Analyzing the Contents of Social Media
Authors:	廖舒婷 Liao, Shu Ting
Contributors:	陳恭 Chen, Kung 廖舒婷 Liao, Shu Ting
Keywords:	主題分析; 文字探勘; 社群媒體; Topic Models; Text Mining; Social Media
Date:	2016
Issue Date:	2016-08-22 13:40:38 (UTC+8)
Abstract:	The volume of text published online has grown too quickly for traditional content analysis to process and interpret within a reasonable time. To address this, we build a data analysis system centered on unsupervised topic modeling, with a focus on Chinese text. By integrating the Chinese tokenization tool jieba, the system lets researchers process and explore large Chinese document collections quickly. The system also performs an automated quantitative evaluation of each generated model, giving users a quick indication of how well the model works. Because topic model output still requires human interpretation, we additionally visualize the results to help end users understand and interpret the discovered topics.
To evaluate the system, we use six Chinese text data sets collected during Taiwan's Sunflower Movement from different online media sources: Facebook, Twitter, and four major real-time news outlets. The results show that the system can be applied to large volumes of unlabeled Chinese text, reducing manual work and shortening the overall analysis time. Using the topic models, we compare the topics that social media and online news focus on. We observe that the Sunflower Movement drew attention not only from the public and news media in Taiwan; users in Hong Kong, mainland China, and elsewhere also followed it or expressed their opinions through social media. The topic distribution can further serve as an indicator of how popular a topic is. Finally, we evaluate model validity to examine the feasibility and limitations of applying topic models to Chinese text of different kinds. An interesting finding is that the system can act as a data filter: the topic composition of each data set, obtained by assigning documents to topics, can be used to define filters for quickly selecting relevant data sets from a larger collection, which suggests a direction for future work.
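The data-filter idea in the abstract can be illustrated with a minimal sketch. It assumes each document already has a topic distribution from some topic model; the record does not specify how the thesis computes the composition, so the dominant-topic assignment, the function names, the `min_share` threshold, and the sample numbers below are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def dominant_topic(doc_topics):
    """Return the index of the highest-probability topic for one document."""
    return max(range(len(doc_topics)), key=lambda k: doc_topics[k])


def topic_composition(dataset, num_topics):
    """Share of documents in `dataset` whose dominant topic is each topic."""
    counts = Counter(dominant_topic(d) for d in dataset)
    total = len(dataset)
    return [counts.get(k, 0) / total for k in range(num_topics)]


def select_datasets(datasets, num_topics, topic, min_share):
    """Names of data sets where `topic` covers at least `min_share` of documents."""
    return [
        name
        for name, docs in datasets.items()
        # Skip empty data sets to avoid dividing by zero.
        if docs and topic_composition(docs, num_topics)[topic] >= min_share
    ]


# Hypothetical per-document topic distributions over 3 topics (illustrative only).
datasets = {
    "source_a": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.5, 0.4, 0.1]],
    "source_b": [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7]],
}

for name, docs in datasets.items():
    print(name, topic_composition(docs, 3))

# Keep only the data sets in which topic 0 dominates at least half of the documents.
print(select_datasets(datasets, 3, topic=0, min_share=0.5))
```

Assigning each document to its single dominant topic is only one possible choice; averaging the full distributions per data set would be an equally plausible way to compute the composition.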
Description:	Master's thesis; National Chengchi University; In-service Master's Program, Department of Computer Science; 103971002
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0103971002
Data Type:	thesis
Appears in Collections:	[In-service Master's Program, Department of Computer Science] Theses

Files in This Item:

File	Size	Format
100201.pdf	4057Kb	Adobe PDF

All items in NCCUR are protected by copyright, with all rights reserved.

Copyright Announcement

1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for academic research, public education, and other non-commercial use. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will promptly remove the work from the repository and investigate your claim.
