Please use this identifier to cite or link to this item:
https://nccur.lib.nccu.edu.tw/handle/140.119/159040
Title: | 以統計方法辨識碩士及ChatGPT生成之經濟學類論文摘要 A Statistical Approach to Identifying Master's and ChatGPT-Generated Abstracts in the Field of Economics |
Authors: | 張祐瑜 Chang, You-yu |
Contributors: | 余清祥 楊曉文 Yu, Qing-Xiang Yang, Xiao-Wen 張祐瑜 Chang, You-yu |
Keywords: | 文字分析 探索性資料分析 ChatGPT 大型語言模型 寫作風格 Text analysis Exploratory Data Analysis (EDA) ChatGPT Large Language Models Writing Style |
Date: | 2025 |
Issue Date: | 2025-09-01 14:49:47 (UTC+8) |
Abstract: | 文字紀錄使後世得以窺見各時期的文化發展、社會縮影及技術演變。其中,摘要通常可視為一篇文章或書籍的濃縮,讀者可迅速掌握全文主軸及重點,其風格與架構有別於一般文章。由於電腦科技快速發展進步,大型語言模型等為生活增加便利及創造發展可能,但同時也帶來潛在隱患,近年屢傳獲獎文章及創作依賴ChatGPT等AI工具,此舉不僅引發公平性的討論,也顛覆傳統對於教育、研究、創新的作法及定位。ChatGPT於民國111年推出,不久後便廣為人知,故本文以民國107~109年臺灣經濟學門碩士論文摘要為研究對象,彼時生成式AI尚未蔚為風氣。同時藉由ChatGPT生成等量摘要,透過探索性資料分析及統計方法比較兩種文本的寫作風格,進而判別論文真偽。
研究整理兩文本字、詞、句長及多樣性等基本統計量,結合常見字詞、模糊性詞彙作為解釋變數,以統計模型(羅吉斯迴歸)篩選出具顯著性特徵,並與機器學習模型比較。結果顯示使用ChatGPT生成的摘要傾向使用短句建構文本,碩士生文章則為長短句交錯;生成文本使用「提升」、「提供」、「建議」及「此外」等詞彙的比例較高,碩士生摘要運用虛字「之」頻率較高。以上述探索性資料分析挑選之解釋變數,利用羅吉斯迴歸、隨機森林等機器學習模型辨別論文摘要真偽,其分類準確率皆有不錯的效果。不過,本文方法可透過較少變數及計算量即可達到類似效果,並能提供讀者區隔寫作特色的主要差異。
Written records provide future generations with insights into cultural developments, societal snapshots, and technological evolution across historical periods. Among these, abstracts serve as condensed versions of articles or books, enabling readers to quickly grasp the main ideas and essential points; their style and structure differ significantly from regular prose. With rapid advancements in computer technology, large language models have brought convenience and developmental opportunities into daily life. However, they have also introduced potential concerns. In recent years, numerous award-winning articles and creative works have reportedly relied heavily on AI tools like ChatGPT, raising debates about fairness and fundamentally transforming traditional educational, research, and innovation practices. ChatGPT, released in 2022, quickly gained widespread attention. Hence, this study focuses on the abstracts of Taiwanese master's theses in economics from 2018 to 2020, a period before generative AI became prevalent. Equivalent volumes of abstracts were generated using ChatGPT for comparative analysis of writing styles through exploratory data analysis (EDA) and statistical methods to distinguish authenticity.
This research compiles fundamental textual statistics, including character counts, word counts, sentence length, and lexical diversity, alongside frequently occurring and ambiguous words, as explanatory variables. A statistical model, logistic regression, was used to identify significant features, and the results were compared with machine learning models. Findings indicate that ChatGPT-generated abstracts tend to be built from shorter sentences, whereas master's students' abstracts mix long and short sentences. Generated texts exhibit a higher frequency of words such as "enhance," "provide," "suggest," and "furthermore," while master's students more frequently use function words like "of" (之). Using explanatory variables selected through EDA, logistic regression and machine learning models such as random forest classified the authenticity of the abstracts with high accuracy. Notably, the methods employed in this study achieved similar classification accuracy with fewer variables and less computational effort, while clearly highlighting the main stylistic differences for readers. |
Description: | Master's thesis, Department of Statistics, National Chengchi University, 112354021
Source URI: | http://thesis.lib.nccu.edu.tw/record/#G0112354021 |
Data Type: | thesis |
Appears in Collections: | [Department of Statistics] Theses
Files in This Item:
File | Description | Size | Format
402101.pdf | | 2689Kb | Adobe PDF
All items in 政大典藏 are protected by copyright, with all rights reserved.