政大機構典藏-National Chengchi University Institutional Repository(NCCUR):Item 140.119/112378

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/112378

Title:	運用光學字元辨識技術建置數位典藏全文資料庫之評估：以明人文集為例 The Analysis of Use Optical Character Recognition to Establish the Full-text Retrieval Database：A Case Study of the Anthology of Chinese Literature in Ming
Authors:	蔡瀚緯 Tsai, Han Wei
Contributors:	林巧敏 Lin, Chiao Min 蔡瀚緯 Tsai, Han Wei
Keywords:	數位典藏; 光學字元辨識; 全文資料庫; 明人文集; Digital archives; Optical character recognition; Full-text database; Anthology of Chinese Literature in Ming
Date:	2017
Issue Date:	2017-08-31 12:11:53 (UTC+8)
Abstract:	Digital archives preserve collection objects as digital images and make them available online for users to browse, which supports both circulation and preservation. In an era of information overload, however, metadata description alone cannot help users reach the content of the items. Only full-text retrieval lets users quickly find the information they need, and optical character recognition (OCR) can help produce that full text. This study runs OCR software on ancient books of the Ming dynasty to examine how page format and image quality affect recognition results. It also uses in-depth interviews with institutional staff who have taken part in full-text digital archive projects to explore institutional and individual views and considerations. The recognition tests show that both page format and image quality affect OCR results. The interviews indicate that current technology has overcome the limitations posed by the formats of ancient books but still demands high image quality, which means image quality is the factor that most affects recognition. Although OCR is now mature enough to assist in building full-text databases, most institutions have not yet applied it to full-text conversion of their digital archives because of unfamiliarity with the technology, budget constraints, limited staffing, and other factors. The study makes three recommendations for institutions interested in such a project. First, develop a workflow suited to the institution and understand the condition of the objects to be processed, so that a suitable input method can be chosen. Second, communicate closely with technology vendors to determine whether processing the selected objects is cost-effective. Third, taking the needs of both the holding institution and its users into account, cooperate with OCR vendors so that users select the objects they need for OCR, proofread the output, and return the full text to the institution. This approach reveals what users need and reduces the institution's proofreading costs.
Description:	Master's thesis; National Chengchi University; Graduate Institute of Library, Information and Archival Studies; 104155017
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0104155017
Data Type:	thesis
Appears in Collections:	[Graduate Institute of Library, Information and Archival Studies] Theses

Files in This Item:

File	Size	Format
501701.pdf	3494Kb	Adobe PDF
