National Chengchi University Institutional Repository (NCCUR): Item 140.119/147032
    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/147032


    Title: 漢字古文書光學字元辨識之文本閱讀順序偵測研究
    Reading Order Detection in Optical Character Recognition for Historical Chinese Documents
    Authors: 馬行遠
    Ma, Hsing-Yuan
    Contributors: 劉昭麟
    黃瀚萱

    Liu, Chao-Lin
    Huang, Hen-Hsen

    馬行遠
    Ma, Hsing-Yuan
    Keywords: 閱讀順序
    排序學習
    多模態模型
    古籍文本處理
    Reading Order Detection
    Pairwise Learning-to-Rank
    Multimodal Representation
    Archival Document Processing
    Date: 2023
    Issue Date: 2023-09-01 15:24:26 (UTC+8)
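    Abstract: Research and development in optical character recognition (OCR) and document layout analysis (DLA) have accumulated years of experience, yet reading order detection remains an open problem. Reading order detection plays a vital role in preserving the original structure of a document and in correction after text detection. At present, most reading order detection tools rely mainly on rule-based algorithms. For modern documents with simple structure, regular arrangement, and even spacing, these methods do achieve good results. However, when faced with the complex layouts and uneven edges of handwritten or historical texts, existing methods clearly fall short. We therefore urgently need a strategy that can accurately detect reading order in Chinese historical books with complex layouts.
    Building on current mainstream OCR frameworks, this study proposes a model dedicated to reading order detection. The model emphasizes simulating the human reading process and treats image cues as the key cues for determining reading order. It introduces an original multimodal reading order detection method that simplifies the processing pipeline of the reading order task, and it is validated on the MTHv2 dataset of Chinese historical books. Experimental results show that, compared with previous methods, our model reduces the page error rate by 25%. It also performs well when training data are limited and text detection information is insufficient, demonstrating the robustness and practical value of this work.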
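The rule-based ordering that the abstract describes as the common baseline can be sketched as follows. This is only an illustration under stated assumptions, not the method of any particular tool: the box format `(x, y, width, height)`, the function name `rule_based_order`, the `col_tol` parameter, and the right-to-left, top-to-bottom vertical layout are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical box format: (x, y, width, height), with y growing downward.
Box = Tuple[float, float, float, float]


def rule_based_order(boxes: List[Box], col_tol: float = 0.5) -> List[int]:
    """Return box indices in an assumed right-to-left, top-to-bottom order."""
    # Visit boxes from right to left by horizontal centre.
    by_x = sorted(range(len(boxes)),
                  key=lambda i: -(boxes[i][0] + boxes[i][2] / 2))
    columns: List[List[int]] = []
    for i in by_x:
        cx = boxes[i][0] + boxes[i][2] / 2
        if columns:
            ref = boxes[columns[-1][0]]
            ref_cx = ref[0] + ref[2] / 2
            # Same column if the centres differ by at most col_tol * ref width.
            if abs(cx - ref_cx) <= col_tol * ref[2]:
                columns[-1].append(i)
                continue
        columns.append([i])
    order: List[int] = []
    for col in columns:
        # Within a column, read from top to bottom.
        order.extend(sorted(col, key=lambda i: boxes[i][1]))
    return order


boxes = [(10, 0, 10, 50), (30, 60, 10, 40), (30, 0, 10, 50), (10, 60, 10, 40)]
print(rule_based_order(boxes))
```

A fixed tolerance like `col_tol` works when columns are straight and evenly spaced. On skewed or curved columns the centre test can split one column or merge two, which is the kind of failure the abstract attributes to rule-based methods.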
    Optical character recognition (OCR) and document layout analysis (DLA) have been under development for many years.
    Reading order detection (ROD), however, remains an open problem.
    ROD plays an important role in preserving the original structure of a document as well as in post-OCR correction.
    Most modern ROD tools rely on rule-based algorithms to place detected text coordinates in order.
    These approaches can work well for simple, modern documents, which are well aligned and evenly spaced.
    However, current methods are inadequate for the complex layouts and curved layout edges found in handwritten or historical documents.
    In this thesis, we propose a multimodal approach to ROD that formulates the task as pairwise learning-to-rank.
    We evaluate our approach on the MTHv2 dataset.
    Experimental results indicate that, compared to previous methods, our model reduces the page error rate by 25%. It also performs well with limited training data and insufficient text detection information, demonstrating the robustness and practical value of this research.
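To make the pairwise learning-to-rank formulation concrete, here is a minimal sketch of one simple way to turn pairwise "A is read before B" predictions into a full reading order. It is an illustration under assumptions, not the thesis's model: `order_from_pairwise` and `toy_prob_before` are hypothetical names, the decoding rule (sum of pairwise scores) is just one common option, and the toy scorer uses geometry only, whereas the thesis describes a multimodal model.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) of a text region's top-left


def order_from_pairwise(items: Sequence[Point],
                        prob_before: Callable[[Point, Point], float]) -> List[int]:
    """Order items by how strongly each is predicted to precede the others."""
    n = len(items)
    scores = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                # Accumulate the predicted probability that item i precedes j.
                scores[i] += prob_before(items[i], items[j])
    # A higher total "precedes" score means earlier in the reading order.
    return sorted(range(n), key=lambda i: -scores[i])


def toy_prob_before(a: Point, b: Point) -> float:
    """Hypothetical stand-in for a trained pairwise model (geometry only)."""
    ax, ay = a
    bx, by = b
    if ax != bx:
        return 1.0 if ax > bx else 0.0  # region further right is read first
    return 1.0 if ay < by else 0.0      # same column: upper region first


regions = [(10, 0), (30, 60), (30, 0), (10, 60)]
print(order_from_pairwise(regions, toy_prob_before))
```

In practice the scorer would be a learned classifier over pairs of regions. The decoding step stays the same, which is one reason the pairwise formulation keeps the overall pipeline simple.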
    Description: Master's degree
    National Chengchi University
    Department of Computer Science
    110753132
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0110753132
    Data Type: thesis
    Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

    File: 313201.pdf
    Size: 17913 Kb
    Format: Adobe PDF


    All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.



    Copyright Announcement
    1. The digital content of this website is part of the National Chengchi University Institutional Repository. It is provided free of charge for academic research, public education, and other non-commercial use. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

    2. The NCCU Institutional Repository has made every effort to protect the interests of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will remove the work from the repository and investigate your claim.