    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/160063


    Title: 利用多模態輸入於大型語言模型以理解使用者與機器人溝通之意圖
    Leveraging Flexible and Imprecise Multimodal Input for LLMs to Understand Users’ Intentions for HRI
    Authors: 劉彥廷
    Liu, Yen-Ting
    Contributors: 蔡欣叡
    Tsai, Hsin-Ruey
    劉彥廷
    Liu, Yen-Ting
    Keywords: 人機互動
    擴增實境
    大型語言模型
    多模態輸入
    人工智慧
    歧義消解
    注視輸入
    語音指令
    指向操作
    隱性指令
    Human-Robot Interaction
    Extended Reality
    Large Language Models
    Multimodal Input
    Artificial Intelligence
    Disambiguation
    Gaze Input
    Voice Commands
    Pointing
    Implicit Commands
    Date: 2025
    Issue Date: 2025-11-03 14:35:34 (UTC+8)
Abstract: Multimodal user input has been widely studied to improve the precision of human-robot interaction (HRI). However, existing systems have not focused on letting users communicate with a robot the way they would with another person or a close friend, which involves both the flexibility to combine multimodal inputs in various ways and a tolerance for imprecise input data, especially for implicit commands. For example, when a user says “I want that” while casually glancing at a bottle of water, this is already sufficient in human-to-human communication as an implicit voice command, typically accompanied by gaze, gestures, and/or body language rather than an explicit statement of what to do. To address this, we propose a system that understands user intentions from multimodal input, including voice, gaze, and finger-pointing, combined with large language models (LLMs), allowing users to interact with a robot in a flexible and imprecise manner. The system uses the disambiguation capability of LLMs to filter out irrelevant input modalities and imprecise data, producing a set of possible instructions for the user to confirm. By supporting flexible input and tolerating imprecision, the system interprets implicit commands more effectively, reduces the time, effort, and attention required of users, and could further be developed into a non-voice input method. We conducted a user behavior study in a simulated indoor environment to observe how users naturally and flexibly communicate with a robot through multimodal input, and from it derived angle range parameters for gaze and finger-pointing. We then evaluated the system's performance in an extended reality (XR) environment and compared it with other methods. We also deployed it on a physical robot to demonstrate its potential for real-world applications.
    In natural human-to-human communication, multimodal user input is typically used to supplement explicit and complement implicit voice commands, with casualness allowing for flexible input modality combinations and tolerance for imprecise input data.
For example, saying “I want that.” with a casual glance at a bottle of water is clear enough in human-to-human communication as an implicit voice command accompanied by gaze and/or gestures, rather than an explicit one.
    To enable such a human-like interaction in human-robot interaction (HRI), we propose a system, IntenBot, to understand user intentions from flexible and imprecise multimodal input, including voice, gaze, and finger-pointing, in XR.
    The disambiguation capability of large language models (LLMs) is used to filter out irrelevant input modalities and imprecise input data, generating potential instructions for user confirmation.
    The flexible and imprecise multimodal input enables casual, human-like interaction with robots, reducing time, effort, and attention, and could also be used as non-voice input.
    We conducted an informative user behavior study in a simulated environment to understand users' natural behavior in flexibly interacting with a robot using multimodal input and to obtain appropriate angle range parameters for gaze and finger-pointing.
    An XR study was then performed to evaluate the performance of IntenBot, compared with other methods.
    We also deployed IntenBot on a physical robot to showcase its real-world applications.
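
The abstract above says the system feeds voice, gaze, and finger-pointing input to an LLM, uses the LLM's disambiguation capability to discard irrelevant modalities and imprecise data, and returns candidate instructions for the user to confirm. The record does not include the actual prompts, object representation, model, or thresholds, so the following is only a minimal Python sketch of that pipeline; the Candidate record, the prompt wording, the angle limits, and the call_llm hook are all assumptions for illustration.

```python
# Minimal sketch, not the thesis's implementation: the Candidate record,
# prompt wording, angle limits, and call_llm hook are assumptions.
import json
from dataclasses import dataclass


@dataclass
class Candidate:
    """An object the user might mean, with per-modality angular evidence."""
    name: str                # e.g. "water bottle"
    gaze_offset_deg: float   # angle between the gaze ray and this object
    point_offset_deg: float  # angle between the pointing ray and this object


def build_prompt(utterance: str, candidates: list[Candidate],
                 gaze_limit_deg: float, point_limit_deg: float) -> str:
    """Keep only objects inside the gaze/pointing angle ranges, then ask the
    LLM to resolve the (possibly implicit) utterance into explicit options."""
    kept = [c for c in candidates
            if c.gaze_offset_deg <= gaze_limit_deg
            or c.point_offset_deg <= point_limit_deg]
    listing = "\n".join(
        f"- {c.name} (gaze offset {c.gaze_offset_deg:.0f} deg, "
        f"pointing offset {c.point_offset_deg:.0f} deg)" for c in kept)
    return (
        f'The user said: "{utterance}"\n'
        f"Objects consistent with the user's gaze or finger-pointing:\n{listing}\n"
        "Ignore any modality that seems irrelevant or too imprecise. "
        "Return a JSON array of at most 3 explicit robot instructions the user "
        "most likely intended, ordered from most to least likely."
    )


def call_llm(prompt: str) -> str:
    """Placeholder: plug in any chat-completion client and return its text reply."""
    raise NotImplementedError


def candidate_instructions(utterance: str, candidates: list[Candidate],
                           gaze_limit_deg: float = 10.0,    # placeholder limits,
                           point_limit_deg: float = 25.0):  # not the studied values
    prompt = build_prompt(utterance, candidates, gaze_limit_deg, point_limit_deg)
    return json.loads(call_llm(prompt))  # e.g. ["Bring me the water bottle", ...]
```

In use, the returned list would be shown to the user for confirmation before the robot acts, matching the confirmation step the abstract describes.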
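The behavior study is said to yield angle range parameters for gaze and finger-pointing. A common way to apply such parameters is a cone test around each ray; the sketch below assumes that interpretation and uses numpy with a placeholder threshold rather than the values measured in the study.

```python
# Illustrative cone test for gaze / finger-pointing rays; the 15-degree limit
# below is a placeholder, not a value measured in the thesis's behavior study.
import numpy as np


def angular_offset_deg(origin, direction, target) -> float:
    """Angle (degrees) between a ray and the vector from its origin to a target."""
    d = np.asarray(direction, dtype=float)
    v = np.asarray(target, dtype=float) - np.asarray(origin, dtype=float)
    cos = np.dot(d, v) / (np.linalg.norm(d) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def within_range(origin, direction, target, limit_deg: float) -> bool:
    """True if the target lies inside the cone of half-angle limit_deg around the ray."""
    return angular_offset_deg(origin, direction, target) <= limit_deg


# Example: an eye at 1.6 m height gazing straight ahead, with a bottle
# two metres away and slightly off to the side.
eye, gaze_dir = [0.0, 1.6, 0.0], [0.0, 0.0, 1.0]
bottle = [0.3, 1.4, 2.0]
print(within_range(eye, gaze_dir, bottle, limit_deg=15.0))  # True (offset ~10 deg)
```

Objects passing either cone test would be the candidates handed to the LLM in the sketch above.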
Description: Master's thesis
National Chengchi University
Department of Computer Science
    112753102
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0112753102
    Data Type: thesis
Appears in Collections: [Department of Computer Science] Theses

    Files in This Item:

File: 310201.pdf (8344 KB, Adobe PDF)


    All items in 政大典藏 are protected by copyright, with all rights reserved.


