    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/159209


    Title: 自動化的虛擬音樂家動畫: 從音樂訊號生成富有表現力的小提琴演奏系統
    Automatic Virtual Musician Animation: A System for Generating Expressive Violin Performance from Music Signals
    Authors: 林鼎崴
    Lin, Ting-Wei
    Contributors: 蘇黎
    劉昭麟

    Su, Li
    Liu, Chao-Lin

    林鼎崴
    Lin, Ting-Wei
    Keywords: 生成式AI
    多模態生成
    虛擬音樂家
音樂表演生成
    表現力音樂表演
    Generative AI
    Cross-modal generation
    VTuber
    Virtual musician
    Music-to-Performance generation
    Expressive music performance
    Date: 2025
    Issue Date: 2025-09-01 15:47:58 (UTC+8)
    Abstract: 基於深度生成模型的無動作捕捉(MOCAP-free)音樂到表演生成技術,已成為下一代動畫技術的有力解決方案,可在不依賴動作捕捉的情況下創建音樂表演動畫。然而,構建此類系統面臨重大挑戰,特別是在整合多個分別負責不同面向角色控制的獨立模型時,例如用於情緒表達的臉部表情生成與用於樂器演奏的指法生成。此外,現有方法大多僅關注人類表演者本身的生成,忽視了人與樂器之間的互動在實現富有表現力且真實的音樂表演中的關鍵作用。
    為了解決這些問題,本論文提出了一個用於生成富有表現力的虛擬小提琴表演的完整系統。該系統將五個關鍵模組:表現力音樂合成、臉部表情生成、指法生成、身體動作生成與影片鏡頭生成,整合到一個統一的框架中。透過消除對動作捕捉技術的依賴,並明確建模人與樂器之間的互動,本研究推進了無動作捕捉的內容到表演生成研究。本論文的實驗包括定量分析與使用者研究,證明該系統能夠生成逼真、富有表現力且同步的虛擬表演動畫,為VTubing與虛擬音樂會等互動應用的品質提升鋪路。
    Motion-capture (MOCAP)-free music-to-performance generation using deep generative models has emerged as a promising solution for the next generation of animation technologies, enabling the creation of animated musical performances without relying on motion capture. However, building such systems presents substantial challenges, particularly in integrating multiple independent models responsible for different aspects of avatar control, such as facial expression generation for emotive dynamics and fingering generation for instrumental articulation. Moreover, most existing approaches primarily focus on human-only performance generation, overlooking the critical role of human-instrument interactions in achieving expressive and realistic musical performances.
    To address these limitations, this dissertation proposes a comprehensive system for generating expressive virtual violin performances. The system integrates five key modules—expressive music synthesis, facial expression generation, fingering generation, body movement generation, and video shot generation—into a unified framework. By eliminating the need for MOCAP and explicitly modeling human-instrument interactions, this work advances the field of MOCAP-free content-to-performance generation. Extensive experiments, including quantitative analyses and user studies, demonstrate the system's ability to produce realistic, expressive, and synchronized virtual performances, paving the way for interactive applications such as VTubing and virtual concerts.
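    For readers who want a concrete picture of the architecture described above, the sketch below shows one way such a five-module pipeline could be wired together in Python. It is an illustrative assumption only: every function is a placeholder stub standing in for one of the learned modules named in the abstract, and none of the names, signatures, or output shapes come from the dissertation itself.

```python
# Minimal, hypothetical sketch of a five-module music-to-performance pipeline.
# All names and shapes are illustrative assumptions; each stub stands in for a
# learned model (expressive synthesis, face, fingering, body motion, shots).
from dataclasses import dataclass
from typing import List

import numpy as np

SR = 22050  # audio sample rate assumed by the placeholder modules


@dataclass
class PerformanceAnimation:
    audio: np.ndarray            # expressive audio waveform, shape (samples,)
    face_landmarks: np.ndarray   # facial keypoints, shape (frames, 68, 2)
    fingering: List[int]         # one string/finger label per analysis frame
    body_pose: np.ndarray        # skeletal joints, shape (frames, 17, 3)
    shots: List[str]             # one camera-shot label per second of video


def expressive_synthesis(music: np.ndarray) -> np.ndarray:
    # Placeholder: a real module would re-render the input with expressive timing/dynamics.
    return music.astype(np.float32)


def facial_expression_model(audio: np.ndarray, fps: int) -> np.ndarray:
    frames = int(len(audio) / SR * fps)
    return np.zeros((frames, 68, 2), dtype=np.float32)  # neutral face per frame


def fingering_model(audio: np.ndarray, fps: int) -> List[int]:
    frames = int(len(audio) / SR * fps)
    return [0] * frames  # open-string placeholder for every frame


def body_movement_model(audio: np.ndarray, fps: int) -> np.ndarray:
    frames = int(len(audio) / SR * fps)
    return np.zeros((frames, 17, 3), dtype=np.float32)  # static pose per frame


def shot_generation_model(audio: np.ndarray) -> List[str]:
    seconds = int(len(audio) / SR)
    return ["medium-shot"] * seconds  # constant shot choice as a stand-in


def generate_performance(music: np.ndarray, fps: int = 30) -> PerformanceAnimation:
    """Drive all modules from the same audio so the output streams stay synchronized."""
    audio = expressive_synthesis(music)
    return PerformanceAnimation(
        audio=audio,
        face_landmarks=facial_expression_model(audio, fps),
        fingering=fingering_model(audio, fps),
        body_pose=body_movement_model(audio, fps),
        shots=shot_generation_model(audio),
    )


if __name__ == "__main__":
    demo = generate_performance(np.random.randn(SR * 5))  # 5 seconds of noise as dummy input
    print(demo.face_landmarks.shape, demo.body_pose.shape, len(demo.shots))
```

    The point of the sketch is simply that all downstream generators are conditioned on the same audio stream, which is what keeps the facial, fingering, body, and camera outputs temporally aligned in a unified framework.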
    Description: Ph.D. dissertation
    National Chengchi University
    International Doctoral Program in Social Networks and Human-Centered Computing (TIGP)
    106761502
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0106761502
    Data Type: thesis
    Appears in Collections: [International Doctoral Program in Social Networks and Human-Centered Computing (TIGP)] Theses

    Files in This Item:

    File: 150201.pdf (13,396 KB, Adobe PDF)


    All items in 政大典藏 (NCCU Institutional Repository) are protected by copyright, with all rights reserved.

