    Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/146305


    Title: The supervised approach for converting tabular data into images for CNN-based deep learning prediction
    Authors: Tu, Yu-Shan
    Contributors: Wu, Han-Ming
    Tu, Yu-Shan
    Keywords: Within and between analysis
    Supervised distance matrix
    Convolutional neural network
    Date: 2023
    Issue Date: 2023-08-02 13:03:47 (UTC+8)
    Abstract: For classification problems on tabular data, traditional machine learning methods such as decision trees, random forests, and support vector machines usually require feature extraction and preprocessing. Recent research has instead proposed converting tabular data into images and then training convolutional neural network (CNN) models on the converted images; this not only removes the preprocessing steps above but can also yield better predictions. Among these methods, the Image Generator for Tabular Data (IGTD) assigns each feature (variable) of the tabular data to a unique pixel position by minimizing the difference between the feature distance matrix and the pixel-position distance matrix of the target image, thereby generating one image per sample; pixel intensity reflects the value of the corresponding feature in that sample. IGTD requires no domain knowledge of the data and provides a good feature neighborhood structure. Building on these characteristics, this study introduces supervised distance calculation, incorporating class-label information into the image generation process to improve classification accuracy. First, using the class labels, we apply Within and Between Analysis (WABA) to compute several correlation coefficients between features and their corresponding distances. We then use the images generated from these different correlation coefficients for data augmentation, increasing the sample size and addressing the problem that the number of samples is far smaller than the number of features. We also examine different formulas for converting correlation coefficients into distances, to assess their effect on the generated images and on the CNN results. Applied to several real gene expression datasets, the proposed method outperforms IGTD: it significantly improves the prediction accuracy of the CNN model and broadens the applicability of CNNs to tabular data.
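
    To make the image-generation step concrete, here is a minimal Python sketch of the IGTD idea as described above: each feature gets a unique pixel, and random pairwise swaps of pixel assignments are accepted whenever they reduce the squared difference between the feature distance matrix and the correspondingly permuted pixel-position distance matrix. The function names, the random-swap schedule, and the plain squared-error objective are assumptions made for this sketch; the published IGTD algorithm uses ranked distance matrices and a more structured swap procedure.

        # Illustrative sketch only, not the thesis's or IGTD's reference code.
        import numpy as np

        def pixel_distance_matrix(nrow, ncol):
            # Euclidean distances between all pixel (row, col) coordinates.
            coords = np.array([(i, j) for i in range(nrow) for j in range(ncol)],
                              dtype=float)
            diff = coords[:, None, :] - coords[None, :, :]
            return np.sqrt((diff ** 2).sum(axis=-1))

        def igtd_like_assignment(feat_dist, nrow, ncol, n_iter=2000, seed=0):
            # feat_dist: (p, p) feature distance matrix, with p == nrow * ncol.
            # Returns perm, where perm[k] is the pixel index of feature k.
            rng = np.random.default_rng(seed)
            pix_dist = pixel_distance_matrix(nrow, ncol)
            perm = rng.permutation(feat_dist.shape[0])

            def error(p):
                # Squared discrepancy between feature distances and the
                # distances of their assigned pixels.
                return ((feat_dist - pix_dist[np.ix_(p, p)]) ** 2).sum()

            err = error(perm)
            for _ in range(n_iter):
                a, b = rng.choice(perm.size, size=2, replace=False)
                perm[[a, b]] = perm[[b, a]]        # tentatively swap two features
                new_err = error(perm)
                if new_err < err:
                    err = new_err                  # keep the improving swap
                else:
                    perm[[a, b]] = perm[[b, a]]    # revert
            return perm

        def sample_to_image(x, perm, nrow, ncol):
            # Pixel intensity = value of the feature assigned to that pixel.
            img = np.empty(nrow * ncol)
            img[perm] = np.asarray(x, dtype=float)
            return img.reshape(nrow, ncol)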
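
    The supervised ingredient is the WABA decomposition: using the class labels, each feature is split into a between-group part (its class means) and a within-group part (deviations from those means), and the two parts are correlated separately, giving label-aware correlations between features. The sketch below shows this decomposition together with one common correlation-to-distance conversion, d = sqrt(2(1 - r)); this record does not specify exactly which WABA correlations and conversion formulas the thesis compares, so the names and the particular formula are assumptions.

        # Illustrative sketch of a WABA-style decomposition; details may
        # differ from the thesis's supervised distances.
        import numpy as np

        def waba_correlations(x, y, groups):
            # Split x and y into between-group (class-mean) and within-group
            # (deviation-from-class-mean) scores, then correlate each part.
            x = np.asarray(x, dtype=float)
            y = np.asarray(y, dtype=float)
            groups = np.asarray(groups)
            bx, by = np.empty_like(x), np.empty_like(y)
            for g in np.unique(groups):
                m = groups == g
                bx[m] = x[m].mean()
                by[m] = y[m].mean()
            wx, wy = x - bx, y - by
            r_between = np.corrcoef(bx, by)[0, 1]  # association of class means
            r_within = np.corrcoef(wx, wy)[0, 1]   # association inside classes
            return r_between, r_within

        def correlation_to_distance(r):
            # One of several possible conversions: maps r = 1 to 0, r = -1 to 2.
            return np.sqrt(2.0 * (1.0 - r))

    Computing such a distance for every pair of features yields one supervised feature distance matrix per correlation variant; running each variant through the assignment sketch above arranges the same sample values into differently laid-out images, which is the data-augmentation mechanism the abstract describes for settings where samples are far fewer than features.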
    Reference: Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503–511.

    Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R., & Korsmeyer, S. J. (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature genetics, 30(1), 41–47.

    Bazgir, O., Zhang, R., Dhruba, S. R., Rahman, R., Ghosh, S., & Pal, R. (2020). Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nature communications, 11(1), 4391.

    Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade: Second Edition (pp. 437–478).

    Bertucci, F., Salas, S., Eysteries, S., Nasser, V., Finetti, P., Ginestier, C., Charafe-Jauffret, E., Loriod, B., Bachelart, L., Montfort, J., et al. (2004). Gene expression profiling of colon cancer by DNA microarrays and correlation with histoclinical parameters. Oncogene, 23(7), 1377–1391.

    Chollet, F. (2021). Deep learning with Python. Simon and Schuster.

    Ciregan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3642–3649). IEEE.

    Dansereau, F., Alutto, J. A., & Yammarino, F. J. (1984). Theory testing in organizational behavior: The varient approach. Prentice Hall.

    Díaz-Uriarte, R. (2005). Supervised methods with genomic data: a review and cautionary view. Data analysis and visualization in genomics and proteomics (pp. 193–214).

    Gu, Q., Li, Z., & Han, J. (2012). Generalized fisher score for feature selection. arXiv preprint arXiv:1202.3725.

    Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., & Cuadros, J. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22), 2402–2410.

    Hart, P. E., Stork, D. G., & Duda, R. O. (2000). Pattern classification. Wiley, Hoboken.

    Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8), 1509–1515.

    Jirapech-Umpai, T. & Aitken, S. (2005). Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC bioinformatics, 6(1), 1–11.

    Kamnitsas, K., Ledig, C., Newcombe, V. F., Simpson, J. P., Kane, A. D., Menon, D. K., Rueckert, D., & Glocker, B. (2017). Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical image analysis, 36, 61–78.

    Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6), 673–679.

    Kim, K., Zhang, S., Jiang, K., Cai, L., Lee, I.-B., Feldman, L. J., & Huang, H. (2007). Measuring similarities between gene expression profiles through new data transformations. BMC bioinformatics, 8, 1–14.

    Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

    Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48(4), 869–885.

    Li, Y., Campbell, C., & Tipping, M. (2002). Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18(10), 1332–1339.

    Ma, S. & Zhang, Z. (2018). OmicsMapNet: Transforming omics data to take advantage of deep convolutional neural network for discovery. arXiv preprint arXiv:1804.05283.

    Odena, A., Olah, C., & Shlens, J. (2017). Conditional image synthesis with auxiliary classifier GANs. In International conference on machine learning (pp. 2642–2651). PMLR.

    Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.

    Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

    Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A., & Tsunoda, T. (2019). DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Scientific reports, 9(1), 11399.

    Simonyan, K. & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

    Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2), 203–209.

    Wainberg, M., Merico, D., Delong, A., & Frey, B. J. (2018). Deep learning in biomedicine. Nature biotechnology, 36(9), 829–838.

    Wu, H.-M., Tien, Y.-J., Ho, M.-R., Hwu, H.-G., Lin, W.-C., Tao, M.-H., & Chen, C.-H. (2018). Covariate-adjusted heatmaps for visualizing biological data via correlation decomposition. Bioinformatics, 34(20), 3529–3538.

    Yeung, K. Y., Bumgarner, R. E., & Raftery, A. E. (2005). Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics, 21(10), 2394–2402.

    Zhu, Y., Brettin, T., Xia, F., Partin, A., Shukla, M., Yoo, H., Evrard, Y. A., Doroshow, J. H., & Stevens, R. L. (2021). Converting tabular data into images for deep learning with convolutional neural networks. Scientific reports, 11(1), 11325.
    Description: Master's thesis
    National Chengchi University
    Department of Statistics
    110354011
    Source URI: http://thesis.lib.nccu.edu.tw/record/#G0110354011
    Data Type: thesis
    Appears in Collections: [Department of Statistics] Theses

    Files in This Item:

    File: 401101.pdf  (Size: 14087 KB; Format: Adobe PDF)


    All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.

