A Virtual Multi-label Approach to Imbalanced Data Classification
Imbalanced classification problem
|Issue Date: ||2020-09-02 11:41:50 (UTC+8)|
|Abstract: ||When classifying imbalanced data, most supervised learning methods treat the majority class as the main learning target while the algorithm is being built. They therefore sacrifice information about the minority class, which degrades the performance of the classifier. To address this problem, this study proposes a new classification approach, combined with the Equal Kmeans clustering method, that handles the imbalance through virtual multi-class labels. The proposed method is compared with commonly used remedies for imbalance: sampling methods (oversampling, undersampling, and SMOTE) and classifier-based methods (SVM and One-Class SVM). The results show that the proposed method performs better as the degree of data imbalance increases and gradually outperforms the other methods.|
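The abstract does not spell out the algorithm, so the following is only an illustrative sketch under stated assumptions, not the thesis's implementation. It assumes that the majority class is split into clusters of roughly the minority class's size, that each cluster becomes a "virtual" class, and that predictions for any virtual class are mapped back to "majority". The capacity-limited assignment is a simple heuristic stand-in for Equal Kmeans, the nearest-centroid rule is a placeholder classifier, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def equal_size_clusters(points, k, iters=10, seed=0):
    """Partition points into k clusters of near-equal size.

    Heuristic stand-in for Equal Kmeans (an assumption): Lloyd-style
    updates in which each cluster accepts at most ceil(n / k) points.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    cap = math.ceil(len(points) / k)
    labels = [0] * len(points)
    for _ in range(iters):
        sizes = [0] * k
        # Points that lie closest to some center pick their cluster first.
        order = sorted(
            range(len(points)),
            key=lambda i: min(math.dist(points[i], c) for c in centers),
        )
        for i in order:
            ranked = sorted(range(k), key=lambda j: math.dist(points[i], centers[j]))
            j = next(c for c in ranked if sizes[c] < cap)  # nearest non-full cluster
            labels[i] = j
            sizes[j] += 1
        # Move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def fit_centroids(majority, minority, k):
    """Build one centroid per virtual majority class plus one for the minority."""
    labels, centers = equal_size_clusters(majority, k)
    centroids = {f"majority_{j}": c for j, c in enumerate(centers)}
    centroids["minority"] = tuple(sum(col) / len(minority) for col in zip(*minority))
    return labels, centroids


def predict(centroids, x):
    """Nearest-centroid rule; any virtual majority class maps back to 'majority'."""
    name = min(centroids, key=lambda c: math.dist(x, centroids[c]))
    return "minority" if name == "minority" else "majority"


# Synthetic 9:1 imbalanced data (made up for illustration only).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
minority = [(rng.gauss(6, 0.5), rng.gauss(6, 0.5)) for _ in range(10)]

labels, centroids = fit_centroids(majority, minority, k=9)
sizes = Counter(labels)  # each of the 9 virtual classes holds 10 points
```

With `k` chosen as the ratio of majority to minority size, every virtual class ends up the same size as the minority class, so the multi-class problem the classifier sees is balanced. Any multi-class learner could replace the nearest-centroid rule in this sketch.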
|Source URI: ||http://thesis.lib.nccu.edu.tw/record/#G0107354002|
|Data Type: ||thesis|
|Appears in Collections:||[Department of Statistics] Theses|