An in-depth performance analysis of the oversampling techniques for high-class imbalanced dataset
Keywords: classification, imbalanced dataset, oversampling, performance analysis
Class imbalance occurs when the majority and minority classes contain unequal numbers of samples. The degree of imbalance can range from mild to severe. High-class imbalance can distort overall classification accuracy, because the model tends to assign most samples to the majority class. Such a model gives biased results, and its predictions for the minority class often have little effect on its measured performance. Oversampling is one way to deal with high-class imbalance, but only a few oversampling techniques are commonly used for this purpose. This study presents an in-depth performance analysis of oversampling techniques for the high-class imbalance problem. Oversampling balances the amount of data in each class so that model evaluation is not biased. We compared the performance of Random Oversampling (ROS), ADASYN, SMOTE, and Borderline-SMOTE. Each oversampling technique was combined with machine learning methods such as Random Forest, Logistic Regression, and k-Nearest Neighbor (KNN). The test results show that Random Forest with Borderline-SMOTE performs best among all the combinations tested, with an accuracy of 0.9997, precision of 0.9474, recall of 0.8571, F1-score of 0.9000, ROC-AUC of 0.9388, and PR-AUC of 0.8581.
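To make the difference between two of the compared techniques concrete, the following is a minimal sketch in plain Python, not the study's actual pipeline. It contrasts Random Oversampling, which duplicates existing minority samples, with a SMOTE-style step, which interpolates between a minority sample and one of its nearest minority neighbours. The function names, the toy two-dimensional data, the 95:5 class ratio, and the choice of k = 3 are all assumptions made for illustration. Borderline-SMOTE and ADASYN differ mainly in how they choose which minority samples to interpolate from, and that selection step is not shown here.

```python
import random


def random_oversample(minority, n_new, rng):
    """ROS sketch: duplicate randomly chosen minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style sketch: place each synthetic point on the line segment
    between a minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data (assumed for illustration): 95 majority and 5 minority samples.
rng = random.Random(0)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(5)]

# Generate enough minority samples to match the majority class size.
n_new = len(majority) - len(minority)
ros = minority + random_oversample(minority, n_new, rng)
smote = minority + smote_like(minority, n_new, k=3, rng=rng)

print(len(majority), len(ros), len(smote))  # prints: 95 95 95
```

In a sketch like this, oversampling would normally be applied to the training split only, so that duplicated or synthetic samples do not leak into the data used for evaluation.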
Please find the rights and licenses in Register: Jurnal Ilmiah Teknologi Sistem Informasi. By submitting the article or manuscript, the author(s) agree with this policy. No specific document sign-off is required.
Non-commercial use of the article is governed by the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.