Measuring Resampling Methods on Imbalanced Educational Dataset’s Classification Performance
DOI:
https://doi.org/10.26594/register.v10i1.3397
Keywords:
Classification, Educational data mining, Imbalance class, Resampling
Abstract
Imbalanced data refers to a condition in which the number of samples differs substantially between one class and the other class(es). This gives rise to the terms “majority” class, the class with more instances in the dataset, and “minority” class, the class with fewer instances. Educational data mining demands accurate analysis of student performance, which in turn requires an appropriate dataset to produce good accuracy. This study aims to measure the performance of resampling methods through the classification process on student performance datasets, which are also multi-class datasets. Thus, this study also measures how the methods perform on a multi-class classification problem. Using four public educational datasets, which consist of the results of an educational process, this study aims to get a better picture of which resampling methods are suitable for this kind of dataset. This research uses more than twenty resampling methods from the SMOTE variants library. For comparison, this study implements nine classification methods to measure the performance of the resampled data against the non-resampled data. According to the results, SMOTE-ENN is generally the better resampling method, since it produces an F1 score of 0.97 under the Stacking classification method, the highest among the methods compared. However, the resampling methods perform relatively poorly on the dataset with wider label variation. Future work is to investigate why the resampling methods cannot handle large class variation, since the F1 score on the student dataset is lower than on the other datasets.
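To make the two ingredients of the abstract concrete, the following is a minimal sketch of a basic SMOTE-style oversampling step and a multi-class F1 computation. It is not the study's pipeline: the study uses the SMOTE variants library and nine classifiers, whereas this sketch uses only the Python standard library. The function names `smote_oversample` and `macro_f1`, the neighbour count `k=3`, the toy data, and the choice of macro averaging are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, minority_label, n_new, k=3, seed=0):
    """Add n_new synthetic minority samples (basic SMOTE interpolation).

    Hypothetical helper for illustration; not the SMOTE variants library API.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return X + synthetic, y + [minority_label] * n_new


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): six "pass" samples versus three "fail" samples.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]
y = ["pass"] * 6 + ["fail"] * 3

X_res, y_res = smote_oversample(X, y, "fail", n_new=3)
print(Counter(y), "->", Counter(y_res))
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without exact duplicates. Hybrid methods such as SMOTE-ENN add a cleaning step after the oversampling; that step is omitted here.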
License
Copyright (c) 2024 Irfan Pratama, Putri Taqwa Prasetyaningrum, Albert Yakobus Chandra, Ozzi Suria
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The rights and licenses for this article are set out by Register: Jurnal Ilmiah Teknologi Sistem Informasi. By submitting a manuscript, the author(s) agree to the journal's policy; no specific document sign-off is required.