Measuring Resampling Methods on Imbalanced Educational Dataset’s Classification Performance

Authors

  • Irfan Pratama, Universitas Mercu Buana Yogyakarta
  • Putri Taqwa Prasetyaningrum, Universitas Mercu Buana Yogyakarta
  • Albert Yakobus Chandra, Universitas Mercu Buana Yogyakarta
  • Ozzi Suria, Universitas Mercu Buana Yogyakarta

DOI:

https://doi.org/10.26594/register.v10i1.3397

Keywords:

Classification, Educational data mining, Imbalance class, Resampling

Abstract

Imbalanced data refers to a condition in which the number of samples differs between one class and the other class(es). This gives rise to the term "majority" class, for the class with more instances in the dataset, and "minority" class(es), for the class(es) with fewer instances. Because educational data mining demands accurate measurement of student performance, the mining process requires an appropriate dataset to produce good accuracy. This study measures the performance of resampling methods through the classification process on student performance datasets, which are also multi-class datasets; thus, it also measures how these methods perform on a multi-class classification problem. Utilizing four public educational datasets, each containing the results of an educational process, this study aims to obtain a clearer picture of which resampling methods are suitable for that kind of data. The research uses more than twenty resampling methods from the SMOTE-variants library and, as a comparison, implements nine classification methods to measure the performance of the resampled data against the non-resampled data. According to the results, SMOTE-ENN is generally the better resampling method, as it produces an F1 score of 0.97 under the Stacking classification method, the highest among the methods tested. However, the resampling methods perform relatively poorly on datasets with wider label variation. Future work will dig deeper into why the resampling methods cannot handle large class variation, since the F1 score on the student dataset is lower than on the other datasets.
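For readers unfamiliar with the workflow the abstract describes, the sketch below illustrates how SMOTE-ENN resampling, a stacking classifier, and macro-averaged F1 scoring fit together. It is a minimal illustration only: it uses imbalanced-learn's SMOTEENN and scikit-learn's StackingClassifier as stand-ins, synthetic data in place of the four educational datasets, and arbitrarily chosen base learners, so it is not the authors' exact pipeline and will not reproduce the reported 0.97 F1 score.

```python
# Minimal sketch: resample an imbalanced multi-class training set with
# SMOTE-ENN, train a stacking ensemble, and report the macro F1 score.
# Synthetic data stands in for the educational datasets used in the paper.
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced three-class dataset (roughly 90% / 7% / 3% class proportions).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=3, weights=[0.90, 0.07, 0.03], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42,
)

# Resample only the training split so the test set stays untouched.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

# Stacking ensemble: two base learners plus a logistic-regression meta-learner
# (illustrative choices, not the nine classifiers compared in the study).
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("dt", DecisionTreeClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_res, y_res)

# Macro-averaged F1 weighs every class equally, which suits imbalanced data.
print("Macro F1:", f1_score(y_test, stack.predict(X_test), average="macro"))
```

Macro averaging gives each class equal weight regardless of its size, so it reflects minority-class performance better than accuracy does on imbalanced data, which is why the study reports F1 rather than accuracy.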

Published

2024-02-25

How to Cite

[1] I. Pratama, P. T. Prasetyaningrum, A. Y. Chandra, and O. Suria, “Measuring Resampling Methods on Imbalanced Educational Dataset’s Classification Performance”, Regist. J. Ilm. Teknol. Sist. Inf., vol. 10, no. 1, pp. 1–12, Feb. 2024.