Measuring Resampling Methods on Imbalanced Educational Dataset’s Classification Performance


  • Irfan Pratama Universitas Mercu Buana Yogyakarta
  • Putri Taqwa Prasetyaningrum Universitas Mercu Buana Yogyakarta
  • Albert Yakobus Chandra Universitas Mercu Buana Yogyakarta
  • Ozzi Suria Universitas Mercu Buana Yogyakarta


Classification, Educational data mining, Imbalance class, Resampling


Imbalanced data refers to a condition in which the number of samples differs between one class and the other class(es). This gives rise to the terms “majority” class, the class with more instances in the dataset, and “minority” class, the class with fewer instances. Educational data mining demands accurate analysis of student performance, and producing good accuracy requires an appropriate dataset. This study measures the performance of resampling methods through classification on student performance datasets. Because these datasets are multi-class, the study also measures how the methods perform on a multi-class classification problem. Using four public educational datasets, each consisting of the results of an educational process, this study aims to get a better picture of which resampling methods suit this kind of dataset. The research uses more than twenty resampling methods from the SMOTE variants library. For comparison, nine classification methods are applied to measure performance on the resampled data against the non-resampled data. According to the results, SMOTE-ENN is generally the better resampling method, since it produces an F1 score of 0.97 under the Stacking classification method, the highest among all methods. However, the resampling methods perform relatively poorly on the dataset with wider label variation. Future work will investigate why the resampling methods cannot handle large class variation, given that the F1 score on the student dataset is lower than on the other datasets.
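As a rough illustration of why the F1 score is a more informative measure than accuracy on imbalanced multi-class data, the following sketch computes per-class and macro-averaged F1 in plain Python. The label names, class counts, and the choice of macro averaging are assumptions made for this example only; the abstract does not state which averaging the study uses, and the toy data is not taken from its datasets.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels: one majority class and two minority classes.
y_true = ["pass"] * 90 + ["fail"] * 8 + ["dropout"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["pass"] * 100

counts = Counter(y_true)
imbalance_ratio = max(counts.values()) / min(counts.values())
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"imbalance ratio: {imbalance_ratio:.1f}")
print(f"accuracy:        {accuracy:.3f}")
print(f"macro F1:        {macro_f1(y_true, y_pred):.3f}")
print(per_class_f1(y_true, y_pred))
```

In this toy case the always-majority classifier looks strong on accuracy while its macro F1 stays low, because both minority classes receive an F1 of zero. This is the kind of gap that resampling the minority classes is meant to close.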




