The Effect of Resampling on Classifier Performance: An Empirical Study

Utomo Pujianto, Muhammad Iqbal Akbar, Niendhitta Tamia Lassela, Deni Sutaji

Abstract


Class imbalance is a common problem in classification: training a classifier on an imbalanced dataset can degrade its performance, and resampling is one of the standard remedies. This study used 100 datasets drawn from three repositories: UCI Machine Learning, Kaggle, and OpenML. Each dataset went through three processing stages: resampling, classification, and significance testing, in which the performance of each classifier-resampling combination was compared using paired t-tests. The resampling techniques were Random Undersampling, Random Oversampling, and SMOTE; the classifiers were the Naïve Bayes classifier, Decision Tree, and Neural Network. The resulting accuracy, precision, recall, and f-measure values were tested with paired t-tests to determine whether classifier performance differed significantly between the original datasets and their resampled counterparts, and to identify classifier-resampling combinations that yield significant results. The study produced two findings. First, resampling an imbalanced-class dataset can significantly improve classifier performance relative to training on the unresampled data. Second, the Neural Network without resampling gave significant results in terms of accuracy, while the Neural Network combined with SMOTE gave significant performance in terms of precision, recall, and f-measure.
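The pipeline the abstract describes, resampling each dataset and then running a paired t-test on the matched per-dataset scores, can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names (`random_undersample`, `random_oversample`, `smote_like`, `paired_t`) are placeholders, and the SMOTE step is a simplified nearest-neighbour interpolation of the original technique.

```python
import math
import random

def _by_class(X, y):
    # Group feature rows by their class label.
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return groups

def random_undersample(X, y, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    groups = _by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    Xr, yr = [], []
    for label, rows in groups.items():
        for xi in rng.sample(rows, n_min):
            Xr.append(xi)
            yr.append(label)
    return Xr, yr

def random_oversample(X, y, seed=0):
    """Randomly duplicate rows from smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    groups = _by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    Xr, yr = [], []
    for label, rows in groups.items():
        picks = rows + [rng.choice(rows) for _ in range(n_max - len(rows))]
        Xr.extend(picks)
        yr.extend([label] * len(picks))
    return Xr, yr

def smote_like(X_min, n_new, k=5, seed=0):
    """SMOTE-style synthesis: create new minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    new = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        neighbours = sorted((p for p in X_min if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        new.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return new

def paired_t(scores_a, scores_b):
    """Paired t statistic for two matched score lists (e.g. per-dataset accuracy
    with vs. without resampling): t = mean(d) / (sd(d) / sqrt(n))."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
    return mean / (sd / math.sqrt(n))
```

In practice the t statistic would be compared against the critical value of Student's t distribution with n − 1 degrees of freedom to decide significance, as the study does for accuracy, precision, recall, and f-measure.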






DOI: http://dx.doi.org/10.17977/um018v5i12022p87-100



Copyright (c) 2022 Knowledge Engineering and Data Science

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
