Manifold Learning and Undersampling Approaches for Imbalanced Class Sentiment Classification

L. M. Risman Dwi Jumansyah; Agus Mohamad Soleh; Utami Dyah Syafitri

doi:10.17977/um018v7i22024p139-151

Abstract

Movie reviews are crucial in determining a film's success because they influence audience decisions. Automating sentiment classification is essential for efficient public opinion analysis, but it faces two challenges: high-dimensional data and imbalanced class distributions. This study addresses both by applying manifold learning techniques, namely Principal Component Analysis (PCA) and Laplacian Eigenmaps (LE), to reduce data complexity, and undersampling strategies, namely Random Undersampling (RUS) and EasyEnsemble, to balance the data and improve predictions for both sentiment classes. On reviews of The Raid 2: Berandal, EasyEnsemble achieved the highest average G-Mean of 0.694 using Term Frequency-Inverse Document Frequency (TF-IDF) features with a linear kernel and no dimensionality reduction. RUS provided balanced but inconsistent results, while Random Oversampling (ROS) combined with PCA (retaining 85% cumulative variance) improved predictions for negative reviews. Laplacian Eigenmaps with 500 dimensions were effective for negative reviews but less accurate for positive ones. This study highlights EasyEnsemble's superior performance in addressing class imbalance, though optimizing it in combination with manifold learning remains challenging.
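The abstract reports results as G-Mean and relies on random undersampling, but defines neither. The sketch below illustrates both, assuming the usual binary definition of G-Mean as the square root of sensitivity times specificity. The function names `g_mean` and `random_undersample` are hypothetical and are not taken from the paper; this is a minimal illustration, not the authors' implementation.

```python
import math
import random


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed binary definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Recall on each class; 0.0 if a class is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def random_undersample(X, y, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced labels: 8 positive reviews, 2 negative reviews.
y_true = [1] * 8 + [0] * 2
always_positive = [1] * 10

# Accuracy would be 0.8 here, yet no negative review is ever detected.
print(g_mean(y_true, always_positive))

X_bal, y_bal = random_undersample(list(range(10)), y_true)
print(len(y_bal), y_bal.count(1), y_bal.count(0))
```

A classifier that always predicts the majority class scores a G-Mean of zero even though its accuracy looks high, which is why the metric suits imbalanced data. EasyEnsemble, as generally described, repeats this kind of undersampling over several random subsets of the majority class and combines the resulting classifiers, which reduces the information lost by a single random draw.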

DOI: http://dx.doi.org/10.17977/um018v7i22024p139-151