Maximum Marginal Relevance and Vector Space Model for Summarizing Students' Final Project Abstracts

Gunawan Gunawan, Fitria Fitria, Esther Irawati Setiawan, Kimiya Fujisawa

Abstract


Automatic summarization uses a computer program to reduce a text document to a summary that retains the essential parts of the original. It is needed to cope with information overload as the amount of data keeps increasing. A summary presents the main contents of a long article in concise form, so that the reader can grasp its central idea quickly. In its simplest form, summarization takes the essential parts of an article and presents them again as a summary. In this research, the user first selects or searches for the text documents to be summarized, using keywords in the abstract as the query. The proposed approach preprocesses each document by sentence breaking, case folding, word tokenizing, filtering, and stemming. The preprocessed text is weighted by term frequency-inverse document frequency (tf-idf); query relevance is then computed with the vector space model, and sentence similarity with cosine similarity. The next stage applies maximum marginal relevance for sentence extraction. The proposed approach provides comprehensive summarization compared with another approach. Compared with manual summaries, the test results give an average precision of 88%, recall of 61%, and f-measure of 70%.
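The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy English stopword list, the smoothed idf formula, the trade-off parameter `lam`, and the function names (`preprocess`, `tfidf_vectors`, `mmr`, `summarize`, `precision_recall_f1`) are all assumptions made for illustration, and stemming is omitted. The evaluation helper assumes that precision and recall are computed over the sets of sentences chosen by the system and by the manual summary, which the abstract does not specify.

```python
import math
import re
from collections import Counter

# Assumption: a toy stopword list standing in for the paper's filtering step.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for"}


def tokenize(text):
    """Case folding, word tokenizing, and stopword filtering (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def preprocess(text):
    """Sentence breaking followed by tokenization of each sentence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sentences, [tokenize(s) for s in sentences]


def tfidf_vectors(token_lists):
    """Weight each sentence as a sparse tf-idf vector (sentences act as documents)."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed idf (assumption)
    vecs = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in token_lists]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(sent_vecs, query_vec, k, lam=0.7):
    """Greedy MMR: trade query relevance against similarity to chosen sentences."""
    selected, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(sent_vecs[i], query_vec)
            redundancy = max(
                (cosine(sent_vecs[i], sent_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def summarize(text, query, k=2, lam=0.7):
    """Return up to k extracted sentences in their original document order."""
    sentences, tokens = preprocess(text)
    vecs, idf = tfidf_vectors(tokens)
    # Query terms that never occur in the document get zero weight.
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    return [sentences[i] for i in sorted(mmr(vecs, q_vec, k, lam))]


def precision_recall_f1(system, reference):
    """Set-overlap scores between system and manual sentence selections."""
    overlap = len(set(system) & set(reference))
    p = overlap / len(system) if system else 0.0
    r = overlap / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In this sketch, a lower `lam` penalizes sentences that resemble ones already selected, so a duplicated sentence is skipped in favor of a less relevant but novel one; with `lam=1.0` the selection reduces to plain query-relevance ranking.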



DOI: http://dx.doi.org/10.17977/um018v6i12023p57-68


Copyright (c) 2023 Knowledge Engineering and Data Science

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
