Similarity Identification of Large-scale Biomedical Documents using Cosine Similarity and Parallel Computing

Merlinda Wibowo, Christoph Quix, Nur Syahela Hussien, Herman Yuliansyah, Faisal Dharma Adhinata


Document similarity computation is an important research topic in information retrieval and a crucial step in automatic document categorization. A similarity value lies between 0 and 1: the closer the value is to 1, the more relevant the two documents are to each other, and vice versa. However, the large scale of textual information makes it difficult to determine the level of relevance between documents. This study applies a cosine similarity algorithm to identify the similarity between large-scale biomedical documents from PubMed. The relevance between MeSH heading texts in the PubMed documents is higher than the relevance between abstract texts. Furthermore, parallel computing is implemented to speed up the large-scale document similarity identification process, which is calculated automatically in the PubMed application. The execution time is 15.447 seconds for MeSH headings and 74.191 seconds for abstracts. The abstracts take longer to process than the MeSH headings because they contain more words. The results, presented as a graph and a table in the PubMed application, show that the cosine similarity of the MeSH heading texts is higher than that of the abstract texts. Cosine similarity is useful for measuring the similarity between documents based on the TF*IDF calculation result.
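The approach summarized in the abstract (TF*IDF weighting, cosine similarity, and parallel evaluation of document pairs) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the whitespace tokenizer, the smoothed IDF formula, and the use of a thread pool are all assumptions made for illustration; a real CPU-bound workload would more likely use a process pool or a distributed framework.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF*IDF vector (term -> weight) per document."""
    # Assumption: naive lowercase whitespace tokenization, no stopword removal.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, so every weight is strictly positive.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(docs, max_workers=4):
    """Score every unordered document pair, spreading pairs across workers."""
    vectors = tfidf_vectors(docs)
    pairs = list(combinations(range(len(docs)), 2))

    def score(pair):
        i, j = pair
        return cosine_similarity(vectors[i], vectors[j])

    # Threads keep the sketch self-contained; they illustrate the task split
    # rather than a guaranteed speedup for pure-Python arithmetic.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pairs, pool.map(score, pairs)))
```

Because the TF*IDF weights are non-negative, every score falls between 0 and 1, matching the range described in the abstract. Short texts such as MeSH headings produce smaller vectors than abstracts, which is consistent with the shorter execution time reported for them.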




Copyright (c) 2021 Knowledge Engineering and Data Science

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
