Similarity Identification of Large-scale Biomedical Documents using Cosine Similarity and Parallel Computing


Document similarity computation is an important research topic in information retrieval and a crucial issue for automatic document categorization. The similarity value lies between 0 and 1: the closer the value is to 1, the more relevant the two documents are to each other, and vice versa. However, the large scale of textual information makes it difficult to find the relevance level between documents. This study identifies the similarity between large-scale biomedical documents from PubMed using a cosine similarity algorithm, which measures the similarity between documents based on the TF*IDF calculation result. Parallel computing is implemented to speed up the large-scale document similarity identification process, which is calculated automatically in the PubMed application. The execution time for the mesh heading text is 15.447 seconds, and the execution time for the abstract text is 74.191 seconds. The execution time for mesh heading is lower than for abstract because the abstract contains more words than the mesh heading. The results, presented as a graph and a table in the PubMed application, show that the cosine similarity of the mesh heading texts is higher than that of the abstract texts; that is, the relevance between mesh heading texts in the PubMed documents is higher than the relevance between abstract texts.
Medical Subject Headings (MeSH) is the thesaurus for indexing, cataloging, and searching biomedical and health-related information, and it is provided by the National Library of Medicine. The relevance between mesh heading text in the PubMed documents is higher than the relevance of the abstract text in the PubMed documents.

I. Introduction
Text mining in big data analytics is emerging as a powerful tool for harnessing unstructured textual data by analyzing it to extract new knowledge and to identify significant patterns and correlations hidden in the data [1] [5]. Furthermore, quickly detecting similar documents has become an increasingly fundamental problem over time [6]. This difficulty is closely related to the semantic aspect of these documents. Manual comparison is possible and gives good results, but it is not feasible for a large corpus. Therefore, document similarity computation is an important research topic in information retrieval, and it is a crucial issue for automatic document categorization. Moreover, parallel computing for big data reduces the processing time so that similar documents can be detected quickly [7] [8]. Thus, the parallelization of big data is emerging as an essential framework for large-scale parallel data applications.
Some studies determine the similarity between texts using extracted keywords generated from term frequency-inverse document frequency (TF*IDF) [9][10] [11] [12]. This research focuses on detecting the similarity of documents. The method for calculating similarity is cosine similarity, and the results demonstrate that cosine similarity can quantify the difference between text documents. Keyword extraction is a vital step that extracts appropriate keywords, making it easier to choose which document to read and to learn the relationship between documents in tasks such as document retrieval, web page retrieval, document clustering, summarization, and text mining. It automatically identifies the terms that best describe a document [2][9] [13]. To obtain a suitable text relevance algorithm for calculating the relevance between two documents, many studies have implemented cosine similarity [9][14] [15]. Cosine similarity is useful for measuring the similarity between documents based on the result of the keyword extraction. However, large-scale documents require extra execution time. Therefore, parallel computing is implemented to increase computing speed by running several tasks simultaneously on the same data [7] [8]. Parallel computing refers to breaking a larger problem into smaller, independent parts. These parts can often be executed concurrently by multiple processors communicating via shared memory, and the results are combined upon completion as part of the overall algorithm. The main purpose of parallel computing is to increase the available computing power for faster application processing and problem solving.
This research aims to develop a text mining application that adapts a text similarity algorithm to the biomedical domain in order to identify the relationship and relevance between large-scale documents. The implemented algorithms are run on a set of published biomedical articles for which expert keyword annotations exist, so that these can be compared with the keywords extracted automatically by a parallel computing engine.

II. Methods
In this study, the similarity identification framework provides a guideline for conducting and organizing the research. The framework illustrated in Figure 1 shows the workflow, which is divided into several research phases that describe the action plan step by step. Each phase must produce its output to ensure that the research goals are achieved.

A. Master Data
PubMed is an open-access search engine launched in January 1996 and made freely available online a year and a half later. It has become one of the most commonly used search tools for retrieving scientific data, and an almost continuous increase in the number of searches performed has been observed in the biomedical and life sciences [2] [16][17] [18]. PubMed is provided by the United States National Library of Medicine (NLM). MEDLINE, a central bibliographic database also maintained by the NLM, is the electronic database most commonly used in systematic reviews of biomedical research. It covers articles published from 1946 to the present, primarily in scholarly journals. The database, with 24 million records, is freely accessible via the PubMed website. A sample of PubMed documents is depicted in Figure 2. Figure 2(b) shows the dataset in XML format. Each XML file consists of different published articles, with more than three thousand articles in every XML file. The dataset is stored in MongoDB to support the parallel computing process for document similarity identification. MongoDB is the most popular NoSQL database system [19].
MongoDB is a cross-platform, document-oriented database system. As a NoSQL database, MongoDB avoids traditional table-based relational structures in favor of JSON-like documents with dynamic schemas, making data integration in some types of applications easier and faster. Data is stored in documents consisting of keys and values whose types and sizes are variable (not fixed in advance). Figure 3 illustrates a sample of the PubMed documents stored in MongoDB. The data successfully inserted into MongoDB is used in the subsequent processes. The dataset is stored in JSON format inside a MongoDB collection with the same tags as the data in XML format, and these tags are used to read the data in the subsequent processes. Unlike a SQL database, MongoDB does not use SQL queries to read the data.
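To make the key-value document structure concrete, the following is a minimal sketch of how one stored record might look and how the two tags could be read from it. The field names (`PMID`, `ArticleTitle`, `Abstract`, `MeshHeadingList`), the values, and the helper `extractTags` are placeholders assumed for illustration; they are not the schema or code used in this study.

```javascript
// Hypothetical shape of one PubMed record after conversion from XML to JSON.
// All field names and values below are illustrative assumptions.
const sampleRecord = {
  PMID: "00000000",
  ArticleTitle: "Example article title",
  Abstract: "Example abstract text about a biomedical topic.",
  MeshHeadingList: ["Humans", "Neoplasms", "Risk Factors"],
};

// Read the two tags used for similarity identification from a record,
// falling back to empty text when a tag is missing.
function extractTags(record) {
  return {
    abstract: record.Abstract || "",
    meshHeading: (record.MeshHeadingList || []).join(" "),
  };
}

console.log(extractTags(sampleRecord));
```

Because the schema is dynamic, a record may lack either tag, which is why the sketch falls back to an empty string instead of failing.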

B. Documents Similarity Engine
Machine learning is a type of artificial intelligence that can learn from data without following explicitly programmed instructions [4]. Machine learning helps find solutions that optimize performance by using sample data or previous experience to gain new insights, reveal new patterns, and produce more accurate results. This research implements machine learning in the document similarity engine to identify the similarity between large-scale documents, known as the master data, by automatically extracting keywords using Node.js. JavaScript originally ran only on the client (browser) side; Node.js extends its role so that it can also serve as a server-side programming language, like PHP, Ruby, or Perl. With parallel computing, the engine reduces the processing time and quickly detects the relationship and relevance between large-scale documents.
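The idea of splitting the work into independent parts, processing them concurrently, and combining the results can be sketched in Node.js as follows. The text does not state which Node.js mechanism the engine uses, so the use of `worker_threads` here is an assumption, as are the helper names `splitIntoChunks`, `runWorker`, and `countWordsInParallel`. Word counting stands in for the real per-document work.

```javascript
const { Worker } = require("worker_threads");

// Split a list of work items into roughly equal, independent chunks.
function splitIntoChunks(items, chunkCount) {
  const size = Math.ceil(items.length / chunkCount);
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Code executed inside each worker thread: count the words of every text
// in its chunk. A real engine would do TF*IDF or similarity work here.
const workerSource = `
  const { parentPort, workerData } = require("worker_threads");
  const counts = workerData.map((text) => text.split(/\\s+/).filter(Boolean).length);
  parentPort.postMessage(counts);
`;

// Start one worker for a chunk and resolve with its partial result.
function runWorker(chunk) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: chunk });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}

// Process all chunks concurrently, then combine the partial results in order.
async function countWordsInParallel(texts, workerCount) {
  const chunks = splitIntoChunks(texts, workerCount);
  const partials = await Promise.all(chunks.map(runWorker));
  return partials.flat();
}

countWordsInParallel(["a b c", "d e", "f", "g h i j"], 2).then((counts) => {
  console.log(counts);
});
```

`Promise.all` preserves the order of the chunks, so the combined result lines up with the input texts regardless of which worker finishes first.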

1) Preprocessing
At this stage, the results obtained from the master data automatically go through preprocessing. The tags used in this study are Mesh Heading and Abstract. Both tags can represent the entire contents of a published article as testing data. Preprocessing reduces the number of words by removing stopwords and converting words into their basic form (stemming) [9] [20]. Stopwords are words that are not features or unique words of a document, such as conjunctions. Including stopword removal in text transformation makes the whole text mining system language-dependent, which is a weakness of the stopword removal process. However, stopword removal is still used because it significantly reduces the system workload. After the stopwords of a text are removed, the system considers only the words deemed important.
Stemming reduces derived words to their word stem, base, or basic form. One of the most widely used stemming algorithms is the Porter Stemmer [9] [20]. The process of treating words with the same stem as synonyms, for example in query expansion for search engines, is called conflation. The stem need not be identical to the morphological root of a word since, for purposes of conflation, it is usually sufficient that related words map to the same stem even if this stem is not in itself a valid root. An example of the preprocessing is depicted in Figure 4.
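The preprocessing steps described above (lowercasing, punctuation removal, stopword removal, stemming) can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list is a tiny sample, and `naiveStem` is a crude suffix stripper used as a stand-in, not the Porter Stemmer. The names `STOPWORDS`, `naiveStem`, and `preprocess` are hypothetical.

```javascript
// Small illustrative stopword list; a real system would use a fuller one.
const STOPWORDS = new Set(["a", "an", "and", "are", "in", "is", "of", "or", "the", "to", "with"]);

// Very rough suffix stripping. This is NOT the Porter algorithm; it only
// shows where stemming sits in the pipeline.
function naiveStem(word) {
  for (const suffix of ["ing", "ed", "es", "s"]) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Lowercase, strip punctuation, tokenize, drop stopwords, then stem.
function preprocess(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0 && !STOPWORDS.has(token))
    .map(naiveStem);
}

console.log(preprocess("The patients were treated with inhibitors."));
```

Replacing `naiveStem` with a real Porter implementation would change only the last step of the pipeline.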

2) Representative Algorithm: TF*IDF
This phase applies the representative algorithm, TF*IDF. The TF*IDF statistic, short for term frequency times inverse document frequency, can extract keywords from a document by considering both the single document and all documents in the corpus [2] [21]. A phrase P is a promising candidate for a keyword in a specific document D if it shows up relatively often within the document and rarely in the rest of the corpus, as expressed in Equation (1):

$$\mathrm{TF{*}IDF}(P, D) = \frac{\mathrm{freq}(P, D)}{\mathrm{size}(D)} \times \left(-\log_2 \frac{\mathrm{df}(P)}{N}\right) \qquad (1)$$

where freq(P,D) is the number of times P occurs in document D, size(D) is the number of words in document D, df(P) is the number of documents containing P in the global corpus, and N is the size of the global corpus.
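A minimal sketch of this weighting, using freq(P,D), size(D), df(P), and N as defined above, might look as follows. The function name `tfidf`, the input format (arrays of already preprocessed tokens), and the base-2 logarithm are assumptions made for illustration.

```javascript
// Compute TF*IDF weights for every term of every document.
// docs: array of token arrays (already preprocessed).
// Returns one Map (term -> weight) per document.
function tfidf(docs) {
  const N = docs.length; // size of the corpus

  // df(P): number of documents that contain term P.
  const df = new Map();
  for (const tokens of docs) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  return docs.map((tokens) => {
    // freq(P, D): number of occurrences of P in D.
    const freq = new Map();
    for (const term of tokens) {
      freq.set(term, (freq.get(term) || 0) + 1);
    }
    const weights = new Map();
    for (const [term, count] of freq) {
      const tf = count / tokens.length; // freq(P, D) / size(D)
      const idf = Math.log2(N / df.get(term)); // equals -log2(df(P) / N); base 2 is assumed
      weights.set(term, tf * idf);
    }
    return weights;
  });
}

const weights = tfidf([
  ["gene", "cancer", "gene"],
  ["cancer", "therapy"],
]);
console.log(weights);
```

In this toy corpus, "cancer" appears in every document, so its weight is zero, while terms unique to one document receive positive weights.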

3) Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them [9][14] [15]. The similarity between two vectors in a multidimensional space is obtained from the cosine of the angle between them, computed from their inner product. Because the cosine of 0° is 1 and is less than 1 for any other angle, two vectors are most similar when the cosine similarity is 1.
Cosine similarity is used in positive space, where the result is bounded between 0 and 1. If the value is 1, the documents are similar; if the value is 0, the documents are dissimilar [9][14] [15]. This bound applies for any number of dimensions. Therefore, cosine similarity is most often used in high-dimensional positive spaces. For example, in information retrieval, each term is treated as a different dimension, and a document is represented by a vector in which the value of each dimension corresponds to how many times the term appears. Cosine similarity is computed as in Equation (2):

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (2)$$

where A_i and B_i are the components of vectors A and B, that is, the weights of each feature in A and in B. In information retrieval terms, A_i is the weight of term i in document A, and B_i is the weight of term i in document B. In this study, cosine similarity is used because large-scale PubMed documents are high-dimensional data. Large-scale PubMed documents contain many published articles, and each document consists of many different tags. Similarity is measured by comparing document 1 with document 2, after which the system calculates the similarity value.
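As a minimal sketch of this measure, the function below computes the cosine similarity of two documents represented as sparse term-weight vectors, with A_i and B_i taken as the weights of term i. The `Map`-based representation, the function name `cosineSimilarity`, and the sample weights are assumptions for illustration, not the authors' listing.

```javascript
// Cosine similarity between two sparse term-weight vectors (Map: term -> weight).
function cosineSimilarity(a, b) {
  // Dot product: only terms present in both documents contribute.
  let dot = 0;
  for (const [term, weight] of a) {
    if (b.has(term)) dot += weight * b.get(term);
  }
  // Euclidean norm of a vector.
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
  const denom = norm(a) * norm(b);
  // The measure is undefined for zero vectors; returning 0 is a convention chosen here.
  return denom === 0 ? 0 : dot / denom;
}

const docA = new Map([["gene", 1], ["cancer", 2]]);
const docB = new Map([["cancer", 2], ["therapy", 1]]);
console.log(cosineSimilarity(docA, docB));
```

With non-negative weights the result stays between 0 and 1: documents with no shared terms score 0, and a document compared with itself scores 1 (up to floating-point rounding).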

C. Similarity Identification Result
In this stage, the identification results of document similarities are presented in a graph, a statistical table, and a web application. Visualizing the data with a graph and a statistical table is intended to make the results easier to present and understand [4] [22]. Meanwhile, web application development can enhance the end-user experience, enable real-time data collection, and provide custom content [22]. This study shows the graph and statistical table in the web application after the document similarity engine has finished processing. The web interface of the PubMed application is depicted in Figure 5. The documents are uploaded to the application, which automatically calculates the similarity between biomedical documents with parallel computing, reducing the processing time and quickly detecting the relationship and relevance between large-scale documents. The results are then presented as a graph and a table that make the calculation results easier to read.

III. Results and Discussions
The PubMed application was developed as a document similarity identification engine: an intelligent application that automatically calculates the similarity between biomedical documents and then visualizes the identification result as a graph and a table. The calculation uses parallel computing, which reduces the processing time and quickly detects the relationship and relevance between large-scale documents. The first process is storing the master data in MongoDB. Then punctuation is removed, the text is converted to lower case, stopwords are removed, and the basic words are extracted using the Porter stemming algorithm. Two tags were used in this study, abstract and mesh heading. These tags are used to read the data for the next process. Figure 6 depicts a sample abstract dataset from PubMed publications captured from MongoDB. The captured dataset is then transformed into basic words. The basic words include biomedical words such as chemical formulations, medicine names, and others, so these need to be taken into account. The program listing for obtaining the extracted keywords can be seen in the preprocessing program. The input of the preprocessing program is all abstract data, and the output is the string of each word from the abstract. The first step of preprocessing is removing all conjunctions and punctuation from the abstract and then transforming the letters into lowercase. The next step is stemming the words into their roots.

Preprocessing program
Input: abs_all
Output: all_string
Initialization: var abs_all, all_string, removed_conjuction, text_array, reg, rm_punctutation, reg
removed_conjuction = abstrak_fix.replace(regex_rm_conjuction, " ")
text_array = removed_conjuction.replace(/(\s)?\d\s+/g, ' ').replace(/\n+/g, ' ').split(" ")

A sample of the extracted keywords is depicted in Figure 7. Afterward, the extracted keywords are weighted by calculating the frequency of occurrence of each word of the testing document in each document in the dataset. This phase applies the TF*IDF algorithm, which can extract keywords from a document by considering both the single document and all documents in the corpus. Finally, the TF*IDF calculation result is used to calculate the similarity of the testing documents with the PubMed documents using the cosine similarity algorithm. The program listing for obtaining the term frequency value can be seen in the TFIDF program. A sample of the TF*IDF results stored in MongoDB is shown in Figure 8.

TFIDF program
Cosine similarity is used in positive space, where the outcome is bounded between 0 and 1, so this similarity calculation results in a value between 0 and 1. The closer the value is to 1, the more related the two documents are, and vice versa. The similarity process produces cosine similarity values between one document and each of the other documents. The document comparison focused on the Abstract and Mesh Heading tags of the PubMed publication documents as the testing data. The code listing for measuring the cosine similarity between documents can be seen in the cosine similarity program. Figure 9 shows sample cosine similarity results between an abstract text and the abstracts of other publications, and between a mesh heading text and the mesh headings of other publications. For example, the cosine similarity between document 2 and document 1 for the mesh heading of published articles in the PubMed documents is 0.0045, which corresponds to 0.45%. Figure 10 illustrates the result of the cosine similarity measurement between documents, using the abstract and mesh heading text of each PubMed document. The graph shows that the cosine similarity of the mesh heading texts is higher than that of the abstract texts. That is, the relevance between mesh heading texts in the PubMed documents is higher than the relevance between abstract texts. Hence, the relationship and correlation between published articles in PubMed documents can be identified from the mesh heading text. The number of words and terms in the abstract can affect the text similarity results. In addition, the mesh heading tag can be used for subsequent data processing, such as classifying or clustering the PubMed documents.
The visualizations of the similarity calculation result depicted in Figure 9 and Figure 10, known as the similarity identification results, make the comparison easier to present and understand. These results are shown in the PubMed application and are produced by its parallel computing engine, which reduces the processing time and quickly detects the relationship and relevance between large-scale biomedical documents. Figure 11 shows the execution time of the similarity engine application. The execution time for mesh heading is 15.447 seconds, and the execution time for abstract is 74.191 seconds. The execution time for mesh heading is lower than for abstract because the abstract contains more words than the mesh heading.

Cosine_similarity program
The document similarity identification application has successfully identified the similarity between large-scale PubMed documents, known as biomedical documents. Implementing cosine similarity and parallel computing as the document similarity engine allows the documents to be processed faster. The execution time for mesh heading is 15.447 seconds, and the execution time for abstract is 74.191 seconds. Based on these results, the mesh heading runtime is lower than the abstract runtime because the abstract contains more words than the mesh heading. In addition, the abstract and mesh heading tags can represent the similarity between documents. The results show that the cosine similarity of the mesh heading texts is higher than that of the abstract texts.

IV. Conclusion
The document similarity identification application has successfully identified the similarity between large-scale PubMed documents, known as biomedical documents. This study implemented cosine similarity and parallel computing as the document similarity engine, which processed the documents faster. The execution time for mesh heading is 15.447 seconds, and the execution time for abstract is 74.191 seconds. The mesh heading runtime is lower than the abstract runtime because the abstract contains more words than the mesh heading. The abstract and mesh heading tags can represent the similarity between documents, and the results show that the cosine similarity of the mesh heading texts is higher than that of the abstract texts. In other words, the relevance between mesh heading texts in the PubMed documents is higher than the relevance between abstract texts. On the other hand, the number of words and terms in the abstract can affect the percentage of text similarity. In the future, the mesh heading and abstract tags can be used for further data processing, such as classifying or clustering datasets.