Maximum Marginal Relevance and Vector Space Model for Summarizing Students' Final Project Abstracts

ABSTRACT
Automatic summarization reduces a text document with a computer program to create a summary that retains the essential parts of the original document. As the amount of data keeps increasing, automatic summarization is necessary to deal with information overload. A summary gives brief access to the contents of an article: it presents extended information in a concise form and tells the reader the essence of the central idea. In its simplest form, a summary takes the essential parts of the entire article and presents them again in condensed form. The steps in this research start with the user selecting or searching for the text document to be summarized, with the keywords in the abstract serving as the query. The proposed approach performs text preprocessing on the documents: sentence splitting, case folding, word tokenizing, filtering, and stemming. The preprocessed text is weighted by term frequency-inverse document frequency (tf-idf), then weighted for query relevance using the vector space model and for sentence similarity using cosine similarity. The next stage is maximum marginal relevance for sentence extraction. The proposed approach provides more comprehensive summaries than another approach. The test results, compared with manual summaries, yield an average precision of 88%, recall of 61%, and f-measure of 70%.

I. Introduction
A summary represents the article's overview and conveys essential ideas to the reader [1]. Automatic text summarization reduces a text document with a computer program to create a summary that retains the essential parts of the original document [2] [3]. As the amount of data increases, automatic summarization is necessary to deal with information overload [4]. Automatic summarization can be applied to single or multiple documents and to different languages [5]. Therefore, an automatic summarizer can help people summarize data from web pages [6] [7], as well as final project and thesis abstracts [8].
Maximum marginal relevance (MMR) is an extractive summarization method that is used to summarize a single document or multiple documents [9] [10]. MMR summarizes documents by calculating the similarity between parts of the text [11] [12]. When documents are summarized with the MMR method, they are first segmented into sentences. MMR combines the cosine similarity matrix and the vector space model (VSM) to rank sentences in response to the query [13] [14]. Most modern information retrieval (IR) search engines produce ranked lists of documents ordered by decreasing relevance to the user query [15] [16]. The relevance of a summary is first assessed by measuring the relationship between the information in the document and the query given by the user, and the resulting measures are combined linearly as a matrix. This linear combination is called marginal relevance [17].
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources [18]. The NusaCrowd paper describes the datasets and standardized data loaders that were brought together through this initiative and discusses the quality of the datasets, which was assessed manually and automatically. We compared the performance of our approach with a summarization model from NusaCrowd. This article consists of four sections. The first section covers the introduction and context. The second section describes the research method. The third section presents the results and discussion, and the final section summarizes the conclusions.

II. Method
This research summarizes a document and generates its abstract using an automatic summary system [19] [20]. The stages in this research are preprocessing, tf-idf weighting, query relevance weighting, sentence similarity weighting, and MMR for summary extraction [21], as displayed in Figure 1 [22] [23]. The abstract documents are first preprocessed (sentence splitting, tokenization, case folding, stop word removal, and stemming). After preprocessing, tf-idf weighting is carried out, namely automatic weighting based on the number of occurrences of a word in a document (term frequency) and on the number of documents in the collection that contain it (inverse document frequency) [24]. The tf-idf weights are then used to calculate the query relevance weights with the vector space model and the sentence similarity weights with cosine similarity [25]. The query relevance weight results from comparing the similarity between the query (keywords) and all documents, whereas the sentence similarity weight results from comparing the similarity between documents. In the next stage, maximum marginal relevance is calculated iteratively by combining query relevance and sentence similarity to determine which documents are extracted as the summary [26].
The first step in the text preprocessing stage is sentence splitting, which breaks the long text string of a document into a collection of sentences. The document is split with the split() function, using the period ".", question mark "?", and exclamation point "!" as delimiters. The sentence splitting stage also removes these end-of-sentence marks (delimiters) from the resulting sentences. After sentence splitting, the following steps are tokenization, case folding, stop word removal, stemming, tf-idf weighting, VSM, cosine similarity, and MMR to obtain a summary.
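The sentence splitting step can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes Python and uses `re.split`, since a single call can split on all three delimiters; the function name `split_sentences` is hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text on '.', '?' and '!' and drop the delimiters themselves."""
    parts = re.split(r"[.?!]", text)
    # Discard empty fragments, e.g. the one after the final delimiter.
    return [part.strip() for part in parts if part.strip()]

sentences = split_sentences("MMR ranks sentences. Is it extractive? Yes!")
```

A delimiter-only split of this kind also breaks on periods that do not end a sentence, such as the one in a version number, so a production system would likely need additional rules.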
Tokenization cuts a sequence of words in a sentence, paragraph, or page into tokens, that is, single-word chunks. This stage also removes certain characters, namely punctuation marks. Sentences are split into single words by scanning them for white space delimiters (spaces, tabs, and newlines).
Case folding converts all text into the same case; in this study, all text is represented in lowercase letters. Orthographic variation in letter case is thereby removed. Examples of case folding applied in summarization are given in [27] [28].
Stop words are words that carry little meaning on their own, for example, "in", "by", "on", "a", "because", and so on [29] [30]. They are removed because they have no connection with the content of the documents in the database. Other examples of stop words are "there", "is", "while", "somewhat", "he", "I", and "how". Stemming removes a word's prefix or suffix to obtain its basic form. For example, the words "registered" and "registration" reduce to a common stem [31].
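The tokenization, case folding, stop word removal, and stemming steps described above can be sketched as one small pipeline. This is an illustrative sketch under stated assumptions: the stop word list and the suffix list are hypothetical and far smaller than a real one, and the suffix stripper merely stands in for a proper language-specific stemmer.

```python
import re

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"in", "by", "on", "a", "because", "is", "the"}
# Hypothetical suffix list; a real system would use a language-specific stemmer.
SUFFIXES = ("ing", "ed", "s")

def tokenize(sentence: str) -> list[str]:
    """Replace punctuation with spaces, then split on spaces, tabs, newlines."""
    return re.sub(r"[^\w\s]", " ", sentence).split()

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence: str) -> list[str]:
    tokens = [token.lower() for token in tokenize(sentence)]      # case folding
    tokens = [token for token in tokens if token not in STOP_WORDS]  # filtering
    return [naive_stem(token) for token in tokens]                # stemming

terms = preprocess("The data is processed by a Program")
```

Case folding is applied before filtering so that "The" and "the" match the same stop word entry.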
The weight of a term is obtained from the number of its occurrences in a document (term frequency, tf) and from the number of documents in the collection that contain it (inverse document frequency, idf) [32]. The more frequently a word appears in a document, the greater its weight; the more documents it appears in, the smaller its weight. The tf-idf weight is calculated using the formulas in (1) and (2), and the idf value of a term can be calculated as in (3). In these formulas, the idf of a term depends on the number of documents containing the term $t$, and $tf_{d,t}$ is the number of occurrences of the term $t$ in document $d$. The algorithm is used to calculate the weight $W$ of each document against the keywords (query).

$W_{d,t} = tf_{d,t} \cdot IDF_t$ (2)

where $d$ is the d-th document, $t$ is the t-th term of the keyword, $tf_{d,t}$ is the term frequency (word frequency), and $W_{d,t}$ is the weight of the d-th document against the t-th term. After each document's weight $W$ is known, the documents are sorted: the greater the value of $W$, the greater the degree of similarity of the document to the word being searched for, and vice versa.
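The weighting in (2) can be sketched in Python as follows. The exact idf variant is an assumption: the sketch uses the common form log10(N / df), where N is the number of documents and df is the number of documents containing the term, which may differ from the variant used in this study. The function name `tf_idf` and the sample terms are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} map per document, with W = tf * idf."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log10(N / df).
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical preprocessed documents (one token list per sentence).
docs = [["student", "data"], ["student", "report"], ["student", "data", "data"]]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document, such as "student" here, receives a weight of zero, which matches the statement that terms appearing in many documents weigh less.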
After calculating each document's weight $W$, the query relevance weight is calculated using (2). From the query relevance values and rankings obtained in Table 1, the documents with the highest query relevance weight are displayed in order of their ranking, namely D3, D4, D6, D1, D2, D8, D7, and D5. The query relevance value is later compared with the sentence similarity value for summary extraction. The vector space model measures the similarity between a document and a query [33]. In this model, queries and documents are considered vectors in an n-dimensional space, where n is the total number of terms in the lexicon. The lexicon is the list of all the terms in the index. The vectors in the vector space model can also be expanded, and the expansion can be performed on query vectors, document vectors, or both. The model thus captures the relationship between words in the database, the documents, and the keywords [34].
Cosine similarity is used to calculate the relevance of the query to each document. Determining the relevance of a query to a document is treated as measuring the similarity between the query vector and the document vector. The greater the similarity between the query vector and the document vector, the more relevant the query is to the document [35] [36].
When the engine receives a query, it builds a vector $Q = (q_1, q_2, \dots, q_t)$ based on the terms in the query and a vector $D = (d_1, d_2, \dots, d_t)$ of size $t$ for each document. In general, cosine similarity (CS) is calculated using the cosine measure formula [37] [38]. This study also uses cosine similarity to measure the similarity between two documents ($d_i$ and $d_j$). In vector space, a document is represented in the form $d = \{w_1, w_2, w_3, \dots, w_n\}$, where $d$ is the document and $w$ is the weight value of each term in the document.

$CS(Q, D) = \frac{\sum_{t} W_{Q,t} \, W_{D,t}}{\sqrt{\sum_{t} W_{Q,t}^{2}} \, \sqrt{\sum_{t} W_{D,t}^{2}}}$ (4)

where $t$ is a word in the database, $D$ is a document resulting from sentence splitting, and $Q$ is a keyword in the abstract.
The cosine of 0° is 1, and it is less than 1 for every other angle. Thus two vectors with the same orientation have a cosine similarity of 1, and two vectors at 90° have a similarity of 0. Cosine similarity is mainly used in positive space, where the result is bounded in [0, 1].
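The cosine measure can be sketched over sparse term-weight maps as follows. This is a minimal illustration that assumes vectors are stored as Python dictionaries from term to tf-idf weight; the function name `cosine_similarity` and the zero-vector convention are assumptions, not details taken from this study.

```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    # Terms missing from one vector contribute zero to the dot product.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical query and sentence vectors.
query = {"student": 1.0, "promotion": 1.0}
sentence = {"student": 2.0, "promotion": 2.0}
relevance = cosine_similarity(query, sentence)
```

The same function serves both uses described in the text: comparing the query with each sentence for query relevance, and comparing sentences with each other for sentence similarity.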
Cosine similarity is also used to calculate the sentence similarity weights, where each document is compared with the others. The calculation follows the same flow as the query relevance weights, using (4). Table 2 shows the resulting sentence similarity weights. These weights are then used in the MMR iterations, which compare the query relevance weights with the sentence similarity weights. Summary extraction was performed using (5). A document has high marginal relevance if it is relevant to the contents of the document and has the maximum similarity weight with the query. The final value given to sentence $S_i$ in MMR is calculated by (5).

$MMR(S_i) = \lambda \cdot Sim_1(S_i, Q) - (1 - \lambda) \cdot \max_{S'} Sim_2(S_i, S')$ (5)

where $S_i$ is a sentence in the document, $S'$ is a sentence already selected or extracted, and $Q$ is the query. The coefficient $\lambda$ adjusts the combination of the two values to emphasize the sentence's relevance and reduce redundancy. In this study, $Sim_1$ and $Sim_2$ are two similarity functions that represent the similarity of sentences across all documents and are used to choose each sentence for the summary. $Sim_1$ is the similarity matrix of sentence $S_i$ to the query, while $Sim_2$ is the similarity matrix of sentence $S_i$ to the other sentences [31].
The parameter $\lambda$ takes values in the range [0, 1]. When $\lambda = 1$, the MMR value tends to reflect relevance to the original document. When $\lambda = 0$, the MMR value depends only on similarity to the previously extracted sentences. For values of $\lambda$ between these extremes, a linear combination of the two criteria is optimized. For summarizing small documents, such as news, a value of $\lambda = 0.7$ or $\lambda = 0.8$ is recommended because it produces a good summary [39]. To get relevant summary results, we set $\lambda$ to a value closer to 1. The sentence with the highest MMR value is repeatedly selected into the summary until the desired summary size is reached, as in Table 3. Because only values greater than 0 are taken from each iteration, the iteration stops at the 4th iteration, where all values are less than 0. Documents D4, D3, and D6 are therefore considered relevant for the summary. Table 4 shows the maximum MMR weight obtained in each iteration. Iterations are carried out as many times as the number of documents resulting from sentence splitting, but only those with a positive value (between 0 and 1) are taken as the summary. Table 5 lists the documents that form the summary, ordered by the highest MMR weight.
The maximum MMR weights across iterations determine the order of the relevant documents used as the summary: the documents are sorted from the highest to the lowest value, with weights between 0 and 1, and documents with higher values are placed earlier in the summary.
From the results of the maximum marginal relevance calculation, the highest value is taken from all iterations. Documents D4, D3, and D6 are the most relevant and are considered the sentences that best match the keywords (query).
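The iterative selection described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the database-backed procedure of Algorithm 1: the function name `mmr_select`, the in-memory lists, and the sample weights are hypothetical, and the stopping rule (stop once no remaining sentence has a positive score) is one reading of the rule described in the text.

```python
from typing import Optional

def mmr_select(query_rel: list[float],
               sent_sim: list[list[float]],
               lam: float = 0.7,
               max_sentences: Optional[int] = None) -> list[int]:
    """Greedy MMR: return sentence indices in the order they are selected."""
    remaining = list(range(len(query_rel)))
    selected: list[int] = []
    limit = len(query_rel) if max_sentences is None else max_sentences

    def score(i: int) -> float:
        # Redundancy is the highest similarity to any sentence already selected.
        redundancy = max((sent_sim[i][j] for j in selected), default=0.0)
        return lam * query_rel[i] - (1.0 - lam) * redundancy

    while remaining and len(selected) < limit:
        best = max(remaining, key=score)
        if score(best) <= 0.0:  # assumed stopping rule: no positive score left
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical weights: sentences 0 and 1 are near-duplicates.
query_rel = [0.9, 0.8, 0.1]
sent_sim = [[1.0, 0.95, 0.0],
            [0.95, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
summary_order = mmr_select(query_rel, sent_sim, lam=0.5)
```

With these sample weights and `lam=0.5`, sentence 1 is skipped because it is nearly identical to the already selected sentence 0; with `lam=0.7`, relevance dominates and all three sentences are selected in order of query relevance.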
The maximum marginal relevance procedure for summary extraction is shown in Algorithm 1. Lines 1 to 6 of Algorithm 1 drop and create the tables, namely the cosine, mmr, and summary tables. Line 7 then reads the data in the cosine table, and line 8 loops over the records as many times as there are rows. Lines 9 to 15 determine the number of documents stored in the cosine table, based on the value field of each document. Line 16 starts a loop that repeats for the total number of documents. In this iteration, the cosine table is queried with the SQL command in line 17. Lines 18 to 24 perform the MMR calculation, and line 25 stores the results in the mmr table. In addition, data updates in the table are also performed.

D4
In the Student Affairs Section of SMA Negeri 1 Tarakan, the process of class promotion and student majors is still carried out in a simple manner by holding meetings and data is processed and stored using Microsoft Excel, so it takes a long time to calculate the process of class promotion and student majors due to the large number of students that must be handled by SMA Negeri 1 Tarakan, so the quality of the results of the process of increasing majors is less accurate, slow and tends to experience differences in decisions between students.

D3
The Student Affairs Section is also a center for processing student data and is also tasked with determining the process of grade promotion and student majors at SMA Negeri 1 Tarakan.

D6
The application in this program starts from the decision tree process for class increases, the decision tree process for the Science majors (Natural Science), Social Sciences Majors, and Language Majors, Student Entry Processes, Class Promotion Processes, Major Processes, and Reports Class Promotion and Majors Report.
Lines 26 to 31 read the document field from the mmr table and store it in the summary table. Lines 32 to 34 then delete rows from the cosine table based on the document field, drop the mmr table, and create it again. Lines 35 to 37 read the summary table into the variable data_mmr as an array. Line 38 initializes a sentence variable with an empty value, to which the following sentences are appended. Lines 39 to 45 read the summary table with an SQL command that selects rows whose final field is greater than 0, repeating for as many rows as the SQL search returns; each sentence is appended to the previous ones.
Algorithm 1: Maximum marginal relevance for summary extraction
1: alter file cosine drop field value
2: alter file cosine add field value as real
3: drop file mmr
4: create file mmr that has fields document, last
5: drop file summary
6: create file summary that has fields document, last
7: READ all fields and count as field name total (from file cosine)
8: WHILE (row is not empty)
9: total ← GET field total

The summary evaluation is measured by comparing the manual and automated summaries [41]. Manual summaries were obtained from 20 respondents, and the evaluation is calculated with precision, recall, and f-measure values as in (6) to (8), where a correct sentence is one that also appears in the manual summary.

$\text{Precision} = \frac{\#\,\text{correct sentences in the automated summary}}{\#\,\text{sentences in the automated summary}}$ (6)

$\text{Recall} = \frac{\#\,\text{correct sentences in the automated summary}}{\#\,\text{sentences in the manual summary}}$ (7)

$\text{f-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Recall} + \text{Precision}}$ (8)
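The evaluation measures in (6) to (8) can be sketched as follows. The sketch assumes that summaries are compared as sets of sentence identifiers, which is one plausible reading of the comparison; the function name `evaluate` and the manual summary in the example are hypothetical.

```python
def evaluate(system: set[str], manual: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f-measure) for one automated summary."""
    correct = len(system & manual)  # sentences found in both summaries
    precision = correct / len(system) if system else 0.0
    recall = correct / len(manual) if manual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Automated summary from the worked example; the manual summary is hypothetical.
system_summary = {"D4", "D3", "D6"}
manual_summary = {"D4", "D3", "D1", "D2"}
precision, recall, f_measure = evaluate(system_summary, manual_summary)
```

In this hypothetical case, two of the three extracted sentences also appear in the manual summary, so precision is 2/3 and recall is 2/4.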

III. Result and Experiments
The data used in the experiments consisted of 200 abstracts of student final projects and theses obtained from the STMIK PPKIA Tarakan Library. Testing is done by entering the contents of a student's final project abstract together with its keywords, which serve as the query. Sentences taken as a summary represent the query and have a maximum MMR [40] weight between 0 and 1. The more words a sentence shares with the query, the greater its chance of being retrieved as part of the summary. Table 6 shows an example of an evaluation calculation using three documents taken randomly from the abstract data. The summarization results were compared with the respondents' manual summaries. Table 7 shows the resulting precision, recall, and f-measure values; on average, these example calculations give a precision of 61%, recall of 72%, and f-measure of 66%. Table 8 summarizes the results for the 200 abstract documents of student final assignments. The summaries produced by this model were then compared with manual summaries of the 200 abstract documents written by 20 people. Table 9 shows that this comparison yields an average recall of 61%, precision of 88%, and f-measure of 70%. Table 10 compares an MMR summary with the output of another model [41]. As seen in Table 10, Bert2-GPT-Id produces a shorter summary than MMR, whereas the MMR summary is more comprehensive than the baseline. In other words, MMR performs better than Bert2-GPT-Id in this example.

Table 10. An example of MMR test compared to Bert2-GPT-Id

Abstract: Modeling is a real system representation of objects by taking a mathematical form and a logical relation. In general, a simulation is defined as a dynamic representation of a portion of the real world using a computer and running at a certain time. One of the modeling techniques is Discrete Event Simulation (DES), modeling a system that changes every unit of time. This method is stochastic, dynamic, and discrete-event. Many fast food restaurants offer a variety of menus and services to satisfy consumers. Kentucky Fried Chicken Restaurant, Tarakan Branch, is one of the most popular fast food restaurants. The increasing number of users of delivery services and different distances, of course, the travel time is also different, resulting in the emergence of new problems in the delivery process. The problem that often occurs at the Tarakan Branch KFC Restaurant is that at certain times KFC receives orders from very many consumers and can make the process of sending orders to consumers slow due to limited employees who specifically handle message delivery services. This will create a queue in the process of sending orders. In this final project, a discrete event simulation model will be implemented using a combination of Fixed-Increment Time Advance and Next-Event Time Advance to overcome problems that occur at the Kentucky Fried Chicken restaurant, Tarakan Branch, using the Delphi 7.0 programming language.

MMR: In this final project, a discrete event simulation model will be implemented using a combination of Fixed-Increment Time Advance and Next-Event Time Advance to overcome problems that occur at the Kentucky Fried Chicken restaurant, Tarakan Branch, using the Delphi 7.0 programming language.

Bert2-GPT-Id: The simulation is based on the subject of a real system of objects using a computer and runs at a certain time.

Keywords: discrete events simulation, message delivery, kentucky fried chicken

IV. Conclusion
Several conclusions can be drawn from the discussion and experiments in this study. The documents with the highest maximum marginal relevance values are taken as the summary. The sentences taken as a summary reflect both the similarity of sentences to the query and the similarity between sentences in the document. The maximum marginal relevance calculation iterates over combinations of the query relevance and sentence similarity matrices. The query relevance weight results from comparing the similarity between the query and the documents, while the sentence similarity weight results from comparing the similarity between documents. The vector space model is used for query relevance, and cosine similarity is used for sentence similarity.
From the lambda test, which compared lambda values of 0.3, 0.8, and 0.9, it can be concluded that a lambda value closer to 1 produces a more relevant summary. The experiments yield an average precision of 88%, recall of 61%, and f-measure of 70%, based on a comparison between the automated and the manual summaries. The test data was taken from 200 student final assignments and thesis documents, which were used for both the automated and the manual summaries; the summarization results were compared with the manual summaries to assess accuracy. Moreover, the time needed to summarize one document depends on the number of sentences obtained from document splitting: the more sentences in the document, the longer it takes to summarize. Several issues remain for future work. The comparison with the manual summaries shows that several abstracts have a low f-measure value because the query sometimes does not describe the content, and the retrieved sentences are not always in a good sentence order. It is also recommended to use a generator for abstract keywords. Quality measurement with other parameters, such as F-Score and NMI, is also possible in future research.

Table 2. Sentence similarity weight values

Table 4. The result of the maximum MMR iteration weight

Table 6. Automated summarization and manual summarization

Table 7. Results of example calculations of precision, recall, and f-measure comparing the automated summarization with the manual summary

Table 9. Comparison results