Generating Javanese Stopwords List using K-means Clustering Algorithm

Stopword removal is necessary in Information Retrieval. It removes frequently occurring, general words to reduce memory usage. The algorithm eliminates each word that exactly matches a word in the stopword list. However, generating the list can be time-consuming: the words in a specific language and domain must be collected and validated by specialists. This research aims to develop a new way to generate a stopword list using the K-means clustering method. The proposed approach groups words based on their frequency. A confusion matrix is used to compare the results with a valid stopword list created by a Javanese linguist. The accuracy of the proposed method is 78.28% (K=7). The result shows that the generation of Javanese stopword lists using a clustering method is reliable.

II. Materials and Methods
The goal of this study is to generate a stopword list from a Javanese corpus. The selected Javanese level of politeness is Ngoko, chosen for its usage and vocabulary [11] [12]. Figure 1 shows the four stages of this research.
The first stage is data collection. The dataset was taken from the Javanese Short Stories category of the website Ki-demang.com. The data consists of 106 stories, excluding page numbers and titles. The stories are combined into a single text document, which is used as the stopword generation dataset.
The second stage is data preprocessing: case folding, punctuation removal, tokenizing, and filtering. The first step, case folding, changes uppercase letters into lowercase letters. Punctuation removal deletes the punctuation characters and numbers from the dataset. The tokenizing step then splits the dataset into individual words. This step produces 17,763 word types and their frequencies. The tokenizing result is then filtered to remove typographical errors, words without meaning, names, and non-Ngoko words, leaving 14,384 types. This deletion is based on a Javanese-Indonesian and Indonesian-Javanese translation dictionary. Table 1 shows examples of deleted words.
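The first three preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names `preprocess` and `word_frequencies` are hypothetical, tokenizing is assumed to be plain whitespace splitting, and the dictionary-based filtering step is not shown.

```python
import re
from collections import Counter


def preprocess(text):
    """Case folding, punctuation/number removal, and tokenizing."""
    text = text.lower()                          # case folding
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)   # drop punctuation and digits
    return text.split()                          # whitespace tokenizing (assumed)


def word_frequencies(text):
    """Count how often each word type occurs in the text."""
    return Counter(preprocess(text))


sample = "Aku lan kowe, aku 2 lunga!"
print(preprocess(sample))        # ['aku', 'lan', 'kowe', 'aku', 'lunga']
print(word_frequencies(sample))
```

The resulting word-to-frequency mapping is the kind of input the clustering stage works on.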
The dataset of 14,384 different words is submitted to Javanese linguists. The linguists group the dataset into two classes, namely stopwords and non-stopwords. The general words (such as conjunctions) classified as stopwords number 3,224. The non-stopwords consist of 11,160 specific words: nouns, verbs, and adjectives. Table 2 shows examples of the two categories.
The third stage is clustering the 14,384 unique words and their frequency. Figure 2 shows the pseudocode of the k-means clustering method [16].
The first k-means clustering stage determines the k value, that is, the number of clusters. In this study, the tested k values are 3, 5, 7, 9, 11, 13, and 15 [17]. The next step calculates the distance between each data point and the centroids using Euclidean distance [18].
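As an illustration of this stage, the following is a minimal sketch of k-means on one-dimensional data (word frequencies), where the Euclidean distance reduces to an absolute difference. It is not the authors' code. The function name `kmeans_1d`, the `seed` and `max_iter` parameters, and the renumbering of clusters by ascending centroid (so that cluster 0 holds the lowest frequencies) are all assumptions made for this example; the source does not state how clusters are numbered.

```python
import random


def kmeans_1d(values, k, max_iter=100, seed=0):
    """Cluster scalar values with k-means; return (labels, centroids)."""
    rng = random.Random(seed)
    # Pick k distinct values as the initial centroids.
    centroids = rng.sample(sorted(set(values)), k)

    def assign(points, centers):
        # In one dimension the Euclidean distance is the absolute difference.
        return [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
                for p in points]

    for _ in range(max_iter):
        labels = assign(values, centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:   # centroids stopped changing
            break
        centroids = updated

    labels = assign(values, centroids)
    # Assumption: renumber clusters by ascending centroid value.
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {old: new for new, old in enumerate(order)}
    return [rank[lab] for lab in labels], [centroids[c] for c in order]


frequencies = [1, 1, 2, 1, 2, 50, 55, 60, 1, 2]   # toy data, not from the paper
labels, centroids = kmeans_1d(frequencies, k=2)
print(labels)   # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

On real data the outcome depends on the random initial centroids, so a fixed seed is used here only to make the sketch repeatable.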
Here, the results of each case are mapped to two classes: stopwords and non-stopwords. All words in cluster 0 are labeled as non-stopwords, while all words in the other clusters are labeled as stopwords. For example, if k=7, the words in clusters 1 to 6 are stopwords, while the rest (in cluster 0) are non-stopwords. This first assumption is based on the observation that words with high frequency [19] are outside cluster 0. Table 3 illustrates one example of the frequency distribution of stopwords when k=7. In this case, 680 words are labeled as stopwords, whereas 13,704 words are non-stopwords.
The fourth stage is evaluation, which aims to test the performance of the proposed method. The opinion of experts is used as a reference. A confusion matrix is applied to calculate accuracy and precision [20]. At this stage, all cases are tested to decide the best stopword set based on the k-means clustering technique.
Accuracy is obtained by dividing the number of correctly classified words by the total number of words [21]. A classification is correct when the clustering result has the same class as the reference. Precision, on the other hand, is the ratio of true positives (TP) to the sum of true positives and false positives (FP) [21]. TP means that the clustering result is a stopword and the reference agrees. FP means that the predicted result is a stopword while the reference is a non-stopword.
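The labeling rule and the two metrics can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' evaluation code: the helper names are hypothetical, `True` stands for "stopword", and the toy cluster ids and reference labels are invented for the demonstration.

```python
def labels_from_clusters(cluster_ids, stopword_clusters):
    """Map cluster ids to booleans; True means 'stopword'."""
    return [c in stopword_clusters for c in cluster_ids]


def confusion_counts(predicted, reference):
    """Return (TP, FP, FN, TN) with 'stopword' as the positive class."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy example: cluster 0 is non-stopword, every other cluster is stopword.
cluster_ids = [0, 0, 3, 1, 0, 2]
reference = [False, False, True, False, True, True]   # hypothetical expert labels
predicted = labels_from_clusters(cluster_ids, stopword_clusters={1, 2, 3})
tp, fp, fn, tn = confusion_counts(predicted, reference)
print(tp, fp, fn, tn)   # 2 1 1 2
```

Passing `stopword_clusters={0}` instead would express the opposite labeling, in which only cluster 0 is treated as stopwords.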

Output:
A set of k clusters.

Step 1: Randomly select k centroids from D as the initial centroids (centers of the initial clusters).
Step 2: Assign each item to the cluster that has the closest cluster center; calculate new averages for each cluster.
Step 3: Repeat step 2 until the cluster centroid values do not change or until the maximum number of iterations is reached.
Fig. 2. K-means clustering algorithm

Table 4 shows the performance of the stopword list using the k-means algorithm. The accuracy and precision represent the method's performance by comparing the result with the Javanese linguists' manual classification.

III. Results and Discussions
In Table 4, the highest accuracy is 78.2%, with 57.3% precision. This result is obtained with k = 7. The result consists of 680 stopwords and 13,704 non-stopwords, while the experts identify 3,224 and 11,160 words in the same categories. The clustering correctly classifies 11,030 of 14,384 words, most of which belong to the non-stopword category. Figure 3 shows the distribution of the words based on the first assumption, namely that the first cluster contains the non-stopwords.
As seen in Figure 3, experts recognize most words as non-stopwords. K-means wrongly categorized some non-stopwords as stopwords (the area within the gray line). On the other hand, the precision is 57.3% of the orange and gray areas, which means that most stopwords are categorized as non-stopwords. The lowest performance occurs when k=5, with an accuracy of 25% and a precision of 21.7%. Only 3,089 words are true stopwords, and 65 words are true non-stopwords.
The second assumption is then applied for comparison. Table 5 shows the result, assuming that the first cluster contains the stopwords, while the rest are non-stopwords. The best performance in Table 5 occurs when k = 5, where the accuracy is 78.07% and the precision is 67.5%. This case yields 135 true stopwords and 11,095 true non-stopwords; the precision of 67.5% corresponds to 135 correct stopwords out of 200 predicted stopwords.
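The k = 5 figures quoted above can be checked with a short calculation. The sketch below uses only numbers stated in the text (14,384 words in total, 135 true stopwords, 11,095 true non-stopwords, 200 predicted stopwords); the variable names are chosen for this example.

```python
TOTAL_WORDS = 14_384
TRUE_STOPWORDS = 135           # predicted stopword, reference agrees (TP)
TRUE_NON_STOPWORDS = 11_095    # predicted non-stopword, reference agrees (TN)
PREDICTED_STOPWORDS = 200      # TP + FP

# False positives follow from the predicted total.
false_positives = PREDICTED_STOPWORDS - TRUE_STOPWORDS

accuracy_pct = 100 * (TRUE_STOPWORDS + TRUE_NON_STOPWORDS) / TOTAL_WORDS
precision_pct = 100 * TRUE_STOPWORDS / PREDICTED_STOPWORDS

print(f"accuracy = {accuracy_pct:.2f}%, precision = {precision_pct:.1f}%")
# accuracy = 78.07%, precision = 67.5%
```

Both values match the reported accuracy and precision for this case.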
The accuracy of both scenarios (Table 4 and Table 5) is similar. However, the precision of the best scenario (k=5) in Table 5 is higher than the best of Table 4 (k=3). This means that the second assumption is more promising than the first in recognizing stopwords. Therefore, k-means locates stopwords in the first cluster, while the non-stopwords are in the other clusters.

IV. Conclusion
K-means is applicable for Javanese stopword list generation. The algorithm indicates that the stopwords are located in the first cluster of the word list. However, the current promising result can still be improved. Further research should consider the balance of the frequency distribution and the implementation of word stemming in the preprocessing. The use of more training data may balance the frequency, while stemming may merge related word forms and unite the occurrences of the merged words.