Indonesian Sentence Boundary Detection using Deep Learning Approaches

Article history: Received 7 February 2021; Revised 22 May 2021; Accepted 21 June 2021; Published online 17 August 2021

Detecting sentence boundaries is one of the crucial pre-processing steps in natural language processing. The task is needed because the border between one sentence and the next can be ambiguous. Because there are multiple separators and dynamic sentence patterns, treating a full stop as the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split each sentence from an Indonesian news document, so there is no need to define any handcrafted features or rules. As in Part of Speech Tagging and Named Entity Recognition, we use sequence labeling to determine sentence boundaries. Two labels are used, namely O for a non-boundary token and E for the last token in a sentence. For this we use the Bi-LSTM approach, which has been widely used in sequence labeling. We show that our approach works for Indonesian text using pre-trained Indonesian embeddings, as in previous studies. This study achieved an F1-score of 98.49 percent, a significant increase over the performance reported in previous studies.

This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

detection can help the pre-processing phase and further improve the performance results. We choose the Deep Learning approach to simplify the learning process without crafting any rules by hand, as is required in the traditional machine learning approach. We use Bidirectional LSTM because of its ability to remember long-term sequences in both directions. With this model, we do not need handcrafted features as in previous research; the model only needs the word tokens.
There are several reasons why we conduct this research for Bahasa Indonesia. The main reason is the limited availability of tools and resources. Moreover, there is a need for sentence tokenization, which Natural Language Processing approaches can use as a basis for further tasks. Sentence Boundary Detection is crucial as a pre-processing phase of many natural language processing tasks. One use is in Simultaneous Translation, where Sentence Boundary Detection can detect sentences before the translation process [11]. Sentence Boundary Detection is also needed for chatbots [12], machine translation, named entity recognition, and coreference resolution [13].
Our contribution is aimed directly at text processing in Bahasa Indonesia. The result of sentence boundary detection can be used for extracting information or for solving other natural language processing problems. To our knowledge, we are the first to propose Sentence Boundary Detection with Deep Learning in Indonesian. After a document is tokenized, it is sometimes ambiguous whether a punctuation mark ends a sentence or not. In this research, a sequential learning method is used to classify whether each token marks the end of a sentence. We use Deep Learning to provide a crucial text preprocessing step that detects each sentence in a text document. Our sentence boundary detector can be used as a feature extractor for later tasks. Furthermore, we show that the deep learning model is capable of detecting sentence boundaries. Our approach achieves a higher F1 score than the previous approach, with no need to build any handcrafted rules.

II. Method
This section explains the steps of our research framework. We first describe how we build our corpus for sentence boundary detection, from obtaining the raw data to producing a labeled dataset, followed by a discussion of the proposed architecture. The discussion is divided according to each architecture layer: input layer, Bidirectional LSTM cells, and output layer. This section also explains the optimization method used.

A. The problem in Sentence Boundary Detection for Bahasa Indonesia
This section explains some problems that occur when detecting sentence boundaries for Bahasa Indonesia [33]. All of them stem from the ambiguity that a punctuation mark does not always end a sentence [34]. We list each problem with a few examples and discuss several points to explain each case.
The first problem is the writing of titles and degrees. When writing someone's title, the writer often uses the short version of the title or degree. As seen in the first example, "H" is a title for someone who has made the pilgrimage, and "Ir" is an academic degree for an engineering major. The title "H" stands for "Haji" and the title "Ir" stands for "Insinyur". This example shows the use of a full stop mark to shorten the title or degree. The full stop mark in the title or degree does not end the sentence. The case is different when the title or degree is placed at the end of the sentence. In the second example, the full stop mark in "Kom." ends the sentence because it is the last word.
The second problem, abbreviation of names, arises when writing a long name. The writer usually shortens the name by using the first character of each word, followed by a full stop mark. This abbreviation is written in uppercase letters. It is hard to list all name abbreviations because many names are used in the document collection. The full stop mark in a name abbreviation does not end a sentence. However, it ends the sentence if the abbreviation is placed at the end of the sentence. This case is similar to the first case above, which concerns writing someone's title and degree. As seen in the first example, the writer uses "W", which stands for "Widodo", to shorten the name. In the second example, the full stop mark after "S", which stands for "Santoso", ends the sentence because it is the last word.

Kelas kami kedatangan alumni bernama Joan S.
Our class is visited by an alumnus named Joan S.
The third problem is related to common abbreviations. There are some standard abbreviations used in Bahasa Indonesia, for example: "a.n" (atas nama / by the name of), "s.d." (sampai dengan / until), "d.a." (dengan alamat / placed in), "jl." (jalan / street), "hlm." (halaman / page), etc. A full stop mark in this kind of abbreviation does not end a sentence. Usually, the writer uses these abbreviations in the middle of the sentence. The first example shows that "tgl" is shortened from the original word "tanggal". The second example uses the abbreviation "s.d." to shorten the original words "sampai dengan". In the third example, the writer could write the original word "Jalan" or just the shorter "Jl.".
He will go to Ngagel Jaya Street.
The time separator is the fourth problem. Time can be separated using punctuation marks. The full stop mark in the time separator does not end the sentence. In the first example, the time expression "10.30" does not end its sentence. The full stop only separates the hours (10) from the minutes (30). The second example also uses a full stop mark to separate hours and minutes. Both examples show the use of the full stop mark for time expressions in a sentence.
The next problem is the money separator, since amounts of money can be written using punctuation marks. In the first example, the full stop mark in the expression "100.000" does not end the sentence. It separates the digits of the amount. In Bahasa Indonesia, people usually separate money amounts every three digits to make them easier to read. The second example also expresses the money format, this time with the currency included. "Rp." is the formal way to write Indonesian currency. There is another way to express money in Bahasa Indonesia, as shown in the third example. The only difference is the use of ",-" to end the money expression.
Another problem is the number separator. A full stop mark is used to separate a number per thousand. It is used not only when expressing money but also when writing any number. For example, "1.123" in the first example contains a full stop mark that separates the digits in the number of people who died from the earthquake. The second example shows the use of a full stop mark to separate the digits in the number of smartphones. Almost any numeric expression uses a full stop mark to separate thousands. As with money, this separation makes the number easier to read and understand.
There are 1,500,000 smartphones connected to our server.
Email-formatted text can be problematic because it contains more than one full stop mark. The first example shows standard email-formatted text. However, the second example shows that an email address can contain an arbitrary number of full stop marks, because users can freely choose a custom name for their email. The third example shows that there are many informal ways to write an email address. In this case, building rules for each variant is time-consuming. Moreover, email-formatted text can also be written as in the fourth example, where the writer uses "dot" instead of a full stop mark.

Email dia adalah christian.np at indocl dot stts dot edu.
His email is christian.np at indocl dot stts dot edu.
The eighth problem is username-formatted text. Sometimes the writer quotes from social media and includes the username. There is no limit on the number of full stop marks in a username, and a full stop in a username does not end the sentence. The first example shows the use of a full stop mark in the usual username "@christian.np". A username can also contain many full stop marks, as in the second example, where "@christian.n.p.stts.sby" contains several of them. This case rarely happens, but it is still possible for a username to have many full stop marks.

Akun @christian.np juga mengatakan hal yang serupa.
Account @christian.np also said the same thing.
2. @christian.n.p.stts.sby @joan.s. Ayo pergi ke Bali bulan depan!
@christian.n.p.stts.sby @joan.s. Let's go to Bali next month!
Sentence emphasis is often used when the author wants to emphasize some meaning in the text. This kind of writing is often found in drama scripts, where it expresses feeling through the writing. In addition, the writer can combine many different punctuation marks according to his or her creativity. The same applies to free-form text from social media. Chats, comments, and posts on social media do not follow fixed writing rules. Users can write anything based on current trends, which is a problem because the rule for splitting sentences changes over time.
Sometimes multiple punctuation marks are combined into a single token. Usually, the last punctuation mark in the token is the one that ends the sentence. A question mark may not finish a sentence when it appears together with another punctuation mark, as in "?!". As seen in Figure 1, the question mark after the word "Surabaya" does not end the sentence; the exclamation mark after the question mark is the one that ends it. Likewise, a single punctuation mark may not end the sentence if it is followed directly by another punctuation mark. In the token "!!!", the last exclamation mark is the one that ends the sentence.
The next problem happens in dialogue text. Some conversations may consist of multiple sentences. When we split them up, we lose their context, which is needed to determine who the sentences belong to. Each sentence appears to have its own context, but they all share the same one. We adopt the convention that all words spoken by a person at a particular time are counted as a single sentence, even if there is more than one sentence inside. This convention may differ from other sentence tokenizer tools, where the text is tokenized based on the end of each sentence, not the context of the whole text.
"Hi! My name is Christian NP. I am glad to know you!" said Christian.
The first example is a common writing style in which dialogue contains only one sentence. The second example is more complex than the previous example. It consists of three different sentences, which are "Hai!", "Nama saya Christian NP.", and "Bagaimana harimu?". We count these three sentences as a single sentence, together with the main sentence. The context is the same because they are all spoken by one person at a particular time.
The last problem is the non-punctuation token. When analyzing our dataset, we found that the end of a sentence is not always a punctuation mark; a sentence may end with an ordinary word. This case usually happens when converting a list to plain text. Each point in a list can end either with a punctuation mark, such as a full stop mark, or with just a word. Figure 2 [35] shows the output of sentence tokenization from a list. The colon mark ":" ends the first sentence, which describes the list. The remaining sentences are split according to the numbering of the list. As we can see, the token that ends each of these sentences differs. It may be "Widjojanto", "Husein", "Hehamahua", or another word that ends the sentence. In addition, the full stop mark after each index is kept as part of the current sentence. These numbers are used as an index and do not end the sentence.
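The cases above can be illustrated with a small sketch. It assumes a naive rule that treats every full stop, question mark, or exclamation mark as a sentence boundary; the function name `naive_split` is hypothetical and is not part of the proposed method. The inputs are example sentences taken from this section.

```python
import re

def naive_split(text):
    """Naive baseline: cut after every run of '.', '!' or '?'."""
    pieces = re.findall(r"[^.!?]+[.!?]*", text)
    return [piece.strip() for piece in pieces if piece.strip()]

examples = [
    "Akun @christian.np juga mengatakan hal yang serupa.",
    "Email dia adalah christian.np at indocl dot stts dot edu.",
]

for sentence in examples:
    # Each example is a single sentence, yet the naive rule cuts it
    # inside the username or email address.
    print(naive_split(sentence))
```

Both inputs are single sentences, but the naive rule splits each of them at the full stop inside "christian.np". This is the kind of error that motivates learning the boundaries from labeled data rather than from punctuation rules.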

B. Data Preparation
Our corpus is built from Indonesian news documents. All news is crawled from two news sites, Detik and Kompas, and we crawled those sites from 2011 until 2012. Each news article is then extracted and parsed to get the text. We remove unused information like ads, pictures, video, and audio because we only need plain text. Then, we conduct post-processing, which converts all list types to readable text and performs tokenization at the word level. The product is a sequence of tokens, each of which is either a word or a punctuation mark. In the last step, we split each sentence manually for all documents. There are 14,142 sentences in total from all documents. Figure 3 [36] displays an example of the data preparation process. The rest of the dataset follows the same process. On the left side is the original HTML-formatted text from Detik news. On the right side is the result, with 20 sentences in total. Each sentence is split based on its context (as discussed in section 2). The last part contains list-type text, which is separated point by point. The numbering is essential information for further tasks.
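As a minimal sketch of how manually split sentences could be turned into a labeled token sequence, the following assumes each sentence is already a list of tokens and assigns "E" to the last token and "O" to all others. The function name `label_tokens` and the exact tokenization of the examples (for instance, the final full stop as a separate token) are assumptions for illustration.

```python
def label_tokens(sentences):
    """Flatten tokenized sentences into parallel token and label lists.

    The last token of every sentence is labeled "E" (end of sentence);
    every other token is labeled "O".
    """
    tokens, labels = [], []
    for sentence in sentences:
        if not sentence:
            continue  # skip empty sentences
        tokens.extend(sentence)
        labels.extend(["O"] * (len(sentence) - 1) + ["E"])
    return tokens, labels

# Assumed tokenization of two example sentences from Section A.
sentences = [
    ["Akun", "@christian.np", "juga", "mengatakan", "hal", "yang", "serupa", "."],
    ["Kelas", "kami", "kedatangan", "alumni", "bernama", "Joan", "S."],
]
tokens, labels = label_tokens(sentences)
for token, label in zip(tokens, labels):
    print(token, label)
```

Note that the second sentence ends with the token "S." rather than a standalone punctuation token, which is why the label is attached to the last token and not to a punctuation mark.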

C. Sequence Classification for Sentence Boundary Detection
Long Short-Term Memory (LSTM) is well established for Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), and Noun Phrase Chunking. Sentence boundary detection can be seen as a sequence classification problem, where we want to label every timestep of the input, or as a collocation identification problem [37]. Every input token is predicted as the end of a sentence or not, based on the previous tokens. We build an architecture based on the nature of the problem. We pay more attention to the whole sequence than to individual predictions. Thus, we use bidirectional LSTM to capture the sequential features from both directions (left to right and right to left). Figure 4 is the visualization of our system architecture. We divide our architecture into three different layers: input layer, sequence learning, and output layer. The input is a sequence of tokens from a single sentence, and the output is a sequence of labels. The input layer simply converts each token into a vector using word embedding. The word vectors are then passed to the sequence learning layer, for which we use Bidirectional LSTM. In the end, all predicted results are converted into final predictions in the output layer. This prediction indicates which tokens are identified as the end of a sentence and which are not. We also use an optimization method, the Adam optimizer, to help the learning process.
Our proposed input layer is a token embedding, which converts each input token into a vector. Every token is a string, which can be either a word or a punctuation mark. We use Skip-Gram Word2Vec as our embedding model. Skip-Gram Word2Vec is capable of giving a semantic representation of a token. It is also capable of capturing the similarity of context between different words. However, Word2Vec has a drawback when handling unknown words: it cannot provide a vector representation for a word that was not seen during training. To counter this problem, we use a random trained vector to represent every unknown word.
Sequence learning is used to predict the outputs from the given inputs. We use bidirectional LSTM, which uses two different LSTM cells. One cell acts as a forward learner and the other as a backward learner. The forward LSTM reads input from the first token to the last token, and the backward LSTM reads input from the last token to the first token. The results from both cells are concatenated. The gray circle in the figure denotes the input to the LSTM cell. The colored circles denote the gates in the LSTM cell: yellow for the activation gate, green for the input gate, red for the forget gate, and blue for the output gate. The last one, light blue, is the cell state, which holds long-term memory from several previous calculations.
Equations (1) to (6) are the mathematical functions for the forward LSTM cell. (1) is the activation gate, (2) is the input gate, (3) is the forget gate, (4) is the output gate, (5) is the cell state, and (6) is the prediction from the forward LSTM cell. Equations (7) to (12) are the corresponding functions for the backward LSTM cell. (13) is the final prediction of both LSTM cells, which uses a concatenation function to combine the two vectors.
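The bidirectional reading described above can be sketched in plain Python. This is only an illustration under assumptions: it uses a standard LSTM cell with randomly initialised weights, and the helper names (`make_cell`, `lstm_step`, `run`, `bilstm`) and sizes are hypothetical rather than the authors' implementation. It shows how a forward pass and a backward pass over the same token vectors are concatenated per token.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_cell(input_size, hidden_size, rng):
    """Random weights for the four LSTM transforms: a, i, f, o."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: {"W": matrix(hidden_size, input_size),
                   "U": matrix(hidden_size, hidden_size),
                   "b": [0.0] * hidden_size}
            for name in ("a", "i", "f", "o")}

def affine(p, x, h):
    """Compute W x + U h + b, one output value per hidden unit."""
    return [sum(w * xj for w, xj in zip(w_row, x))
            + sum(u * hj for u, hj in zip(u_row, h)) + b
            for w_row, u_row, b in zip(p["W"], p["U"], p["b"])]

def lstm_step(cell, x, h, c):
    a = [math.tanh(v) for v in affine(cell["a"], x, h)]  # activation gate
    i = [sigmoid(v) for v in affine(cell["i"], x, h)]    # input gate
    f = [sigmoid(v) for v in affine(cell["f"], x, h)]    # forget gate
    o = [sigmoid(v) for v in affine(cell["o"], x, h)]    # output gate
    c_new = [fk * ck + ik * ak for fk, ck, ik, ak in zip(f, c, i, a)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def run(cell, vectors, hidden_size):
    """Run one LSTM cell over the sequence, returning one output per step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in vectors:
        h, c = lstm_step(cell, x, h, c)
        outputs.append(h)
    return outputs

def bilstm(forward_cell, backward_cell, vectors, hidden_size):
    """Concatenate forward and backward outputs for every token."""
    fwd = run(forward_cell, vectors, hidden_size)
    # Read the sequence right to left, then restore the original order.
    bwd = run(backward_cell, vectors[::-1], hidden_size)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
hidden = 2
forward_cell = make_cell(3, hidden, rng)
backward_cell = make_cell(3, hidden, rng)
token_vectors = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
outputs = bilstm(forward_cell, backward_cell, token_vectors, hidden)
print(len(outputs), len(outputs[0]))
```

Each token ends up with a vector of twice the hidden size: the first half depends only on the tokens to its left, the second half only on the tokens to its right.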
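The handling of unknown words can be sketched as a lookup with a fallback. The sketch assumes that a single shared random vector stands in for every out-of-vocabulary token; the toy `pretrained` table, the function name `build_lookup`, and the initialisation range are hypothetical. In the approach described above that vector would also be trained, which is not shown here.

```python
import random

def build_lookup(pretrained, dim, seed=0):
    """Return a function mapping a token to its embedding vector.

    Tokens missing from `pretrained` all share one random vector
    (assumption: one shared unknown-word vector, initialised randomly).
    """
    rng = random.Random(seed)
    unknown_vector = [rng.uniform(-0.25, 0.25) for _ in range(dim)]

    def lookup(token):
        return pretrained.get(token, unknown_vector)

    return lookup

# Toy table standing in for Skip-Gram Word2Vec vectors (values are made up).
pretrained = {
    "Akun": [0.1, 0.3, -0.2],
    ".": [0.0, -0.5, 0.4],
}
lookup = build_lookup(pretrained, dim=3)
vectors = [lookup(token) for token in ["Akun", "@christian.np", "."]]
print(vectors)
```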
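For reference, one common way to write these functions is given below. It assumes the standard LSTM formulation; the symbols (input vector $x_t$, weight matrices $W$ and $U$, bias $b$, sigmoid $\sigma$, element-wise product $\odot$) are assumptions and may differ from the authors' notation. Only the forward cell and the concatenation are shown. Under the same assumption, equations (7) to (12) mirror (1) to (6) with the backward state $\overleftarrow{h}_{t+1}$ in place of $\overrightarrow{h}_{t-1}$ and a separate set of weights.

```latex
\begin{align}
\overrightarrow{a}_t &= \tanh\left(W_a x_t + U_a \overrightarrow{h}_{t-1} + b_a\right) \tag{1} \\
\overrightarrow{i}_t &= \sigma\left(W_i x_t + U_i \overrightarrow{h}_{t-1} + b_i\right) \tag{2} \\
\overrightarrow{f}_t &= \sigma\left(W_f x_t + U_f \overrightarrow{h}_{t-1} + b_f\right) \tag{3} \\
\overrightarrow{o}_t &= \sigma\left(W_o x_t + U_o \overrightarrow{h}_{t-1} + b_o\right) \tag{4} \\
\overrightarrow{c}_t &= \overrightarrow{f}_t \odot \overrightarrow{c}_{t-1} + \overrightarrow{i}_t \odot \overrightarrow{a}_t \tag{5} \\
\overrightarrow{h}_t &= \overrightarrow{o}_t \odot \tanh\left(\overrightarrow{c}_t\right) \tag{6} \\
h_t &= \left[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t\right] \tag{13}
\end{align}
```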
The output layer converts every vector result from the sequence learner to be the predicted label using the Softmax function. The function provides a probability distribution for each label and then outputs the label with the largest probability. The output labels are "E" as "EndOfSentence" and "O". Label "E" or "EndOfSentence" means the current token is the ending of a sentence. Label "O" (Others) represents that the current token does not end a sentence. Because of its sequential nature, every token input will have a single output label.
Fig. 4. System architecture
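The output step can be sketched as follows, assuming each token already has a pair of raw scores, one per label. The label order ("O", "E"), the score values, and the function names are assumptions made for illustration.

```python
import math

LABELS = ("O", "E")  # assumed order of the two output units

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(score_rows):
    """Pick the most probable label for every token."""
    labels = []
    for scores in score_rows:
        probs = softmax(scores)
        labels.append(LABELS[probs.index(max(probs))])
    return labels

# Made-up scores for three tokens; the last one favours "E".
print(predict_labels([[2.0, 0.1], [1.2, -0.3], [0.3, 1.5]]))
```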
In this research, we choose Adam optimizer to obtain an appropriate gradient for each weight in networks. Adam combines adaptive learning rate and momentum. Technically, every weight is updated by using gradient calculated with Adam. Algorithm 1 is the pseudocode of the Adam optimizer. The default value for each hyperparameter is based on the original paper in [38].
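A single Adam update can be sketched in plain Python as below. The hyperparameter defaults (learning rate 0.001, beta1 0.9, beta2 0.999, epsilon 1e-8) are the ones suggested in the original Adam paper; the function name `adam_step` and the list-based weight layout are assumptions for illustration, not the authors' code.

```python
import math

def adam_step(weights, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    new_weights, new_m, new_v = [], [], []
    for w, g, m_k, v_k in zip(weights, grads, m, v):
        m_k = beta1 * m_k + (1 - beta1) * g        # momentum (first moment)
        v_k = beta2 * v_k + (1 - beta2) * g * g    # adaptive scale (second moment)
        m_hat = m_k / (1 - beta1 ** t)             # bias correction
        v_hat = v_k / (1 - beta2 ** t)
        new_weights.append(w - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_k)
        new_v.append(v_k)
    return new_weights, new_m, new_v

# Toy usage: minimise f(w) = w^2, whose gradient is 2w.
weights, m, v = [1.0], [0.0], [0.0]
for step in range(1, 4):
    grads = [2 * w for w in weights]
    weights, m, v = adam_step(weights, grads, m, v, step)
print(weights)
```

The first moment supplies the momentum and the second moment supplies the per-weight adaptive learning rate mentioned above.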

III. Results and Discussions
We conducted several experiments to test the capability of our proposed architecture. We provide some test cases by tuning a few hyperparameters. We also report a different approach, using a standard LSTM, to compare with our Bidirectional LSTM model. We ran different scenarios by changing the hyperparameters. Each scenario used the same dataset. We split our corpus into 70% (9,953 sentences) for training and 30% (4,189 sentences) for testing. The random seed was turned off to focus only on the original effect of the hyperparameter settings. There were two big categories based on the models we tried. We tested every model by changing the number of hidden units of the LSTM cell, the number of layers, and the number of training iterations. Table 1 contains all experiments using the different methods. The rows represent the methods, and the columns represent the number of iterations. Each method varies either the LSTM type or the word embedding. Based on the results in Table 1, we found that BiLSTM (Bidirectional LSTM) works better than UniLSTM (Unidirectional LSTM). Word embedding makes a small difference in overall accuracy. More iterations increase accuracy, but the gain from each additional iteration is small. We also conducted another trial to identify the effect of the word embedding dimension, using 50% of the training documents, split into 70% of the sentences for training and 30% for testing. The results are as follows: 98.43% for 50 dimensions, 98.30% for 100 dimensions, 98.32% for 150 dimensions, 98.46% for 200 dimensions, 98.50% for 250 dimensions, 98.57% for 300 dimensions, 98.08% for 350 dimensions, 98.26% for 400 dimensions, 98.63% for 450 dimensions, and 98.26% for 500 dimensions. Our final result is 98.49% when using the Bi-LSTM model with Word2Vec embedding and 100 iterations.
The second experiment compared the performance of the proposed method with several approaches from previous state-of-the-art research. The problem is modeled in this research as sequential tagging of a sequence of input tokens. Several sequential tagging methods are used for comparison with the proposed approach, namely Maximum Entropy, Decision Tree, and Naïve Bayes. In addition to these traditional non-Deep Learning models, the performance of the proposed method is also compared with the previous Bi-LSTM study by Purwanto et al. [21]. The experimental results can be seen in Table 2. Compared with the results of previous studies, the best performance of the Bi-LSTM proposed in this study is an increase of approximately 13% over the approaches that do not use Deep Learning. Compared with the Bi-LSTM proposed in [21], the increase is approximately 2%. The reason is that the proposed approach uses two labels, while the approach in [21] uses four labels. Using two labels gives better results than the four labels of the previous study, especially in sentence boundary detection research.
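As a sketch of how an F1 score can be obtained from token labels, the function below computes precision, recall, and F1 for the "E" label from gold and predicted label sequences. Treating "E" as the positive class is an assumption, since the exact averaging used for the reported scores is not stated here; the function name and the toy labels are hypothetical.

```python
def f1_for_label(gold, predicted, positive="E"):
    """Precision, recall and F1 of one label over parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy labels: one boundary found, one missed, one false alarm.
gold = ["O", "O", "E", "O", "E"]
predicted = ["O", "E", "E", "O", "O"]
print(f1_for_label(gold, predicted))
```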

IV. Conclusion
We have conducted several experiments to test the capability of Bidirectional LSTM as the sequence learner for sentence boundary detection. We view this task as a sequential problem where every input token is predicted as ending a sentence or not. Based on our experiments, we reached a 98.49% F1 score with Bidirectional LSTM as our sequence learner and a trained embedding as the word embedding in the best model. We also compared our research with other widely used methods in sequence classification. We conclude that the Bidirectional LSTM is considerably better than a Unidirectional LSTM. In our case, word2vec does not effectively capture sentence boundaries for Indonesian news documents. Our last trial gives a similar F1 score whether using a low-dimension or high-dimension embedding size.