Indonesian Sentence Boundary Detection using Deep Learning Approaches

Joan Santoso, Esther Irawati Setiawan, Christian Nathaniel Purwanto, Fachrul Kurniawan

Abstract


Sentence boundary detection is a crucial pre-processing step in natural language processing. It determines where each sentence ends, which is not always obvious because the border between one sentence and the next can be ambiguous. Because there are multiple separators and sentence patterns vary, relying on the full stop alone to mark the end of a sentence is sometimes inappropriate. This research uses a deep learning approach to split Indonesian news documents into sentences, so there is no need to define handcrafted features or rules. As in Part-of-Speech Tagging and Named Entity Recognition, we treat the task as sequence labeling. Two labels are used: O for a non-boundary token and E for the last token of a sentence. For the labeling model we use Bi-LSTM, which has been widely used in sequence labeling. Using pre-trained Indonesian embeddings, as in previous studies, we show that the approach works for Indonesian text. The study achieved an F1-score of 98.49 percent, a significant improvement over previous studies.
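The O/E labeling scheme described in the abstract can be illustrated with a minimal Python sketch. It covers only the conversion between sentences and labels, not the Bi-LSTM model itself. The function names `encode_labels` and `decode_sentences`, the pre-tokenized input, and the two example sentences are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def encode_labels(sentences: List[List[str]]) -> List[str]:
    """Label every token O, except the last token of each sentence, which gets E."""
    labels: List[str] = []
    for sentence in sentences:
        if not sentence:
            continue  # skip empty sentences, they contribute no tokens
        labels.extend(["O"] * (len(sentence) - 1) + ["E"])
    return labels


def decode_sentences(tokens: List[str], labels: List[str]) -> List[List[str]]:
    """Rebuild sentences from a flat token stream by cutting after each E label."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    sentences: List[List[str]] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == "E":
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens with no closing E are kept as a final sentence.
        sentences.append(current)
    return sentences


# Illustrative, hand-written example (not from the paper's dataset).
# "Dr." and "2.500" contain a period but are not sentence boundaries.
gold = [
    ["Dr.", "Budi", "tiba", "di", "Jakarta", "."],
    ["Ia", "membawa", "Rp", "2.500", "."],
]
tokens = [token for sentence in gold for token in sentence]
labels = encode_labels(gold)
restored = decode_sentences(tokens, labels)
```

In this framing, a sequence labeling model such as a Bi-LSTM would predict `labels` from `tokens`, and `decode_sentences` would turn its predictions back into sentences.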






DOI: http://dx.doi.org/10.17977/um018v4i12021p38-48



Copyright (c) 2021 Knowledge Engineering and Data Science

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
