Comparison of Machine Learning Algorithms for Species Family Classification using DNA Barcode

Lala Septem Riza, M Ammar Fadhlur Rahman, Yudi Prasetyo, Muhammad Iqbal Zain, Herbert Siregar, Topik Hidayat, Khyrina Airin Fariza Abu Samah, Miftahurrahma Rosyda

Abstract


Classifying plant species within the Liliaceae and Amaryllidaceae families is challenging because of the complex genetic diversity and overlapping morphological traits among species. This study addresses that difficulty by comparing 11 supervised learning algorithms applied to DNA barcode data, with the aim of improving the precision of family-level classification in these taxonomically intricate plant families. The ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) gene, a standard DNA barcode locus for plants, is used to represent species within the Amaryllidaceae and Liliaceae families. The experimental results show that all tested models except Naïve Bayes classify species into the appropriate family with an accuracy exceeding 97%. In terms of computational time, the Random Forest model requires significantly more training time than the other models. In terms of memory usage, the Least Squares Support Vector Machine with a polynomial kernel and Regularized Logistic Regression consume more memory than the other models. When predicting families on the test dataset, these machine learning models show strong concordance with NCBI's classifications, effectively categorizing species into the Amaryllidaceae and Liliaceae families.
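The three comparison criteria named in the abstract (accuracy, training time, memory usage) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `MajorityClassifier` baseline, the `benchmark` helper, and the short sequences are hypothetical placeholders introduced only for illustration, and the sequences are synthetic strings rather than real rbcL barcodes.

```python
import time
import tracemalloc
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def benchmark(model, X_train, y_train, X_test, y_test):
    """Return test accuracy, training time (seconds) and peak traced memory (bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    predictions = model.predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return {
        "accuracy": correct / len(y_test),
        "train_seconds": train_seconds,
        "peak_bytes": peak_bytes,
    }


# Synthetic placeholder sequences (assumption), NOT real rbcL barcodes.
X_train = ["ATGTCA", "ATGTCC", "ATGACA"]
y_train = ["Amaryllidaceae", "Amaryllidaceae", "Liliaceae"]
X_test = ["ATGTCG", "ATGACC"]
y_test = ["Amaryllidaceae", "Liliaceae"]

result = benchmark(MajorityClassifier(), X_train, y_train, X_test, y_test)
print(result)
```

Any object exposing `fit` and `predict` could be passed to `benchmark` in place of the baseline, so the same three measurements are collected in the same way for every model being compared.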


DOI: http://dx.doi.org/10.17977/um018v6i22023p231-248



Copyright (c) 2023 Knowledge Engineering and Data Science

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
