Comparison of Machine Learning Algorithms for Species Family Classification using DNA Barcode

Classifying plant species within the Liliaceae and Amaryllidaceae families presents inherent challenges due to the complex genetic diversity and overlapping morphological traits among species. This study explores the difficulties in accurate classification by comparing 11 supervised learning algorithms applied to DNA barcode data, aiming to enhance the precision of species family classification in these taxonomically intricate plant families. The ribulose-1,5-bisphosphate carboxylase-oxygenase large sub-unit (rbcL) gene, selected as a DNA barcode locus for plants, is used to represent species within the Amaryllidaceae and Liliaceae families. The experimental results demonstrate that nearly all tested models achieve accurate species classification into the appropriate families, with an accuracy rate exceeding 97%, except for the Naïve Bayes model. Regarding computational time, the Random Forest model requires significantly more time for training than other models. Regarding memory usage, the Least Squares Support Vector Machine with a polynomial kernel, and Regularized Logistic Regression consume more memory than other models. These machine learning models exhibit strong concordance with NCBI's classifications when predicting families using the test dataset, effectively categorizing species into the Amaryllidaceae and Liliaceae families


I. Introduction
The development of living specimen processing technology [1] in recent decades has generated large amounts of biological data, including deoxyribonucleic acid (DNA) sequence data. The collection of DNA sequences starts with taking samples from living organisms. The sample is then processed through various stages such as extraction, enumeration, and amplification to obtain pieces of DNA. These DNA fragments are then collected and sequenced to obtain the nucleic acid symbols (adenine (A), guanine (G), cytosine (C), and thymine (T)) that compose the DNA sequence [2]. The sequenced fragments are then analyzed and assembled to reconstruct a complete genome. A part of the genome is then selected as a barcode representing the species [3] [4]. All these stages are depicted in Figure 1.
It has long been known that DNA sequences can be used to identify species, and nowadays, this activity is better known as DNA barcoding [5] [6]. DNA barcoding is a method for identifying unknown specimens. It sequences certain gene regions/loci that represent species in each kingdom, namely: cytochrome c oxidase subunit I (COI) for animals [7], obtained from mitochondria; ribulose-1,5-bisphosphate carboxylase-oxygenase large sub-unit (rbcL) and maturase K (matK) for plants [8], obtained from chloroplasts; and the internal transcribed spacer (ITS) for fungi [9], found in the cell nucleus. Species identification in DNA barcoding is done by analyzing the similarity between the barcode of a specimen and the barcodes of known species in a database. The specimen can be classified as an existing species if its barcode has a high degree of similarity to a known barcode. If no barcode with a high degree of similarity is found, the specimen may be a new species and needs to be verified by a taxonomist.
Several approaches are commonly used to classify species in DNA barcodes: tree-based, similarity-based, and character-based [10] [11]. The tree-based method classifies a barcode into species based on its membership in the DNA barcode tree. The similarity-based method classifies barcodes based on the number of similar characters in the DNA barcode. The character-based method, in turn, relies on the presence or absence of specific characters in the DNA barcode. In addition to these three approaches, species classification using DNA barcodes can also be treated as a supervised machine learning problem [12] [13] [14] [15] [16].
The Liliaceae family, colloquially called the 'Lily Family', predominantly consists of monocotyledonous plants characterized by notable morphological diversity. Encompassing approximately 16 genera and over 610 species [17], members of this family manifest primarily as herbs and shrubs. They are predominantly distributed across temperate and subtropical regions [18]. The amphipathic properties inherent to certain compounds within Liliaceae render them effective as surfactants. Beyond their ecological significance, these plants exhibit multifaceted utility: they are esteemed for ornamental purposes and utilized as vegetables, and certain species are acknowledged for their medicinal properties. Given the vast potential inherent to the Liliaceae family, they hold promise in cosmetics and pharmaceutical development [19].
The Amaryllidaceae family, a prominent member of the order Asparagales, is distinguished by its bulbous flowering plants. These plants are celebrated for their visually captivating flowers, making them popular for ornamental cultivation [20]. From a taxonomic perspective, the Amaryllidaceae family is stratified into three subfamilies: Agapanthoideae, Allioideae, and Amaryllidoideae [21]. Historically, these were regarded as distinct families. The term "Amaryllidaceae" is recurrently cited in phytochemical and pharmaceutical literature, particularly in discussions centered on the Amaryllidoideae subfamily [20] [22].
The medicinal potential of the Amaryllidaceae family is both historical and contemporary. Tracing back to the Classical period, luminaries like Hippocrates and Dioscorides harnessed the therapeutic properties of Narcissus oil, particularly for conditions believed to be associated with uterine tumors. In modern traditional medicine, the applications are diverse. For instance, Ammocharis is employed for blood purification and wound treatment, Brunsvigia for respiratory and hepatic ailments, Clivia for snakebites and facilitation of childbirth, and Crinum for a spectrum of conditions ranging from tumors to rheumatism [23].
In previous research, the Amaryllidaceae family was classified under the Liliaceae family. However, advancements in phylogenetics have led to a taxonomic reorganization. A team of scientists, spearheaded by Rolf Dahlgren [24], extensively examined monocot characteristics, including numerous microscopic features, culminating in a revised classification.
Historically, taxonomic experts such as Bentham and Hooker [25], Engler and Prantl [26], Bessey [27], Rendle [28], and Hutchinson [29] categorized Amaryllidaceae, with an inferior ovary, and Liliaceae, with a superior ovary, into distinct families based on ovary position. Despite these distinctions, both families exhibited numerous shared characteristics. Consequently, Cronquist [30] and Takhtajan [31] integrated the Amaryllidaceae family into Liliaceae. Further research regarded 'lilies' as a heterogeneous collection of genera and positioned them in families grouped under two orders: Asparagales and Liliales [32].
The problem in both families is illustrated by the classification of Allium albopilosum. Allium albopilosum, indigenous to Turkestan, is cultivated for its notable utility as a cut flower. While Allium species have traditionally been categorized under the Liliaceae family due to the presence of superior ovaries in their flowers, botanists disagree on this placement. Some propose their reclassification to the Amaryllidaceae family, citing the characteristic umbellate inflorescence. Others advocate establishing a separate family, Alliaceae, to accommodate them [33].
The Consortium for the Barcode of Life [8] advocated the rbcL gene as a barcode for plant taxonomy and phylogenetic analysis. This gene is pivotal in plant species identification, phylogenetics, and relationships. The rbcL gene is located in chloroplast DNA [8]. Several studies have employed the rbcL gene for plant relationship research. For instance, the rbcL gene has been used to elucidate the relationships within Selaginellaceae [34]. Similarly, another study combined the rbcL gene with trnL-F for a phylogenetic study on Rhamnaceae [35].
Machine learning is a field of study that attempts to extract knowledge from available data using computer programs that learn and improve automatically based on experience [36] [37]. Currently, applications of machine learning can be found in various everyday activities, such as recommendations for goods in Amazon e-commerce services [38], recommendations on the music streaming platform Spotify [39], and recommendations in education assessment [40] [41] [42]. In bioinformatics, machine learning has been widely used to solve problems in various areas, including genomics, proteomics, systems biology, evolution, microarrays, and text mining [43] [44] [45]. The application of machine learning in each case handles input data with different characteristics.
Based on the type of feedback from the input data, there are three forms of learning: supervised learning, unsupervised learning, and reinforcement learning [46]. Of the three forms of machine learning, bioinformatics case studies generally use supervised learning and unsupervised learning to solve problems. For example, supervised learning is used in genomics for the case of gene finding [47]. Another example is the application of Support Vector Machines (SVM) [48] and Random Forests (RF) [49] for the prediction of phenotypic effects [50]. An example of the application of unsupervised learning in bioinformatics is the clustering of genes from microarray data into groups with specific biological meanings [51].
This study compares supervised machine learning algorithms, implemented in the R programming language, for predicting the family of a species based on DNA barcode sequences. By predicting the family, we can more accurately place the species in the correct family in the taxonomy. The machine learning algorithms used in this research are Random Ferns, SVM Linear, SVM Poly, SVM Radial, SVM Radial Weights, LSSVM Poly, Naïve Bayes, Random Forest, C5.0, K-Nearest Neighbours, and Regularized Logistic Regression.
The DNA barcode sequence employed in this study is derived from the rbcL gene region of the chloroplast genome of each examined species. This research contributes to resolving the existing classification ambiguity between the Liliaceae and Amaryllidaceae families. It accomplishes this by applying various machine learning methodologies, the results of which are compared with the contemporary, state-of-the-art classification from NCBI to yield more definitive insights into the precise familial categorizations.

A. Data Collection
The data used are DNA barcode sequence data obtained from GenBank [52] (ncbi.nlm.nih.gov, accessed August 15, 2023). The dataset contains rbcL gene sequences from the chloroplast genome of plants in the Amaryllidaceae and Liliaceae families. Information on the number of species, sequences, and file size of each dataset is listed in Table 1. The Amaryllis dataset contains 802 samples from the Amaryllidaceae family, of which 689 were used for training and 113 for testing. The Lily dataset comes from the Liliaceae family and contains 853 samples, of which 713 were used for training and 140 for testing. The sequences in the dataset vary in length (in base pairs; bp), with the longest sequence having 1,458 bp and an average sequence length of 903 bp.
The training dataset was obtained by downloading all species sequences in the Amaryllidaceae and Liliaceae families while omitting several selected species. The complete list of species omitted from the training dataset can be seen in Table 2. The testing dataset consists of the sequences of the species omitted from the training dataset. The number of species in the testing dataset in Table 1 differs from the number of species in Table 2 because (1) not all species had samples of the rbcL gene sequence in GenBank at the time of data collection (example: Allium chrysanthum) and (2) GenBank distinguishes main species from varieties/sub-species (example: Crinum asiaticum and Crinum asiaticum var. japonicum). All species collected in the testing dataset are listed in Table 3.
The entire dataset is downloaded and saved in FASTA format. Figure 2 shows an example of dataset content containing the GenBank accession number, species name, sequence description, and DNA sequence. Each sequence record starts with a line beginning with the greater-than symbol (">") and ends with a blank line.
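As a minimal illustration of this record layout, the following base R sketch parses FASTA text into a named character vector. The helper name parse_fasta and the accession numbers in the example are hypothetical; the study itself reads the files with the Biostrings package, so this is only a simplified stand-in.

```r
# Parse FASTA lines into a named character vector (header -> sequence).
# Hypothetical helper for illustration only; not the parser used in the study.
parse_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]              # drop blank separator lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)             # record index of every line
  headers <- sub("^>", "", lines[is_header])
  # Group sequence lines by record; empty records yield an empty string.
  groups <- split(
    lines[!is_header],
    factor(record_id[!is_header], levels = seq_along(headers))
  )
  seqs <- vapply(groups, paste, character(1), collapse = "")
  stats::setNames(unname(seqs), headers)
}

# Made-up records that mimic the header layout described in the text.
example_lines <- c(
  ">AB000001.1 Hypothetical species one rbcL gene, partial cds",
  "ATGTCACCACAAACAGAG",
  "ACTAAAGCAAGT",
  "",
  ">AB000002.1 Hypothetical species two rbcL gene, partial cds",
  "ATGTCACCACAAACGGAG"
)
fasta <- parse_fasta(example_lines)
print(nchar(fasta))
```

Sequence lines belonging to one record are concatenated, so a sequence wrapped over several lines is recovered as a single string.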

B. Computational Model
The computational model used in this study is depicted in Figure 3. This study uses the R programming language, version 4.2.1, run on a computer with an Intel Core i5-1135G7 processor (eight logical cores, 2.4 GHz), 16 GB of RAM, and a 512 GB solid-state disk (SSD). Several stages use packages available in the public repositories CRAN and Bioconductor, although some preparatory steps are still needed to adapt the packages to the needs of this research. Each stage in the computational model is explained as follows.
The first stage is retrieving the training/testing dataset. All data are downloaded programmatically with the help of the rentrez package [53]. A filter query is built to search for DNA sequences that match the following criteria: (1) the organism is a member of the Amaryllidaceae or Liliaceae family, (2) the sequence is more than 450 bp and less than 10,000 bp in length, (3) for the training data, the species selected for testing are excluded, whereas for the testing data, only those species are included, and (4) the sequence is the rbcL gene. The search results are used to download the whole sequences in FASTA format. A series of pre-processing stages, from DNA Sequence Parsing to Family Labeling, is then carried out so that the DNA sequences can be used in the classification model.
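The exact filter query is not reproduced in the text. As a hedged sketch, the four criteria could be combined into an Entrez search term as shown below. The helper build_query, the choice of field tags ([Organism], [Gene Name], [SLEN]), and the excluded species in the example are assumptions made for illustration, not the query used in the study.

```r
# Build an Entrez search term from the filter criteria (illustrative only).
# The field tags and helper name are assumptions, not taken from the study.
build_query <- function(families, gene = "rbcL",
                        min_len = 451L, max_len = 9999L,
                        exclude_species = character(0)) {
  family_part <- paste0(
    "(", paste0(families, "[Organism]", collapse = " OR "), ")"
  )
  gene_part <- paste0(gene, "[Gene Name]")
  # "More than 450 bp and less than 10,000 bp" as an inclusive range.
  length_part <- paste0(min_len, ":", max_len, "[SLEN]")
  query <- paste(family_part, gene_part, length_part, sep = " AND ")
  if (length(exclude_species) > 0) {
    excluded <- paste0(
      "\"", exclude_species, "\"[Organism]", collapse = " OR "
    )
    query <- paste0(query, " NOT (", excluded, ")")
  }
  query
}

# Training-style query: one family, with a held-out species excluded.
query <- build_query("Amaryllidaceae", exclude_species = "Crinum asiaticum")
print(query)
```

A testing-style query would invert criterion (3) by listing only the held-out species instead of excluding them.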
The second stage is DNA Sequence Parsing. At this stage, sequences in FASTA format are converted to the DNAStringSet format with the help of the Biostrings package [54]. The results of the sequence conversion in this stage are exemplified in Figure 4. The third stage is sequence alignment. The datasets are combined and aligned so that the symbols are arranged consistently across sequences and all sequences have the same length. Sequence alignment is run using the MUSCLE multiple sequence alignment algorithm with the help of the muscle package [55].
The fourth stage is aligned sequence parsing. The sequence alignment results are converted to the DNAbin format with the help of the ape (Analyses of Phylogenetics and Evolution) package [56] so that they can be read by the package used in the next stage.
The fifth stage is sequence trimming. The aligned sequences are trimmed so that there are no gap symbols at the upstream (left) and downstream (right) ends of each sequence. The sequences were trimmed with the help of the ips (Interfaces to Phylogenetic Software) package [57] until 99% of the sequences had no gaps at the upstream and downstream ends. Figure 5 shows an example of DNA sequence data before and after sequence alignment. The sixth stage is conversion to a data frame. The trimmed sequences are converted into the data frame structure, the fundamental tabular format commonly used in the R programming language. Each symbol in the sequence is converted to a column with the character data type (chr). The DNA representation in the data frame is shown in Figure 6. The eighth stage is casting DNA bases into factors. Each column containing a DNA sequence symbol in the data frame is cast to an unordered factor data type with five levels. These levels represent the gap symbol and the nucleobases in the DNA sequence: "-", "a", "c", "g", and "t". Other, ambiguous nucleobase symbols are replaced with gaps.
The ninth stage is family labeling. A new column filled with family labels according to the data from GenBank is added to the training data frame, while a new, empty family column is added to the testing data frame.
The next stage is one-hot encoding. In this stage, the data is transformed into a numeric representation to facilitate subsequent processing. Specifically, each character that represents a nucleobase or an alignment gap, namely "a", "c", "g", "t", or "-", is mapped to five columns of a matrix. Within these five columns, the column corresponding to the specific character is assigned a value of 1, while the remaining columns are assigned a value of 0, as illustrated in Figure 7 [58]. A validation process is also carried out to ensure the model is not overfitting or underfitting, using cross-validation with the help of the caret package [59]. The parallel [60] and doParallel [61] packages speed up the cross-validation resampling. The foreach package [62] is also used to turn off parallel compute mode. The models used in this experiment include: C5.0, knn (k-Nearest Neighbors), lssvmPoly (Least Squares Support Vector Machine with a polynomial kernel), naive_bayes (Naïve Bayes), regLogistic (Regularized Logistic Regression), rf (Random Forest), rFerns (Random Ferns), svmLinear (Support Vector Machines with Linear Kernel), svmPoly (Support Vector Machines with Polynomial Kernel), svmRadial (Support Vector Machines with Radial Basis Function Kernel), and svmRadialWeights (Support Vector Machines with Class Weights).
The next stage is prediction. Class prediction is carried out on the testing data frame based on the model built in the previous stage.
The last stage is evaluation. The prediction results of the classification models are evaluated based on their accuracy with respect to the family label of each sequence in GenBank and the results of the sequence consensus made using the DECIPHER package [63]. The duration and memory used when training the models are measured using the profvis package [64].

III. Results and Discussions
This study used rbcL gene sequence data from species in the Amaryllidaceae and Liliaceae families obtained from GenBank. Each species in the dataset has more than one sample because each sequence comes from sequencing results in different locations. All dataset downloads are performed using program code. For example, downloading the Amaryllis training dataset starts by searching GenBank using the entrez_search function from the rentrez package in the following program code. The argument for the term parameter is a variable that contains the search query.
    search_result <- rentrez::entrez_search(
      db = "nuccore",
      term = query,
      retmax = limit,
      use_history = TRUE
    )

The search results are then used to download the sequences using the entrez_fetch function from the rentrez package. The download is carried out iteratively, in chunks of 50 identifiers, until the DNA sequences from the search results are entirely downloaded.

    Initialize sequences as an empty string
    Set chunk_size to 50
    Calculate num_iterations as the total number of ids in search_result divided by chunk_size, rounded up
    For each iteration i from 1 to num_iterations:
        Calculate start_idx based on the current iteration and chunk_size
        Calculate end_idx as the smaller value between i times chunk_size and the total number of ids in search_result
        Extract the subset of ids from search_result between start_idx and end_idx and store it in current_ids
        Use current_ids to fetch sequence data from the nuccore database and store the result in fetchRes
        If fetchRes is not found in sequences:
            Append fetchRes to sequences

The downloaded result is then exported into a file using the write function. Detailed information about the datasets successfully downloaded and used in this study is presented in the Data Collection section.
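The chunked download loop in the pseudocode above could be written in R roughly as follows. This is a sketch under stated assumptions: fetch_in_chunks is a hypothetical helper, and the network call is injected as the fetch_fn argument so that the batching logic can be run offline. In the study this role is played by rentrez::entrez_fetch.

```r
# Download records in chunks of `chunk_size` identifiers (illustrative sketch).
# `fetch_fn` takes a vector of ids and returns one FASTA string. With rentrez
# it could be, for example:
#   function(ids) rentrez::entrez_fetch(db = "nuccore", id = ids, rettype = "fasta")
fetch_in_chunks <- function(ids, fetch_fn, chunk_size = 50) {
  sequences <- character(0)
  num_iterations <- ceiling(length(ids) / chunk_size)
  for (i in seq_len(num_iterations)) {
    start_idx <- (i - 1) * chunk_size + 1
    end_idx <- min(i * chunk_size, length(ids))
    current_ids <- ids[start_idx:end_idx]
    fetch_res <- fetch_fn(current_ids)
    # Skip a chunk that was already stored, as in the pseudocode.
    if (!(fetch_res %in% sequences)) {
      sequences <- c(sequences, fetch_res)
    }
  }
  paste(sequences, collapse = "")
}

# Offline stand-in for the network call: one fake record per id.
fake_fetch <- function(ids) paste0(">", ids, "\nACGT\n", collapse = "")
all_records <- fetch_in_chunks(as.character(1:120), fake_fetch)
```

With 120 identifiers and a chunk size of 50, the loop runs three times (50, 50, and 20 identifiers).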
    write(amaryllis_train, file = "amaryllis_train.fasta")

After all datasets have been downloaded, the next step is to carry out the data pre-processing stages and the experiments. The configuration of the experimental scenario is shown in Table 4. All the data went through the same pre-processing stages up to sequence alignment. After that step, the configurations differ in the sequence trimming threshold, which affects the length of the resulting sequences after the sequence trimming stage. For each configuration, resampling was performed using ten repetitions of 10-fold cross-validation.
One of the steps in the pre-processing stage is sequence alignment. The MUSCLE algorithm is used with the help of the muscle package. The DNA sequence data from the training and testing datasets are combined, converted to the DNAStringSet data type, and given as the argument to the muscle function. After sequence alignment, each sequence has a length of 3050 bp and is stored in the DNAMultipleAlignment format.
The aligned sequences are then trimmed using the trimEnds function from the ips package. At this stage, a minimum threshold for the number of columns that do not have gaps can be set in the sequence_trimming_threshold variable. The DNA sequences converted to the DNAbin data type are given as the first argument of this function. After sequence trimming with a 99% threshold configuration, the sequences have a length of 1192 bp.
    parsed_aligned_dataset <- ape::as.DNAbin(aligned_dataset)
    trimmed_dataset <- ips::trimEnds(parsed_aligned_dataset, trim_at_least)

The DNA sequences are then converted to the data frame data type and separated again into training and testing sets. The separation is done based on the sequence labels (row names) that each dataset had before being combined. After splitting, the testing data frame contains 220 sequence rows and the training data frame contains 1,402 sequence rows. Note that the threshold set in the sequence trimming process affects the resulting sequence data. The larger the threshold, the shorter the sequences become, which can cause the sequence data of different specimens to have the same content.
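To illustrate why a larger threshold yields shorter sequences, the following base R sketch trims the ends of a toy alignment. It is a simplified stand-in written for this illustration, not the ips::trimEnds implementation: the helper trim_alignment_ends and its min_fraction parameter are hypothetical, and the alignment is held as a plain character matrix rather than a DNAbin object.

```r
# Trim alignment ends to the first and last columns in which at least
# `min_fraction` of the sequences carry a non-gap symbol (illustrative only).
trim_alignment_ends <- function(aln, min_fraction = 0.99, gap = "-") {
  # aln: character matrix, one row per sequence, one column per position
  non_gap_fraction <- colMeans(aln != gap)
  keep <- which(non_gap_fraction >= min_fraction)
  if (length(keep) == 0) stop("no column reaches the requested threshold")
  aln[, min(keep):max(keep), drop = FALSE]
}

# Toy alignment of four sequences with ragged ends.
aln <- do.call(rbind, strsplit(c("--acgt-", "-aacgtt", "aaacgt-", "-aacg--"), ""))

loose <- trim_alignment_ends(aln, min_fraction = 0.75)   # keeps columns 2 to 6
strict <- trim_alignment_ends(aln, min_fraction = 1.00)  # keeps columns 3 to 5
print(c(ncol(aln), ncol(loose), ncol(strict)))
```

In the toy data, the stricter threshold leaves only the three fully covered columns, in which all four sequences are identical, which mirrors the loss of distinguishing content described above.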
Next, the base symbols of the sequences in the data frame are converted to the factor data type. The factor function, which encodes a vector as a factor, is passed as the second argument of the lapply function. The subsequent arguments of lapply are passed on to the function given as the second argument. Here, a vector of five levels representing the gap and the nucleotide base symbols is supplied for the levels parameter. After the conversion, any symbol other than these five becomes NA and is then replaced with the gap symbol.
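A minimal base R sketch of this conversion is shown below, using a small made-up data frame in place of the real training data. The column names and the ambiguous symbols "n" and "y" in the example are illustrative assumptions.

```r
# Five factor levels: the gap symbol plus the four nucleobases.
dna_levels <- c("-", "a", "c", "g", "t")

# Toy stand-in for the training data frame (one column per alignment position).
train_df <- data.frame(
  V1 = c("a", "n", "g"),
  V2 = c("c", "-", "y"),
  stringsAsFactors = FALSE
)

# `factor` is the second argument of lapply; `levels` is forwarded to it.
train_df[] <- lapply(train_df, factor, levels = dna_levels)

# Symbols outside the five levels became NA; replace them with the gap symbol.
train_df[] <- lapply(train_df, function(column) {
  column[is.na(column)] <- "-"
  column
})
str(train_df)
```

Assigning through train_df[] keeps the data frame structure while replacing every column.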

    For each column in train_df:
        Convert the column to a factor with levels "-", "a", "c", "g", "t"
        Replace the original column with the converted column

In the training data frame, a new column is added for the family label based on the information obtained from GenBank. Labeling is done based on the sequence labels (row names) that each training dataset (Amaryllis and Lily) had before being combined. After this step, the data is transformed into a numeric representation using one-hot encoding.
    Function oneHotDNA(df, n = ["A", "C", "G", "T", "-"]):
        Convert all elements in df to uppercase
        Initialize seq_col as the number of columns in df
        Initialize seq_row as the number of rows in df
        Create an empty matrix seq_mat with dimensions (seq_row, seq_col * length(n))
        Initialize an empty list column_names
        For each column i from 1 to seq_col:
            For each row j from 1 to seq_row:
                If the element at (j, i) in df is in n:
                    Find the position of the element in n
                    Set the corresponding position in seq_mat to 1
            For each element j in n:
                Append (j + "-" + i) to column_names
        Set the column names of seq_mat to column_names
        Return seq_mat
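The pseudocode above might be realized in base R as in the sketch below. The function name one_hot_dna and the vectorized inner step (match plus matrix indexing instead of a row loop) are choices made for this illustration and may differ from the implementation used in the study.

```r
# One-hot encode a data frame of aligned DNA symbols (illustrative sketch).
# Each alignment position expands into length(symbols) indicator columns.
one_hot_dna <- function(df, symbols = c("A", "C", "G", "T", "-")) {
  chars <- as.matrix(df)
  chars[] <- toupper(chars)                  # keep the matrix shape
  n_sym <- length(symbols)
  seq_mat <- matrix(0L, nrow = nrow(chars), ncol = ncol(chars) * n_sym)
  for (i in seq_len(ncol(chars))) {
    pos <- match(chars[, i], symbols)        # NA for symbols not in `symbols`
    hit <- which(!is.na(pos))
    # Set the indicator column that belongs to position i and the symbol.
    seq_mat[cbind(hit, (i - 1) * n_sym + pos[hit])] <- 1L
  }
  colnames(seq_mat) <- paste(
    rep(symbols, times = ncol(chars)),
    rep(seq_len(ncol(chars)), each = n_sym),
    sep = "-"
  )
  rownames(seq_mat) <- rownames(df)
  seq_mat
}

# Two toy sequences of length two; "n" is not in the symbol set.
toy <- data.frame(p1 = c("a", "g"), p2 = c("-", "n"))
encoded <- one_hot_dna(toy)
print(encoded)
```

A symbol outside the five-element set, such as "n" here, leaves all five of its indicator columns at 0.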
Next, the classification models are built with the parameters obtained from the random search process. The first argument is the model formula, and the second is the data source used. The third argument is the model being used, which has been defined previously with the parameters obtained from the random search process. The last two arguments are the configuration for caret's train function and the maximum number of tuning parameter combinations that will be generated. The resulting model is then stored in the model variable. The tilde operator (~) is used to define the model formula. The left side of the operator is interpreted as the output of the function, and the right side is the function input. In this case, the formula Family ~ . means that the Family column is modeled as a function of all data other than the Family column.
Classification models can be created directly using the default parameters. However, it cannot be known whether such a model is the best, as it is created only once. A resampling step with the cross-validation method was therefore carried out using the caret package to verify the performance of the classification model. In this study, 10-fold cross-validation was used with ten repetitions. Caret supports creating classification models and cross-validation with machine learning models from other packages, such as C5.0, kknn, LiblineaR, naivebayes, rFerns, randomForest, and kernlab. For example, a Random Ferns model is created in this way. After the model definition is completed, a cross-validation configuration is prepared using the trainControl function.
In practice, this resampling process takes a relatively long time. To speed it up, the resampling process is configured to run in parallel, utilizing all available CPU cores. By default, caret runs the resampling process in parallel when a parallel backend is available. However, configuring and initializing a socket cluster is still necessary to perform parallel processing. The makePSOCKcluster function from the parallel package is used with the number of cores as its parameter, in this case eight cores. The cluster is then registered to run operations in parallel using the registerDoParallel function from the doParallel package. The creation of the socket cluster can be validated with the showConnections function.
The outcomes of the classification algorithms are presented in Table 5, where the optimal hyperparameters were determined through 10-fold cross-validation. Notably, C5.0, Regularized Logistic Regression, Random Forest, SVM linear, SVM poly, SVM radial, and SVM radial weights yielded exceptional classification accuracy at 99.85%. In contrast, the Naïve Bayes algorithm was the least effective model, reaching only 53.2% accuracy. A graphical representation of the accuracy obtained during the training phase is given in Figure 8.
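As a minimal illustration of the resampling scheme (10-fold cross-validation repeated ten times), the base R sketch below assigns every training row to a fold in each repetition. The helper make_repeated_folds is hypothetical and is not caret's implementation. In the study the folds are generated internally by caret, and unlike this sketch such tools may also stratify the folds by class.

```r
# Assign n rows to k folds, independently for each repetition (sketch only).
# Returns a list with one integer vector of fold labels per repetition.
make_repeated_folds <- function(n, k = 10, repeats = 10, seed = 1) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    # Shuffle a balanced vector of fold labels 1..k.
    sample(rep(seq_len(k), length.out = n))
  })
}

# 1,402 rows, matching the size of the training data frame.
folds <- make_repeated_folds(n = 1402)

# Rows held out for validation in repetition 1, fold 3; all others train.
held_out <- which(folds[[1]] == 3)
print(table(folds[[1]]))
```

Each model is therefore fitted 100 times per parameter combination (10 folds times 10 repetitions), which explains why the resampling step dominates the run time.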
From Figure 8, we can see that the accuracy of the models is relatively high, except for Naïve Bayes. The remaining models obtained more than 97% accuracy. To assess the computing resources required, we monitored the memory usage and the computational time of each model. The results can be seen in Table 6.
In Table 6, we present a comparative analysis of computational efficiency, quantified in terms of training time and memory utilization, across the machine learning models. The Random Ferns algorithm exhibits the best computational performance, requiring only 24.67 seconds for execution. It is followed by Naïve Bayes, SVM with a linear kernel, and Regularized Logistic Regression, which require 41.14 seconds, 45 seconds, and 77.76 seconds, respectively. Conversely, the Random Forest algorithm is the least computationally efficient, taking 1 hour, 12 minutes, and 51.75 seconds.
In terms of memory utilization, the Naïve Bayes algorithm has the smallest memory footprint, consuming only 863.9 megabytes (MB), but this model performs poorly in the experiment. It is followed by k-Nearest Neighbors, Random Forest, Random Ferns, and Support Vector Machines with a linear kernel, which exhibit memory usage within the 1450 to 1650 MB range. Conversely, the Least Squares Support Vector Machine (LSSVM) with a polynomial kernel and Regularized Logistic Regression show significantly higher memory consumption, requiring 6511.4 MB and 4419.8 MB, respectively. Visual representations of these metrics are provided in Figure 9 and Figure 10.
After the classification model is trained, predictions are made on the testing data using the model. The predict function is used with the trained model as the first argument and the testing data frame as the second. The output of the predict function is saved to the prediction variable.
    prediction$results <- predict(model, PredictData)

Figure 11 illustrates the outcomes of the various algorithms applied in predicting the family of species within the test dataset. This evaluation was conducted by executing each machine learning model to predict classifications of the ambiguous data, followed by a comparison with the labels provided by NCBI. A prediction was deemed accurate if it aligned with the NCBI labels. Remarkably, the consistency between these results and the accuracy observed during the training phase suggests the reliability of NCBI's classification of the contentious data.
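The comparison against the NCBI labels amounts to computing the share of matching predictions. A minimal base R sketch is given below. The helper evaluate_predictions and the four example labels are made up for illustration and are not results from the study.

```r
# Compare predicted families with reference labels (illustrative sketch).
evaluate_predictions <- function(predicted, reference) {
  predicted <- as.character(predicted)
  reference <- as.character(reference)
  stopifnot(length(predicted) == length(reference))
  families <- union(reference, predicted)
  confusion <- table(
    Predicted = factor(predicted, levels = families),
    Reference = factor(reference, levels = families)
  )
  # A prediction counts as accurate when it equals the reference label.
  list(accuracy = mean(predicted == reference), confusion = confusion)
}

# Made-up labels: one Amaryllidaceae sequence is predicted as Liliaceae.
reference <- c("Amaryllidaceae", "Amaryllidaceae", "Liliaceae", "Liliaceae")
predicted <- c("Amaryllidaceae", "Liliaceae", "Liliaceae", "Liliaceae")
evaluation <- evaluate_predictions(predicted, reference)
print(evaluation$confusion)
```

The confusion table shows in which direction the disagreements fall, which the single accuracy figure does not.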
For species in the Amaryllidaceae family (according to NCBI), almost all algorithms consistently predict Amaryllidaceae, in agreement with the NCBI classification. This suggests that the algorithms agree on classifying these species into the Amaryllidaceae family. However, there is some variability in the predictions. For example, LSSVM poly and Naïve Bayes occasionally predict species in the Amaryllidaceae family as belonging to the Liliaceae family. This indicates that while these algorithms are generally accurate, there are specific cases where they diverge from the consensus. For the Liliaceae family, the algorithms agree with the NCBI classification for the species listed under this family, apart from a few anomalies such as Lilium bulbiferum subsp. croceum, where most algorithms predict Amaryllidaceae instead of Liliaceae.
The machine learning algorithms largely agree with the NCBI classification, demonstrating their effectiveness in classifying species into the Amaryllidaceae and Liliaceae families. Almost all machine learning algorithms reach an accuracy of 98% on the test dataset, except knn with 96% and Naïve Bayes with 65% accuracy. However, the algorithms diverge in a few instances, suggesting areas for further research and model refinement. Overall, this comparison is a valuable resource for evaluating the performance and reliability of various machine learning algorithms in plant taxonomy.

IV. Conclusions
Almost all of the models compared in this study were able to classify the DNA barcode data using the rbcL gene with more than 97% accuracy, except for the Naïve Bayes model with just 53% accuracy. From the resampling process using ten repetitions of 10-fold cross-validation, the most accurate models, namely C5.0, Regularized Logistic Regression, Random Forest, SVM linear, SVM poly, SVM radial, and SVM radial weights, reached a classification accuracy of 99.85%. Regarding computational time, the most expensive model is Random Forest and the least expensive is Random Ferns, which uses only 24.67 seconds of computing time. In terms of memory, LSSVM with a polynomial kernel and Regularized Logistic Regression have the highest usage at 6511.4 MB and 4419.8 MB, respectively. In contrast, Naïve Bayes uses the least memory, but its accuracy is considerably lower than that of the other models. When predicting the family using the test dataset, the machine learning models align closely with NCBI's classifications, effectively categorizing species into the Amaryllidaceae and Liliaceae families. Nonetheless, some discrepancies exist, indicating the need for additional research and model improvement.

Fig. 1. Process of processing living specimens into DNA barcodes

Fig. 3. Computational model of comparison of machine learning algorithms for species family classification using DNA barcode

Fig. 4. Conversion of DNA sequences from FASTA format to the DNAStringSet data type

Fig. 5. DNA sequences before and after alignment and trimming

Fig. 6. DNA sequences in the data frame

The seventh stage is splitting the training and testing sets. The data in the data frame are separated back into data frames for training and testing. All sequences whose species names are listed in Table 3 are separated into a new data frame as the testing data frame.

Table 1. Descriptions of the used datasets

Table 2. List of species selected for test data

Table 3. List of species included in test data

Table 4. Experimental scenario and results

Table 5. Best parameters obtained from the training process, along with accuracy

Table 6. Computational time and memory used in the training process