Non-Gaussian Analysis of Herbarium Specimen Damage to Optimize Specimen Collection Management

Damage to specimen collections occurs in practically every herbarium across the world. Herbaria must therefore take precautions, such as investigating the factors that cause specimen damage in their collections and evaluating their collection handling and usage policies. However, manual investigation of the causes of herbarium collection damage requires considerable effort and time, and only a few studies have attempted it. So far, a non-Gaussian approach to detecting the causes of damage to herbarium specimens has not been studied. This study explores the effect of species type, time, location, storage, and remounting status on the level of damage to herbarium specimens, specifically those in the genus Excoecaria. Gaussian modeling is not adequate for count data (the number of damage spots on herbarium specimens). Negative binomial regression (NBR) provides a better model than generalized Poisson regression and ordinary Gaussian regression. NBR detects non-uniformity in the storage process, which causes damage to herbarium specimens. Natural damage to herbarium specimens is associated with differences in species and in the origin of specimens.
collection over the years. Damage to the herbarium collection can be seen in Figure 1. These circumstances may cause bias in herbarium specimen data and uncertainty in decision-making and study outcomes.
Herbarium Bogoriense (BO) is the largest herbarium center in Southeast Asia and one of the top three in the world. It holds a comprehensive collection of flowering plants, gymnosperms, ferns and lycophytes, mosses, liverworts, fungi, and many more. It contains nearly one million specimens from the Malesian region (Indonesia, Singapore, Malaysia, Brunei Darussalam, Timor-Leste, Papua New Guinea, and the Philippines), obtained through field expeditions and through gifts or exchanges between herbaria around the world [11]. The herbarium specimens, both dry and wet collections, are stored and arranged in the space provided by the curator. Collections are classified according to their respective taxa. Monocot and dicot collections are kept separately. Within each, collections are arranged alphabetically by family, genus, species, and site. Specimen sheets use acid-free paper and are kept in species folders and genus folders. Type specimens are stored separately from the general collection [11]. As one of the main reference centers for research on tropical plant taxonomy, ecology, ethnobiology, physiology, morphogenetics, and phytochemistry in the Malesian region, BO must ensure that all its collections remain in good condition and must minimize the possibility of damage.
Keeping the herbarium collection in good condition throughout the process, from specimen collecting to storage, is challenging for the curator. In some cases, the herbarium sheet is the only remaining record of a plant, as the plant may have disappeared from its original locality. Protecting the sheets from fungal and insect pests is therefore an important step. After a collection has been preserved, it should be checked regularly to ensure that the specimens are in good condition and free of insects or excessive dampness. Insects have the potential to destroy herbarium collections, and they will inevitably attack the specimens even with the most meticulous care and the best equipment. The curators routinely check the specimens for damage, especially damage caused by fungi or insects [12][13]. Although preventive measures have been taken to eliminate insects and fungi that could damage the specimens, the curators still found some damaged specimens. The specimens most damaged by insects or fungi were from the genus Excoecaria. The curators therefore took the initiative to investigate the factors that cause specimen damage in their collections, which is a vital step in reviewing and evaluating the herbarium's collection handling and usage policy. However, manual investigation of the causes of herbarium collection damage requires a lot of effort and time, and only a few studies have attempted it. Several studies have worked with damage or other traits recorded on specimens: Meineke used digital herbarium specimens to study long-term insect-plant interactions [14], and Pearson used machine learning on digital herbarium specimens for phenological research [15]. Many metadata-based studies have also been carried out. Some have sought time series patterns and specimen distributions of genetic changes in a specimen, and others link herbarium specimen metadata to climate change patterns [16][17][18]. In contrast, this study examines how the metadata recorded on herbarium specimen labels relate to the damage to herbarium specimens.
The curator assesses specimen damage. If the specimen is damaged, the curator marks the damaged area in the specimen photo and records the source of the damage. The size of the damage marker box varies with the extent of the damage. One specimen sheet can have several defects from various sources. Damage to herbarium specimens falls into three categories: before processing (BP), in processing (IP), and insect damage. The first category covers damage that occurred before collection (i.e., damage caused by natural forces in nature). The second category covers damage that occurred during the collection or remounting of herbarium specimens (in-process damage). The third category is insect damage that occurs while the specimens are in storage.
Damage identification in a herbarium specimen is based on the number of damaged spots and the source of damage (BP, IP, or insect). The study's response variable is therefore count data, so ordinary linear regression cannot be used to model the phenomena in this investigation. The Generalized Linear Model (GLM) can model data with non-linear characteristics. A GLM has three essential components: a random component, a systematic component, and a link function [19]. Regression with a count response can be carried out with Generalized Poisson Regression and Negative Binomial Regression [20]. Generalized Poisson Regression (GPR) is suitable for modeling count data [20]. In the GPR model, the response variable follows the Generalized Poisson distribution (GPD), which can model both overdispersion and underdispersion well [20][21]. Negative Binomial Regression can also be used to model count data. The negative binomial distribution is a Poisson-Gamma mixture. It can accommodate the overdispersion that Poisson regression cannot, because it does not require equidispersion [20][22].
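The equidispersion requirement mentioned above can be checked informally with the variance-to-mean ratio of the counts. The following is a minimal sketch only: the function name `dispersion_index` and the example counts are hypothetical and are not taken from the study data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (close to 1 for Poisson-like data)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical damage-spot counts per specimen (illustrative only, not study data).
example_counts = [0, 1, 0, 3, 12, 0, 2, 25, 1, 0, 7, 4]
ratio = dispersion_index(example_counts)
print(f"variance/mean = {ratio:.2f}")
```

A ratio well above 1 suggests overdispersion, the situation in which a negative binomial or generalized Poisson model is usually preferred over a plain Poisson model.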

II. Methods
The stages of analysis in this study are depicted in Figure 2. The first stage is the quantification of herbarium specimen damage, in which we annotate each type of damage on each herbarium specimen. In the second stage, we evaluate whether the three types of damage form a multivariate phenomenon, using the correlation between each pair of damage types. If there is a significant correlation between the damage types, multivariate modeling is carried out; otherwise, univariate modeling is carried out. The next stage is non-Gaussian regression modeling, using two types of model (NBR and Generalized Poisson Regression). Gaussian regression is also fitted as a comparison. In the last stage, we evaluate the models obtained in the previous stage, using the AIC to identify the best model.

A. Specimen Overview
Recently, the scientific curator of BO reported that the collection was damaged. Several genera were affected, such as Antidesma, Baccaurea, Breynia, and Excoecaria. However, the most damage occurred in the genus Excoecaria, in which curators found 2,146 defects in 175 specimens. These defects cover all three types of damage: natural damage, damage from mounting or remounting, and damage caused by insects.
Excoecaria is a genus of plants in the Euphorbiaceae family [23]. The name is derived from the Latin word excaeco, which means "to blind," and refers to the sap of some species, which can induce temporary blindness [24]. Excoecaria species are shrubs or trees with milky latex, glabrous, and monoecious or dioecious. The leaves are alternate, with two glands at the petiole-lamina junction. The inflorescences are spikes or racemes with flowers clustered in the axils of bracts; female inflorescences are shorter than male ones. Perianth segments 2 or 3. Stamens 2 or 3, filaments basally fused. Ovary 2- or 3-locular, with a solitary ovule in each locule; styles 3, linear, free. Capsules 3-lobed. The milky latex irritates the skin and can cause injury and blindness if it reaches the eye. The genus comprises 40 species worldwide, distributed from tropical Africa to Malaysia and Australia [25]. A specimen may be damaged by a single source or by several sources. Examples of specimens damaged by a single source can be seen in Figure 1.

B. Specimen Herbarium Damage Quantification
We quantified herbivory on specimens of the genus Excoecaria collected in Indonesia, New Guinea, Malaysia, and the Philippines and preserved in the Herbarium Bogoriense. We chose this genus because its specimens were the most damaged in the Herbarium Bogoriense.
The curator assesses specimen damage. If a specimen is damaged, the curator puts a checkmark on the damaged part in the specimen photo and records the source of the damage. The size of the damage marker box is not uniform; it depends on the size of the damaged part of the specimen. One specimen sheet can carry one or more defects with different sources of damage. The causes of damage to herbarium specimens are classified into three categories. The first is damage that occurred before the specimen collection process, caused by natural factors in nature (natural damage). The second is damage caused during the collection process or the remounting of herbarium specimens (in-process damage). The third is damage caused by insects at the specimen storage location (damage by insects).
Differentiating between pre-collection and post-collection herbivory on herbarium specimens is a challenge. Pre-collection herbivory on the leaves of some plant species can be distinguished by a thin, darkened contour around the damaged area, which shows that the plant was still alive when the herbivory killed the cells in that area [6]. If there is no localized cell death surrounding the injured area, post-collection herbivory or storage-related damage is assumed [26]. We found that the morphology of leaf damage in Excoecaria was similar before and after collection, so we applied the same criterion to identify pre-collection herbivory and relied on the curator's judgment to separate pre- and post-collection damage.
Specimen damage due to the mounting or remounting process is usually indicated by an envelope attached to the specimen sheet. The envelope holds broken stems or torn leaf pieces. Process damage is often seen at the leaf tips or margins rather than in the interior of the leaves. One cause is leaf folding during drying, which leaves the leaf shape imperfect. In addition, because the leaves and stems are fragile, they can be ripped or broken when the specimen is transferred from the old specimen paper to the new one.

C. Statistical Analysis
This study distinguishes three causes of damage to herbarium specimens, each treated as a response variable. The first is damage that occurred before the specimen collection process, caused by natural factors in nature (natural damage/BP). The second is damage caused during the collection process or the remounting of herbarium specimens (in-process damage/IP). The third is damage caused by insects at the specimen storage location (damage by insects).
Systematic identification of damage in a herbarium specimen is based on the number of damage spots together with the source of damage (BP, IP, or insect). The response variable in the study is therefore count data. The Kolmogorov-Smirnov test was applied to assess distribution fit inferentially [27]. Because the response is a count, the usual linear regression approach cannot be used to model the phenomena in this study. The Generalized Linear Model (GLM) approach can model data whose relationship with the predictors is not linear. Modeling with a GLM requires three main components: a random component, a systematic component, and a link function [19]. There are at least two regression approaches for a count response: Generalized Poisson Regression and Negative Binomial Regression [20].
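To make the distribution-fit idea concrete, the sketch below computes a Kolmogorov-Smirnov distance between the empirical distribution of a count sample and a Poisson distribution fitted by the sample mean. It is an illustration under assumptions: the source does not specify how the test was implemented, the function names are hypothetical, no p-value is computed, and the classical Kolmogorov-Smirnov reference distribution is only approximate for discrete data.

```python
import math


def poisson_cdf(k, mu):
    """P(Y <= k) for a Poisson distribution with mean mu > 0."""
    return sum(
        math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def ks_statistic_poisson(counts):
    """Largest gap between the empirical CDF and a mean-fitted Poisson CDF."""
    n = len(counts)
    mu = sum(counts) / n
    if mu <= 0:
        raise ValueError("all counts are zero; the Poisson mean must be positive")
    gap = 0.0
    # Both CDFs are step functions that jump only at integers,
    # so it is enough to compare them at 0, 1, ..., max(counts).
    for k in range(max(counts) + 1):
        ecdf = sum(1 for c in counts if c <= k) / n
        gap = max(gap, abs(ecdf - poisson_cdf(k, mu)))
    return gap
```

A small distance indicates that the candidate distribution tracks the data closely; a sample dominated by zeros with a few very large counts would produce a large distance from the Poisson fit.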
Generalized Poisson Regression (GPR) has been shown to model count responses well [20]. As the name implies, the response variable in the GPR model follows the Generalized Poisson distribution (GPD). The GPD handles both overdispersed and underdispersed data [20][21]. Another approach to modeling count data is Negative Binomial Regression. The negative binomial distribution is a Poisson-Gamma mixture. It can accommodate the overdispersion that Poisson regression cannot, because it does not assume equidispersion [20][22].
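The Poisson-Gamma mixture can be written out explicitly. The notation below is an assumption introduced for illustration (it does not appear in the source): μ is the mean count and r is the shape parameter of the gamma distribution.

```latex
% Assumed notation (not from the source): \mu = mean count, r = gamma shape.
\begin{align}
Y \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}\!\left(\text{shape } r,\ \text{mean } \mu\right), \\
P(Y = y) &= \frac{\Gamma(y + r)}{\Gamma(r)\, y!}
  \left(\frac{r}{r + \mu}\right)^{r} \left(\frac{\mu}{r + \mu}\right)^{y},
  \qquad y = 0, 1, 2, \dots, \\
\mathrm{E}[Y] &= \mu, \qquad \mathrm{Var}(Y) = \mu + \frac{\mu^{2}}{r}.
\end{align}
```

Since the variance exceeds the mean for every finite r, the model allows overdispersion, and it approaches the Poisson case as r grows large.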
This study explores the effect of species type, time, location, storage, and remounting status on the level of damage to herbarium specimens (HS), specifically those in the genus Excoecaria. In each model, the response was the number of damage spots from one source: damage before the collecting process (BP), damage caused by the collecting process (IP), or damage caused by insects in the storage collection (Insect). There are thus three models of the level of damage to herbarium specimens. The first model, logit(BP), is a function of an intercept (a), species type (categorical variable), age of collection (numeric variable), and origin of species (categorical variable). The second model, logit(IP), the level of damage due to the collection/remounting process, is a function of an intercept, species type (categorical variable), age of collection (numeric variable), origin of species (categorical variable), collection storage location (categorical variable), the number of damage spots incurred before collection (BP), and the number of damage spots caused by insects in the storage collection. For this second model, the samples used are only those herbarium specimens that have undergone a remounting process. The third model, logit(insect), is a function of an intercept, species type (categorical variable), age of collection (numeric variable), origin of species (categorical variable), collection storage location (categorical variable), remounting status, the number of damage spots incurred before the collecting process (BP), and the number of damage spots caused by insects in the storage collection (insect).
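For orientation, one plausible written form of the first model is sketched below. This is an assumption-laden illustration, not the authors' equation: it assumes a negative binomial response with a log link (the usual choice for count regression, although the text writes "logit"), dummy-coded categorical predictors, and coefficient symbols chosen here for convenience.

```latex
% Sketch under assumptions: log link, dummy-coded categorical predictors.
% Coefficient symbols (\beta, \gamma, \delta) are illustrative, not from the source.
\begin{align}
\mathrm{BP}_i &\sim \mathrm{NegBin}(\mu_i, r), \\
\log \mu_i &= a + \sum_{s} \beta_{s}\, \mathrm{species}_{is}
            + \gamma\, \mathrm{age}_i
            + \sum_{o} \delta_{o}\, \mathrm{origin}_{io}.
\end{align}
```

Under the same assumptions, the IP and insect models would add terms for the further predictors listed in the text to the right-hand side.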
This study observed four species of the genus Excoecaria: Excoecaria agallocha, Excoecaria cochinchinensis, Excoecaria humilis, and Excoecaria oppositifolia. The specimens originated from nine locations: Borneo, Celebes, Java, Kawasan_II, Malaypen, Moluccas, New Guinea, the Philippines, and Sumatra. There are also nine different collection storage locations in this research. The remounting status variable states whether or not a specimen has previously been remounted. All explanatory variables in this study are descriptions or labels (metadata) of a herbarium specimen. After data cleansing, 175 herbarium specimens remained for further analysis. The entire sample was then modeled with the three models described previously. A pre-analysis was conducted to examine the relationship between the response variables (BP, IP, and insects). If there is a significant correlation between them, multivariate modeling is necessary. If there is no significant correlation between the response variables, univariate modeling (a separate model for each response variable) is sufficient.
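The pre-analysis decision rule can be sketched as follows. The source does not state which correlation coefficient or significance test was used, so this example assumes Pearson correlation with a permutation test; all function names and the 5% threshold default are illustrative.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the hypothesis of zero correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def choose_modeling(x, y, alpha=0.05):
    """Return 'multivariate' if the pair is significantly correlated, else 'univariate'."""
    return "multivariate" if permutation_p_value(x, y) < alpha else "univariate"
```

In the study's setting, the rule would be applied to each pair of responses (BP and IP, BP and insect, IP and insect), with univariate modeling retained only if no pair is significantly correlated.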
After the relationship between the response variables has been assessed, the data are modeled with several schemes, namely GPR and Negative Binomial Regression. As a comparison, multiple linear regression is also fitted. The Akaike Information Criterion (AIC) is used to assess which model best describes the phenomena in this study. The lower the AIC value, the better the model [22].
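The AIC comparison can be illustrated with intercept-only Poisson and negative binomial fits. This is a simplified sketch, not the study's analysis: the counts are hypothetical, there are no predictors, the negative binomial shape r is found by a coarse grid search rather than a full optimizer, and the function names are illustrative.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def poisson_loglik(counts, mu):
    """Log-likelihood of counts under Poisson(mu)."""
    return sum(-mu + y * math.log(mu) - math.lgamma(y + 1) for y in counts)


def negbin_loglik(counts, mu, r):
    """Log-likelihood of counts under a negative binomial with mean mu and shape r."""
    return sum(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu))
        for y in counts
    )


# Hypothetical damage-spot counts (illustrative only, not study data).
counts = [0, 1, 0, 3, 12, 0, 2, 25, 1, 0, 7, 4]
mu_hat = sum(counts) / len(counts)  # the sample mean maximizes both likelihoods in mu

# Coarse grid search for the shape parameter r (0.05, 0.10, ..., 20.0).
r_grid = [0.05 * i for i in range(1, 401)]
r_hat = max(r_grid, key=lambda r: negbin_loglik(counts, mu_hat, r))

aic_poisson = aic(poisson_loglik(counts, mu_hat), n_params=1)
aic_negbin = aic(negbin_loglik(counts, mu_hat, r_hat), n_params=2)
print(f"AIC Poisson: {aic_poisson:.1f}, AIC negative binomial: {aic_negbin:.1f}")
```

The negative binomial model pays a penalty for its extra parameter, so it is preferred only when the gain in log-likelihood outweighs that penalty, as happens with strongly overdispersed counts like these.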
After the best model has been selected by the lowest AIC value, the next stage tests which explanatory variables significantly affect it. This study applies a partial F test for this purpose. The partial F test compares the full model (with all explanatory variables) with a reduced model (without the explanatory variable being tested). The idea is to measure the change in goodness of fit when one explanatory variable is omitted [28]. For categorical variables, the Wald test was additionally used to see which level of the variable had the most significant impact on the damage to herbarium specimens [29].
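For reference, the two test statistics take the following standard forms. The notation is assumed here for illustration: n is the number of observations, p the number of parameters in the full model, and q the number of parameters dropped in the reduced model. The first line is the classical linear-model form; for a GLM the residual sums of squares are commonly replaced by deviances, and the source does not state which variant was used.

```latex
\begin{align}
F &= \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}) / q}
          {\mathrm{RSS}_{\text{full}} / (n - p)}, \\
z_j &= \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}.
\end{align}
```

A large F indicates that omitting the variable worsens the fit appreciably, while the sign of the Wald statistic for a level of a categorical variable gives the direction of its effect relative to the reference level.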

III. Results and Discussion
A. Exploratory Data Analysis
In this study, the causes of damage were divided into three categories: first, damage from natural processes that occur while the specimen is still in nature (natural damage/before the collecting process); second, damage caused during the specimen collection process (in-process damage); and third, damage to herbarium specimens caused by insects at the collection storage location (preservation damage by insects). To determine the modeling procedure, the first step is to evaluate the correlations among the sources of damage. When there is a significant relationship between response variables, a multivariate analysis is preferable. If there is no correlation between the response variables (the sources of damage to the specimen), a separate model for each response (univariate analysis) is used. Table 1 shows the correlations between the sources of damage to herbarium specimens; the P-values exceed α (5%), which indicates no significant correlation between the response variables. A univariate analysis procedure was therefore applied in this study. Figure 3 shows a comparison plot of the number of damage events for each pair of sources of damage: (a) natural damage (before process) and in-process damage; (b) natural damage and preservation damage by insects; and (c) in-process damage and preservation damage by insects. The figure shows the number of damage points on the herbarium specimens. Due to the collection process, the distribution pattern of damage points on herbarium specimens looks the same as the distribution pattern of collection damage points due to   [26].
Excoecaria agallocha is the most abundant species in the collection analyzed in this study (121 out of 175). In contrast, there were only four samples of Excoecaria oppositifolia. Because these specimens were given to the Herbarium Bogoriense by other researchers, some species and places are represented by only a few specimens. Figure 4 shows that the data are not normally distributed. Figure 5 shows the distribution of damage for each of the analyzed species. Figure 5a shows that the highest level of damage before the collection process occurred in Excoecaria cochinchinensis. However, we cannot yet conclude that this species was the most severely damaged before the collection process, because its box plot overlaps with those of Excoecaria agallocha and Excoecaria humilis. For damage caused by the remounting process (Figure 5b), the pattern differs: the greatest damage occurred in Excoecaria oppositifolia. Damage caused by insects in the collection storage area is more homogeneous (Figure 5c); visually, the level of damage is about the same for each species. These visual findings need to be confirmed inferentially to obtain valid conclusions.
Figure 6 explores visually whether differences in specimen origin affect the damage to herbarium specimens. It shows no clear differences in the degree of damage between specimen origins in Figure 6a and Figure 6c. Figure 6b is different: specimens from the Malay Peninsula have the highest level of damage compared to specimens from other origins. As with the species variable, the specimen origin variable needs to be tested inferentially to establish whether its effect on the level of damage is significant. The influence of the other variables on specimen damage also needs to be examined at the modeling stage.

B. Model Fitting
Testing the distribution of each damage response is a critical step in selecting a suitable model for analysis. Because the damage counts for all sources of specimen damage are not normally distributed, as shown in Figure 4, Poisson or Negative Binomial models can be used in this investigation. Table 2 shows the Kolmogorov-Smirnov distribution fit test results for those models.
The P-value for the Negative Binomial distribution exceeded 5% for all sources of damage, so this distribution is not rejected for any source, which indicates that the Negative Binomial is the most suitable model for studying the factors that cause specimen collection damage. The comparison of AIC values between Multiple Linear Regression, Generalized Poisson Regression, and Negative Binomial Regression confirms this: Negative Binomial Regression obtains the best (lowest) AIC, as shown in Table 3.
Table 4 (partial F test) shows that the explanatory variables specimen origin and species significantly affect the level of specimen damage before the collection process. The Wald test was then carried out, as shown in Table 5, to see which level of each explanatory variable significantly affected the level of specimen damage before the collection process. This test also shows the direction of the influence of each explanatory variable.
Excoecaria cochinchinensis is a species that significantly affects the damage to herbarium specimens (BP). The positive estimated coefficient of this variable indicates that this species is more vulnerable to damage than the other three species; that is, natural damage was more common in the specimens of E. cochinchinensis. Table 6 shows that the storage place significantly affects the damage during the remounting process. It indicates that different storage locations can affect the level of specimen damage due to this technical factor (remounting). No_PH7 has a higher level of damage due to remounting than other storage areas (see Table 7).
Modeling with the level of insect damage at the storage location as the response shows that only the storage area and the level of natural damage have a significant effect (Table 8).
The Wald test in Table 9 shows the direction of the influence of the level of natural damage and of the specimen's storage place. The more a specimen is damaged by natural factors, the higher its level of insect damage in the storage location. Meanwhile, storage locations No_PH10 and No_PH15 had a significant negative effect on the level of specimen damage due to insects, which means that both storage areas have a lower level of damage than other storage areas.

C. Discussion
Excoecaria cochinchinensis is a species that significantly affects the damage to herbarium specimens (BP). This species has the highest level of damage before the collection process compared to the other species (the highest level of natural damage). The specimen's origin also significantly determines how susceptible the specimen is to damage before the collection process. Specimens from the locations identified in the analysis, and specimens of Excoecaria cochinchinensis, therefore need more careful treatment in future collecting. The damage caused by the remounting process is primarily associated with the specimen storage area. The quality of the storage areas differs, which indicates non-uniformity in the management of storage media. Meanwhile, the damage caused by insects at the collection storage location depends on where the specimen is stored and on the specimen's level of damage before the collection process (natural damage). Storage areas appear to affect the rate of insect damage significantly. This points to poor quality in certain storage places and, in other words, to the need for standardized specimen management. In addition, specimens that were already damaged before the collection process are more likely to be damaged by insects when stored.

IV. Conclusion
This study explored the effect of species type, time, location, storage, and remounting status on the level of damage to herbarium specimens (HS), specifically those in the genus Excoecaria. The responses were the numbers of damage spots from BP, IP, and insect damage, modeled with Negative Binomial Regression (NBR), Generalized Poisson Regression (GPR), and ordinary Gaussian regression. The results show that regression based on the normal distribution is not adequate for modeling the damage phenomenon in herbarium specimens. Methods based on distributions for count data (the number of damage spots on herbarium specimens), and Negative Binomial Regression in particular, model the damage better than GPR and ordinary Gaussian regression.
Negative Binomial Regression detected non-uniformity in the storage process. The storage location significantly affects damage to herbarium specimens (both damage caused by insects and damage caused by the remounting process). The procedure for storing herbarium specimens therefore needs to be standardized. Meanwhile, damage due to natural factors is associated with differences between species. BO management needs to give particular attention to specimens of Excoecaria cochinchinensis.