Multivariate Analysis Approach to Factor-Affected Tuberculosis Disease

ABSTRACT
Tuberculosis is a disease caused by infection with the Mycobacterium tuberculosis complex. Besides the lungs, tuberculosis can attack other organs, such as the pleura, the lining of the brain, the lining of the heart, lymph glands, bones, joints, skin, intestines, kidneys, urinary tract, and genitals. The disease is found in densely populated settlements with poor sanitation, a lack of ventilation and sunlight, and a lack of rest. The factors analyzed in this research are population density (X1), number of HIV/AIDS cases (X2), number of toddlers who experience nutrition (X3), number of toddlers who received BCG immunization (X4), number of toddlers who are exclusively breastfed (X5), number of families with PHBS (X6), number of residents with healthy homes (X7), number of families with clean water facilities (X8), number of families with latrine sanitation (X9), number of families with landfills (X10), number of families with waste management sites (X11), number of elementary education facilities (X12), number of junior high school education facilities (X13), number of senior high school education facilities (X14), number of institutions fostered by neighborhood health (X15), number of Posyandu (X16), life expectancy (X17), literacy rate (X18), Human Development Index (X19), and number of tuberculosis sufferers (X20). This research aims to analyze which variables influence the prevalence rate of tuberculosis in the city of Surabaya. The method used is multivariate analysis, comprising factor analysis, cluster analysis, biplot analysis, and discriminant analysis. The discriminant analysis determines accuracy by calculating the value (1-APER). The results show that the number of HIV/AIDS cases, the number of residents with healthy homes, and the number of families with sanitation facilities (latrines, landfills, waste management) have a high correlation with the spread of tuberculosis in Surabaya. The areas with a high rate of tuberculosis are Tambaksari, Wonokromo, Sawahan, and Semampir. The classification accuracy was 90.32%, so the accuracy of the resulting discriminant function was very high, and discriminant analysis can be used to predict tuberculosis prevalence rates.

I. Introduction
Tuberculosis is an infectious disease caused by Mycobacterium tuberculosis, which mainly attacks the lungs but can also attack other organs [1]. This disease is a problem for developing countries with declining socioeconomic conditions [2]. The prevalence rate of pulmonary tuberculosis in Indonesia is 130 per 100,000 [3]. Every year, there are 539,000 new cases and around 101,000 deaths [4]. The incidence of AFB (Acid-Fast Bacilli) positive pulmonary tuberculosis is around 110 per 100,000 population [5]. TB (tuberculosis) is the third leading cause of death, after heart disease and respiratory disease [6]. According to [5], Indonesia ranks fifth, after India, China, South Africa, and Nigeria.
The leading causes of the growing tuberculosis problem are declining socioeconomic conditions in developing countries [7], environmental conditions inside and outside the home that favor the occurrence of TB [8], demographic changes due to the increasing world population and changes in the age structure of the population [9], and the impact of the HIV/AIDS pandemic [1]. Tuberculosis programs have not been implemented optimally because of poor health infrastructure in countries that experienced an economic crisis and shortcomings in tuberculosis services (services that are less accessible to the public, no guaranteed provision of anti-tuberculosis drugs (OAT), and non-standard monitoring, recording, and reporting) [1].
The prevalence rate of tuberculosis is not only a medical problem; socioeconomic conditions and environmental factors also have an influence [11]. For example, people with a low socioeconomic status tend to live in slum areas, in unhealthy houses with poor air circulation, no sanitation, poor nutrition, and a lack of clean water. According to research conducted by Sejati and Sofiana [12], people with family incomes below the minimum wage have a 1.123 times higher risk of being infected with TB than those above the minimum wage. Education level is one of the factors that influence the incidence of tuberculosis [13]. The higher a person's education level, the lower the incidence of tuberculosis [14], because someone with a good education receives more information about tuberculosis, absorbs it well, and is able to seek proper treatment. Apart from education, sunlight entering the house and the ventilation of the house also influence the incidence of pulmonary tuberculosis [15].
Surabaya is the second largest city in Indonesia, with an area of approximately 326.37 km2; administratively, it is divided into 31 districts and 163 sub-districts, with a population of approximately 2,912,197 people [16]. Based on [17], Surabaya has the highest number of tuberculosis cases in East Java. At least 4,493 residents of Surabaya have tuberculosis. This disease is found in densely populated settlements with poor sanitation, a lack of ventilation and sunlight, and a lack of rest [18]. TB cases in Surabaya are substantial compared to other cities [19], so research on the factors influencing the tuberculosis prevalence rate is needed. Different characteristics, such as economic conditions and sociocultural factors in each region of Surabaya, lead to differences in health quality [20], so it is necessary to group areas by their tuberculosis incidence characteristics. The goal of this research is to find out which factors influence the prevalence rate of tuberculosis in the city of Surabaya and to group regions based on the characteristics of tuberculosis incidence, in the hope that this research can help the Surabaya city government handle tuberculosis quickly and accurately.
The analysis technique used is multivariate analysis, which examines more than two variables simultaneously. Multivariate approaches are divided into two main groups, namely dependence and interdependence methods [21]. This research was carried out using an interdependence approach. The multivariate methods used are factor analysis, cluster analysis, biplot analysis, and discriminant analysis. Factor analysis reduces the variables to a smaller number of new variables. Cluster analysis groups the observed areas based on the number of tuberculosis cases and the factors influencing them. Biplot analysis shows the closeness between objects, the variables that characterize each object, and the relationships between variables. Discriminant analysis determines the differentiating variables and the classification accuracy of the groupings obtained. All factors examined are independent variables. The variables are grouped into new variables, the sub-districts are grouped based on their characteristics, the sub-district areas are mapped, and the classification accuracy of the factors used is assessed. By knowing which factors influence the prevalence rate of tuberculosis and how many tuberculosis sufferers each area has, it is hoped that the government will respond more quickly in handling tuberculosis.

II. Methods
The analysis steps for this research are presented in Figure 1. The data used are secondary data from the health services, Badan Pusat Statistik (BPS), and Badan Perencanaan Pembangunan Kota Surabaya (BAPPEKO) [22]. The data relate to the prevalence rate of tuberculosis in the city of Surabaya. The observation units are the 31 sub-districts of Surabaya, namely Krembengan, Gubeng, Tegalsari, Bubutan, Simokerto, Kenjeran, Tandes, Rungkut, Sukolilo, Mulyorejo, Sukomanunggal, Lakasantri, Gayungan, Genteng, Tenggilis, Karang Pilang, Wonocolo, Gunung Anyar, Dukuh Pakis, Jambangan, Bulak, Wiyung, Asemrowo, Benowo, Pakal, Sambikerep, Pabean Cantikan, Tambaksari, Wonokromo, Sawahan, and Semampir.
The epidemiological factors for tuberculosis are BCG vaccination, inaccurate diagnosis, inadequate treatment, control programs that are not implemented appropriately, endemic HIV infection, population migration, self-medication, increasing poverty, and inadequate health services [23]. Equally important in TB epidemiology is low socioeconomic status: low income, overcrowded housing, unemployment, and low education [24]. The variables examined in this research are therefore population density (X1), number of HIV/AIDS cases (X2), number of toddlers who experience nutrition (X3), number of toddlers who received BCG (Bacillus Calmette-Guerin) immunization (X4), number of toddlers who are exclusively breastfed (X5), number of families with PHBS (Clean and Healthy Living Behavior) (X6), number of residents with healthy homes (X7), number of families with clean water facilities (X8), number of families with latrine sanitation (X9), number of families with landfills (X10), number of families with waste management sites (X11), number of elementary education facilities (X12), number of junior high school education facilities (X13), number of senior high school education facilities (X14), number of institutions fostered by neighborhood health (X15), number of Posyandu (Integrated Healthcare Center) (X16), life expectancy (X17), literacy rate (X18), Human Development Index (X19), and number of tuberculosis sufferers (X20).
The clustering methods used are single linkage, complete linkage, average linkage, and Ward's method. The single linkage method defines the distance between two clusters as the smallest distance between an object in one cluster and an object in the other (the nearest-neighbor rule) [25]. The complete linkage method (farthest-neighbor method) uses the largest distance between two objects in different clusters [26]. Ward's method aims to obtain clusters with the smallest possible within-cluster variance [27]. This method is very commonly used in determining clusters. It calculates the mean of each cluster and then the Euclidean distance between each object and its cluster mean.
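The four linkage rules named above can be compared with a short sketch. This is only an illustration under stated assumptions: it uses SciPy's `linkage` and `fcluster`, the helper name `cluster_subdistricts` and the toy data are hypothetical, and standardizing the variables first is an assumption that the paper does not state. SciPy's Ward linkage works on Euclidean distances and merges the pair of clusters that gives the smallest increase in within-cluster variance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_subdistricts(X, method="ward", n_groups=4):
    """Standardize the columns of X, build a linkage tree, cut it into n_groups."""
    X = np.asarray(X, dtype=float)
    # Assumption: variables are standardized so that no single scale dominates.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    tree = linkage(Z, method=method, metric="euclidean")
    # "maxclust" cuts the dendrogram so that at most n_groups clusters remain.
    return fcluster(tree, t=n_groups, criterion="maxclust")

# Hypothetical toy data: 12 "sub-districts", 3 indicators, four separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
toy = np.vstack([c + rng.normal(scale=0.1, size=(3, 3)) for c in centers])

labels_by_method = {
    m: cluster_subdistricts(toy, method=m)
    for m in ("single", "complete", "average", "ward")
}
```

On data this well separated all four rules agree; on real data they can give different groupings, which is why comparing them is informative.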

III. Result and Discussion
Figure 2 shows a map of the number of tuberculosis patients in Surabaya. Purple marks the group of sub-districts with the lowest number of tuberculosis cases, ranging from 16 to 61: Pakal, Benowo, Asemrowo, Sambikerep, Lakasantri, Dukuh Pakis, Wiyung, Karang Pilang, Jambangan, Gayungan, Gunung Anyar, Mulyorejo, Bulak, and Tenggilis Mejoyo. White marks the sub-districts in the moderate category, ranging from 61 to 114: Tandes, Sukomanunggal, Pabean Cantikan, Bubutan, Simokerto, Genteng, Tegalsari, Gubeng, Wonokromo, Wonocolo, Rungkut, and Sukolilo. Brown marks the sub-districts with the highest number of tuberculosis cases, ranging from 114 to 201: Sawahan, Krembengan, Semampir, Kenjeran, and Tambaksari.
Factor analysis is used to reduce the data dimensions, explaining as much of the diversity in the data as possible with a set of variables smaller than the initial set, without losing the important information contained therein.
The inter-correlation test uses the Bartlett test, and data adequacy is assessed with KMO. The Kaiser-Meyer-Olkin (KMO) test is a statistical measure of how suited the data are for factor analysis [28]. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic measures the proportion of variance among variables that might be common variance. The higher the proportion, the higher the KMO value, and the more suited the data are to factor analysis [29]. The correlation testing hypotheses are as follows.
H0 : ρ = I (the variables in the data on the factors that influence tuberculosis are not correlated)
H1 : ρ ≠ I (the variables in the data on the factors that influence tuberculosis are correlated)
Table 1 shows that the Chi-Square value for the factors that influence tuberculosis is 649.145, with a P-value of 0.000. H0 is rejected because the P-value (0.000) < α (0.05), so it can be concluded that the variables that affect tuberculosis are correlated. The KMO value of the data is 0.771. Because the KMO value (0.771) > 0.5, the hypothesis of data adequacy is not rejected, which means that the data on the factors that influence tuberculosis pass the data adequacy test and can be analyzed further. From Table 2, it is known that there are four mutually independent factors, with a cumulative variance of 78.357%. Each variable is assigned to a factor group by selecting its largest loading among factors 1, 2, 3, and 4.
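The Bartlett and KMO checks described above can be sketched from a correlation matrix as follows. This is a minimal illustration using the textbook formulas, not the software output behind Table 1; the function names and the 4 x 4 demo matrix are hypothetical, and only the sample size of 31 is taken from the number of sub-districts.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    """Bartlett's test of H0: the p x p correlation matrix R is the identity."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)           # log|R|, numerically stable
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof, chi2.sf(statistic, dof)

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure computed from a correlation matrix."""
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    partial = -P / np.outer(d, d)              # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)      # off-diagonal mask
    r2 = np.sum(R[off] ** 2)
    q2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + q2)

# Hypothetical demo: 4 variables with pairwise correlation 0.5, n = 31 units.
R_demo = np.full((4, 4), 0.5)
np.fill_diagonal(R_demo, 1.0)
stat, dof, p_value = bartlett_sphericity(R_demo, n=31)
kmo_value = kmo(R_demo)
```

A small p-value rejects the hypothesis that the variables are uncorrelated, and a KMO value above 0.5 is the adequacy threshold used in the text.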
The loadings used are the varimax-rotated factor loadings. Table 3 shows that the variables grouped in factor 1 are HIV/AIDS, the number of children under five who received BCG immunization, the number of residents who have healthy homes, families who have clean water facilities, the number of families with sanitation facilities (latrines, landfills, waste management sites), the number of Posyandu, and the number of TB patients. Factor 1 reflects the quality of a person's health and is very prominent in the spread of tuberculosis in Surabaya. Factor 2 includes population density, exclusive breastfeeding, clean and healthy living behavior (PHBS), and educational facilities (elementary school, junior high school, senior high school). Factor 2 reflects demography and education. Factor 3 includes life expectancy, literacy rate, and the Human Development Index. Factor 3 reflects human development. Factor 4 includes the number of toddlers who experience nutrition.
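The rotation and assignment step can be sketched as below. This is a commonly used SVD-based varimax iteration written for illustration, not the routine of the statistical package used in the study; the `varimax` helper and the demo loadings are hypothetical. Each variable is then assigned to the factor on which its rotated loading is largest in absolute value, as described for Table 3.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a p x k loading matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Gradient-like target of the varimax criterion.
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        rotation = u @ vt                      # closest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return L @ rotation

# Hypothetical loadings: a clean two-factor structure, blurred by a 30 degree turn.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.5], [0.0, 0.4]])
theta = np.deg2rad(30.0)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
unrotated = simple @ Q

rotated = varimax(unrotated)
# Assign each variable to the factor with the largest absolute loading.
assigned_factor = np.argmax(np.abs(rotated), axis=1)
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the distribution of the loadings across factors changes.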
The cluster analysis used in this study is Ward's linkage method with squared Euclidean distance. In Figure 3, the dendrogram is cut into four groups, and the 31 sub-districts in Surabaya are grouped as in Table 4.

Fig. 3. A map of the number of tuberculosis patients in Surabaya
Table 4 shows that Group 1 consists of 11 sub-districts, Group 2 of 9 sub-districts, Group 3 of 7 sub-districts, and Group 4 of 4 sub-districts, namely Tambaksari, Wonokromo, Sawahan, and Semampir. Figure 4 depicts health, demography and education, HDI, and nutrition. It shows that the Krembengan, Semampir, Sukolilo, Kenjeran, Wonokromo, and Tambaksari sub-districts have high health quality, high demography, and high education. The Tegalsari, Mulyorejo, Pabean Cantikan, Genteng, Simokerto, Gubeng, Wonocolo, and Sukomanunggal sub-districts have high demography and education but low health quality. The Tenggilis, Sambikerep, Dukuh Pakis, Benowo, Pakal, Jambangan, Gayungan, Lakasantri, Bulak, and Asemrowo sub-districts have low health quality, low demography, and low education. The Rungkut, Tandes, Gunung Anyar, Karang Pilang, and Sawahan sub-districts have high health quality, high demography, and low education.
Biplot analysis is then used to map the sub-districts according to the tendencies of the variables that influence tuberculosis. Figure 6 shows that the waste management sites variable (X11) has the greatest diversity because its vector is the longest. By contrast, the nutrition variable (X3) has the smallest diversity, or tends to be homogeneous, because its vector is the shortest. The variables that have a positive correlation are the number of toddlers who experience nutrition (X3), literacy rate (X18), clean water facilities (X8), HIV/AIDS (X2), healthy homes (X7), latrine sanitation (X9), landfills (X10), waste management sites (X11), BCG immunization (X4), number of Posyandu (X16), environmental health development institutions (X15), number of TB sufferers (X20), exclusive breastfeeding (X5), elementary education facilities (X12), junior high school education facilities (X13), senior high school education facilities (X14), population density (X1), clean and healthy living behavior (X6), and HDI (X19). The variable that has a negative correlation is life expectancy (X17). The variables waste management sites, landfills, latrine sanitation, healthy homes, BCG immunization, exclusive breastfeeding, clean and healthy living behavior, elementary and junior high school education facilities, population density, senior high school education facilities, clean water facilities, nutrition, literacy rate, HDI, HIV/AIDS, number of Posyandu, health development institutions, and number of TB sufferers contribute strongly to the sub-districts of Tambaksari, Wonokromo, Kenjeran, Semampir, and Krembengan. The life expectancy variable contributes to the sub-districts of Asemrowo, Bulak, Jambangan, Gayungan, Lakasantri, Pakal, Tenggilis, Benowo, Dukuh Pakis, and Sambikerep. Meanwhile, in the sub-districts of Sawahan, Rungkut, Tandes, Gunung Anyar, Karang Pilang, Benowo, Tenggilis Mejoyo, Pabean Cantikan, Genteng, Bubutan, Tegalsari, Simokerto, Gubeng, and Sukomanunggal, none of the variables that affect tuberculosis dominates.
Figure 7 shows that life expectancy (X17) has the greatest diversity because its vector is the longest. The population density variable (X1) has the smallest diversity, or tends to be homogeneous, because its vector is the shortest. The variables that have a positive correlation are the Human Development Index (X19), junior high school education facilities (X13), senior high school education facilities (X14), population density (X1), health development institutions (X15), number of Posyandu (Integrated Healthcare Center) (X16), HIV/AIDS (X2), BCG immunization (X4), waste management sites (X11), latrine sanitation (X9), healthy homes (X7), landfills (X10), clean water facilities (X8), exclusive breastfeeding (X5), elementary education facilities (X12), number of toddlers who experience nutrition (X3), literacy rate (X18), number of TB sufferers (X20), and clean and healthy living behavior (X6). The variable that has a negative correlation is life expectancy (X17). The variables Human Development Index, junior and senior high school education facilities, population density, health development institutions, number of Posyandu, HIV/AIDS, number of babies immunized with BCG, waste management sites, latrine sanitation, healthy homes, landfills, number of families with clean water facilities, number of babies who are exclusively breastfed, elementary school education facilities, number of toddlers who experience nutrition, literacy rate, number of TB sufferers, and clean and healthy living behavior contribute strongly to the sub-districts of Sawahan, Tandes, Krembengan, Wonokromo, Rungkut, Sukolilo, Tambaksari, Gunung Anyar, Kenjeran, Semampir, and Sukomanunggal. The life expectancy variable contributes significantly to the sub-districts of Gubeng, Tegalsari, Genteng, Lakasantri, Gayungan, Wonocolo, Dukuh Pakis, Jambangan, and Tenggilis Mejoyo. In the sub-districts of Bubutan, Asemrowo, Karang Pilang, Mulyorejo, Bulak, Simokerto, Pakal, Benowo, Pabean Cantikan, Wiyung, and Sambikerep, none of the variables that affect tuberculosis dominates.
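The biplot readings above (vector length as diversity, direction as correlation) follow from a singular value decomposition of the centered data matrix. The sketch below is one common construction and is only an assumption about how such a biplot can be built; the function name, the `alpha` scaling parameter, and the toy data are hypothetical. With `alpha=0`, the squared length of a variable vector approximates the variable's sum of squares about its mean, and the cosine of the angle between two vectors approximates their correlation.

```python
import numpy as np

def biplot_coordinates(X, alpha=0.0):
    """Two-dimensional biplot markers for objects (rows) and variables (columns)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    objects = U[:, :2] * s[:2] ** alpha              # one point per object
    variables = Vt.T[:, :2] * s[:2] ** (1 - alpha)   # one vector per variable
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()  # share of variation shown
    return objects, variables, explained

# Hypothetical toy data: 6 objects measured on 3 variables.
toy = np.array([[2.0, 10.0, 1.0],
                [3.0, 14.0, 1.2],
                [5.0, 19.0, 0.9],
                [6.0, 25.0, 1.1],
                [8.0, 31.0, 1.0],
                [9.0, 36.0, 0.8]])
points, vectors, share = biplot_coordinates(toy)
vector_lengths = np.linalg.norm(vectors, axis=1)     # longer = more diverse
```

If the variables were standardized before plotting, differences in vector length would mainly reflect how well each variable is represented in two dimensions rather than its raw variance, so the scaling choice matters when interpreting the figures.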
The sub-districts of Tandes, Semampir, Kenjeran, Rungkut, Krembengan, Sukolilo, Gunung Anyar, Sawahan, Tambaksari, and Wonokromo contribute to the variables nutrition, clean water facilities, healthy homes, educational facilities for elementary, junior high, and senior high school, latrine sanitation, landfills, waste management, number of people living with HIV/AIDS, health development institutions, exclusive breastfeeding, number of TB sufferers, population density, BCG immunization, clean and healthy living behavior, HDI, literacy rate, and number of Posyandu. Karang Pilang, Jambangan, Sukomanunggal, Wiyung, Benowo, Sambikerep, Lakasantri, Pakal, Bulak, Pabean Cantikan, and Genteng contribute significantly to the life expectancy variable. The sub-districts of Gubeng, Tegalsari, Asemrowo, Bubutan, Wonocolo, Mulyorejo, Gayungan, Simokerto, and Tenggilis Mejoyo make no contribution to the factors that influence tuberculosis.
Before proceeding to discriminant analysis, the assumptions of multivariate normality and homogeneity of the variance-covariance matrices are tested. The multivariate normality assumption is tested to determine whether the data are normally distributed [30]. The primary requirement for conducting multivariate analysis is that the data follow a multivariate normal distribution.
From Figure 9, it can be concluded that the data are normally distributed. Visually, the QQ plot tends to form a straight line, so the assumption that the data follow a multivariate normal distribution is accepted. The results of the test of homogeneity of the variance-covariance matrices can be seen in Table 5. H0 is rejected if the P-value is less than 0.05 (this study uses a 95% confidence level). From the test results, it can be concluded that the data analyzed have the same covariance matrix, so the discriminant analysis can proceed.
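A QQ plot for multivariate normality is usually built from squared Mahalanobis distances compared with chi-square quantiles. The sketch below shows that construction as an assumption about how a plot like Figure 9 can be produced; the function name and the simulated data are hypothetical. Under multivariate normality the paired points lie roughly on a straight line, and about half of the distances fall below the chi-square median.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_qq(X):
    """Pairs (chi-square quantile, ordered squared Mahalanobis distance)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    diff = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))    # sample covariance, ddof=1
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)   # squared distances
    probs = (np.arange(1, n + 1) - 0.5) / n            # plotting positions
    return chi2.ppf(probs, df=p), np.sort(d2)

# Hypothetical data: 31 units, 3 variables, simulated from a normal distribution.
rng = np.random.default_rng(1)
sample = rng.normal(size=(31, 3))
quantiles, ordered_d2 = chi_square_qq(sample)
share_below_median = np.mean(ordered_d2 <= chi2.ppf(0.5, df=3))
```

Plotting `ordered_d2` against `quantiles` gives the QQ plot; strong curvature or a few very large distances would argue against the normality assumption.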
From the discriminant analysis using the stepwise method, Table 6 shows the results obtained. Of the 19 variables entered in the grouping, only four meet the criteria as differentiators: elementary education, landfills, exclusive breastfeeding, and literacy rate (AMH). So it can be concluded that the factors distinguishing the tuberculosis groups in Surabaya are education, exclusive breastfeeding, literacy, and sanitation. The variable with the largest coefficient contributes most to differentiating the groups. Function one shows that elementary school education plays a role in distinguishing the first and second groups. The second function shows that the literacy rate variable significantly differentiates the second and third groups. The third function shows that the exclusive breastfeeding variable has a role in differentiating the third and fourth groups.
Figure 10 shows that the grouping based on the discriminant function is not entirely correct, because not all group members are spread around the centroid of their group. In Group 1, there are members of Group 3 who enter Group 1. In Group 2, two members enter Group 3. The members of Groups 3 and 4 lie around their group centroids. To evaluate the results of the discriminant analysis, the total accuracy value (1-APER) is needed, which is based on the classification table (Hosmer & Lemeshow, 2000). Table 8 shows that the accuracy of the classification for the four groups that have been formed is 0.9032, or 90.32%. The APER value is therefore 0.0968, which means that the error rate of the classification using discriminant analysis is 0.0968. Some observation units (sub-districts) are grouped incorrectly. Simokerto is classified into Group 3, although in the cluster analysis (grouping by regional TB patient numbers) it belongs to Group 1. Gunung Anyar belongs to Group 2 in the cluster analysis on the actual data but is classified into Group 3. Wiyung belongs to Group 3 in the cluster analysis on the actual data but is classified into Group 2.
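The reported accuracy can be checked with a short calculation. The classification matrix below is an assumption reconstructed from the text, not a copy of Table 8: it uses the group sizes given for Table 4 (11, 9, 7, 4) and the three misclassified sub-districts named above (Simokerto, Gunung Anyar, Wiyung). The `aper` helper is hypothetical.

```python
def aper(confusion):
    """Apparent error rate: misclassified units divided by all units."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Rows: actual group from cluster analysis; columns: predicted group.
# Assumed reconstruction: Simokerto (1 -> 3), Gunung Anyar (2 -> 3), Wiyung (3 -> 2).
confusion = [
    [10, 0, 1, 0],
    [0, 8, 1, 0],
    [0, 1, 6, 0],
    [0, 0, 0, 4],
]
error_rate = aper(confusion)   # 3 / 31
accuracy = 1 - error_rate      # 28 / 31
```

With 28 of 31 sub-districts classified correctly, 1-APER is 28/31, which rounds to the 0.9032 stated in the text, and APER is 3/31, which rounds to 0.0968.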

IV. Conclusion
The analysis has yielded several findings regarding tuberculosis in Surabaya. The sub-districts with the highest tuberculosis burden are Sawahan, Krembengan, Semampir, Kenjeran, and Tambaksari. This categorization is an essential first step in developing focused intervention strategies for these areas.
The study has also shed light on the relationship between the quality of individual health and the spread of tuberculosis in Surabaya. This finding results from the factor analysis, which considered multiple variables. The variables in the most prominent factor were the number of people living with HIV/AIDS, BCG immunization coverage among toddlers, the number of residents with healthy homes, access to clean water facilities, availability of sanitation facilities such as latrines and waste disposal sites (TPS), the number of Posyandu, and the number of tuberculosis cases. Together, these variables show the network of elements that contribute to the tuberculosis situation in the city.
Furthermore, the study has identified regions that are particularly susceptible to rising tuberculosis prevalence: the Tambaksari, Wonokromo, Sawahan, and Semampir districts. Identifying these areas enables proactive measures, such as intensifying healthcare provision, running public health awareness campaigns, and improving access to healthcare facilities. The discriminant analysis achieved a classification accuracy of 90.32%, surpassing the criterion of 0.5. This corresponds to an error rate (APER) of only 0.0968, which indicates that the model used in this study performs well.
However, although this study advances our understanding of tuberculosis prevalence in Surabaya, it also shows the need for further investigation. Future analyses should consider additional health variables that were outside the scope of this study. It is also important to compare other analytical methods in order to determine their effectiveness and accuracy, so that the most efficient strategies are used to address tuberculosis in Surabaya. This study lays the groundwork for a more comprehensive and efficient approach to addressing tuberculosis prevalence in urban areas.

Fig. 1. Analysis steps

Fig. 2. A map of the number of tuberculosis patients in Surabaya

Fig. 10. Plot of the discriminant function

Table 1. Correlation test and data adequacy

Table 4. Results of subdistrict grouping in Surabaya city

Table 5. The results of the variance-covariance matrix

Table 6. The results of the variance-covariance matrix

Table 7. The function of the discriminant equation
After the discriminant analysis, the discriminant equation functions are obtained, as shown in Table 7. Based on Table 7, the discriminant functions can be described as follows.

Table 8. Accuracy of classification