K-Means Clustering and Multilayer Perceptron for Categorizing Student Business Groups

ABSTRACT
The research conducted in this study was driven by the East Java provincial government's requirement to assess the transaction levels of the Student Business Groups (KUS) in the SMA Double Track program. These transaction levels are a basis for allocating supplementary financial aid to each business group. The system's primary objective is to assist the provincial government of East Java in making well-informed choices about the distribution of supplementary capital to the KUS. The classification technique employed in this study is the multilayer perceptron. However, because target data was not available for the classification process, the K-Means Clustering method is used to generate it by dividing the transaction level attribute into three distinct groups: (0) low transactions, (1) medium transactions, and (2) high transactions. The clustering process uses three features: (1) income, (2) spending, and (3) profit. These three features are also used as input data in the classification procedure. The classification procedure employing the Multilayer Perceptron technique processed a dataset of 1383 data points. The training data constituted 80% of the dataset, while the remaining 20% was allocated for testing. To evaluate the constructed model, the training error was assessed using K-Fold cross-validation, yielding an average accuracy score of 0.92. The classification itself yielded an accuracy of 0.96. The model is intended for classification scenarios in which the dataset lacks prior target data.

I. Introduction
SMA Double Track is a flagship education program of East Java Province, packaged as an extracurricular activity in senior high schools and aimed at developing students' entrepreneurial skills. In this activity, students learn the skills they are interested in, study the ins and outs of the related business, and gain real-life experience in running a business. Thus, even if they cannot continue to higher education, they can establish their own businesses or work in their local area using the skills they have acquired [1]. However, the activity faces challenges that need attention, including infrastructure, resources, curriculum management, and the community's perception and understanding of the educational program.

The training method for SMA Double Track uses a group system called the Student Business Group (KUS), with each KUS consisting of 5-6 students. The aim is for each student to have a role and responsibilities in running the business. The target is for each KUS to be capable of selling the products or services resulting from its training to the community, thus generating transactions or revenue. Each year, the number of transactions for each KUS per topic is recorded by the East Java Provincial Education Office to determine the potential for developing students' businesses into start-up companies that will receive business capital assistance.

KUS in the SMA Double Track system is crucial in providing students with practical experience and entrepreneurial opportunities. The activity has a positive impact on students: it gives them the opportunity to apply the knowledge they have acquired in real-life situations, helps them develop entrepreneurial skills that are crucial for their future, and creates a collaborative environment where students can work together in teams, share ideas, and build strong networks. Students can also enhance their self-confidence by facing challenges and taking risks. KUS can serve as a means to implement integrated learning among the subjects taught in the academic and vocational tracks.
The SMA Double Track program has many KUS, which makes it difficult for the provincial government to assess the volume of transactions that informs the decision to provide additional capital to each KUS. Consequently, the provincial government needs a transaction classification system to make decision-making easier. Earlier classification studies have applied a range of methods. A Convolutional Neural Network-Recurrent Neural Network (CNN-RNN) resulted in an accuracy of 75% [2]. Long Short-Term Memory (LSTM) yielded satisfactory precision, recall, and F1-score values [3]. A comparison of K-Nearest Neighbors (K-NN), CNN, and LSTM indicated that K-NN performed best among the machine learning algorithms in that case, achieving an accuracy of 83.82% [4]. Supervised machine learning models, including linear, non-linear, and ensemble models, have been used to classify harmful and non-harmful activities; that study showed that linear and non-linear models outperformed ensemble learning in classifying Ethereum blockchain addresses [5]. All of these methodologies are contingent upon the availability and utilization of data.
The classification research conducted so far [6][7] requires both input data and target data. However, the transaction data from the SMA Double Track KUS does not yet have target data, so a method is needed to create it. Hence, this study employs a combination of methods to classify the transaction levels of the KUS: K-Means Clustering and the Multilayer Perceptron [8]. The K-Means Clustering method [9][10] is used to create transaction-level classes with three levels: low, medium, and high [11]. The Multilayer Perceptron (MLP) [12][13] is then employed to determine the transaction level of the Double Track student business groups.

K-means clustering is one of the popular algorithms in data analysis, used to group data into different clusters based on similarities in features or attributes [14][15]. The MLP, on the other hand, is a structured Artificial Neural Network (ANN) [16] architecture that uses a supervised learning method [17], known as backpropagation, for classification purposes [18][19]. The MLP is chosen because it is highly effective, easy to implement, and provides good results in many cases [20][21][22]. Compared to several other methods, such as Support Vector Machines (SVM) [23], the MLP can yield better results [24][25]. In addition, compared to the Decision Tree and Random Forest methods, the MLP can achieve a higher accuracy rate of 80% [25]. The MLP has also been reported to perform better than CNN [26] and better than polynomial regression [27]. Python was chosen as the programming language for this study since it is considered one of the easiest to learn and use [28].

The machine learning model developed in this study is intended to perform target labeling and classification with high precision and accuracy. This model is anticipated to help the East Java Provincial Education Office determine the transaction levels and streamline the decision-making process for each KUS receiving financial support.

II. Method
The research methodology is a framework researchers use when conducting a study, encompassing the stages from data collection to data analysis. These stages are carried out in a structured and systematic manner. The research stages are presented in Figure 1.

After data collection, the next step is data preprocessing, which consists of two processes. First, data cleaning is performed to handle outliers and missing values. Second, the attribute data is normalized using the min-max method with a value range between 0 and 1. The min-max method can be calculated as in (1).

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where $x'$ is the normalized result value, $x$ is the actual data value to be normalized, $x_{\min}$ is the minimum value of the actual data, and $x_{\max}$ is the maximum value of the actual data.
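As a minimal sketch of the min-max method in (1), the function below scales a list of values to the range 0 to 1. The function name `min_max_normalize` and the income figures are hypothetical and serve only to illustrate the formula; they are not taken from the study's dataset.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1] using min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical income figures, used only to illustrate the formula.
income = [200.0, 500.0, 1200.0]
normalized_income = min_max_normalize(income)
print(normalized_income)
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses all the others toward 0. This is one reason outliers are handled before normalization.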
After the data preprocessing is complete, the next step is to cluster the data using K-means clustering with three clusters: low, medium, and high. K-means clustering is chosen because it can group large numbers of objects efficiently, which speeds up the clustering process. This capability has been demonstrated in several studies, such as [29][30][31][32][33]. The pseudocode for the K-means algorithm can be seen in Pseudocode 1.
PSEUDOCODE 1 : K-means
1. Arbitrarily choose k data items from D as initial centroids;
2. Repeat
   Assign each item d_i to the cluster which has the closest centroid;
   Calculate new mean for each cluster;
   Until convergence criteria is met.
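The steps of Pseudocode 1 can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the helper names, the sample points, and the use of "centroids stop moving" as the convergence criterion are all assumptions. Because K-means numbers its clusters arbitrarily, the sketch also relabels them in ascending order of the first feature so that 0, 1, and 2 can be read as low, medium, and high; the source does not state how this mapping was done.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(data, k, max_iter=100, seed=0):
    """K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k data items as the initial centroids.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 2a: assign each item to the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in data
        ]
        # Step 2b: calculate the new mean for each cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        # Assumed convergence criterion: the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def order_clusters_by_feature(centroids, labels, feature=0):
    """Relabel clusters so that 0, 1, 2, ... follow ascending centroid value."""
    order = sorted(range(len(centroids)), key=lambda c: centroids[c][feature])
    rank = {old: new for new, old in enumerate(order)}
    return [centroids[c] for c in order], [rank[lab] for lab in labels]


# Hypothetical normalized (income, expenditure, profit) rows.
sample = [
    (0.00, 0.00, 0.00), (0.10, 0.05, 0.05),
    (0.50, 0.30, 0.20), (0.55, 0.30, 0.25),
    (1.00, 0.50, 0.50), (0.95, 0.50, 0.45),
]
centroids, labels = order_clusters_by_feature(*kmeans(sample, k=3))
print(labels)
```

Because the initial centroids are chosen at random, different seeds can lead to different final clusters, so in practice the algorithm is often run several times and the best result kept.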
The classification system uses the clustering results as labels, or target attributes. After the labeling process, the dataset is randomly split into 80% training data and 20% testing data. Next, the model is validated to assess its performance, using K-fold cross-validation with five splits on the training data. After that, classification is performed using the multilayer perceptron method. The multilayer perceptron is chosen for its ease of implementation and good results in various cases, as demonstrated in several studies [34][35][36]. This test uses 5.2 hidden layers and 300 max_iter to achieve good accuracy values. Afterward, the classification model's performance is evaluated using the confusion matrix to obtain precision, recall, and F1-score values. The pseudocode for the MLP algorithm can be seen in Pseudocode 2.
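The random 80/20 split and the five-split cross-validation described above can be sketched as index operations in plain Python. The function names and the seed are hypothetical, and the study may have used library routines instead. The only figure taken from the text is the dataset size of 1383.

```python
import random


def train_test_split_indices(n, test_ratio=0.2, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_ratio)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for each of the k folds."""
    # Deal the indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(1383)
print(len(train_idx), len(test_idx))  # 1106 277

for fold_train, fold_validation in k_fold_indices(train_idx, k=5):
    # A model would be fitted on fold_train and scored on fold_validation here.
    print(len(fold_train), len(fold_validation))
```

Cross-validation is run only on the training indices, so the 20% test portion stays untouched until the final evaluation.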

III. Result and Discussion
The dataset obtained from the SMA Double Track website consists of 1547 records with 16 feature attributes: school, district, topic name, topic, income, expenditure, profit, catalog, screenshot, online shop link, Instagram link, product poster, description, chairman's name, chairman's phone number, and chairman's address. Of these 16 features, only three are used as model inputs: income, expenditure, and profit. The dataset contains missing values and outliers, so data cleaning and normalization are needed to ensure they do not interfere with the calculation process.
After preprocessing, the next step is data clustering using K-means clustering on a total of 1383 data points with three attributes: income, expenditure, and profit. The distribution of the data used in this research by topic can be seen in Figure 2.
Figure 2 shows the number of data points for each topic: Muslim fashion design has 48, fashion design has 95, bridal hijab makeup has 163, hair styling has 17, stage makeup has 63, photography has 60, video editing has 37, graphic design has 245, pastry bakery processing has 340, Indonesian food preparation has 59, snacks and beverages has 167, motorcycle tune-up has 66, and electronic equipment maintenance and repair has 23. These 1383 data points are clustered into three classes, namely (0) low, (1) medium, and (2) high, with the centroid values shown in Table 1. The clustering results for the three features are visualized in Figure 3.

The clustering results are used as labels, or target attributes, in the classification system. The labeled data is divided into training and testing data: 80% of the dataset is randomly selected as training data, while the remaining 20% is used as testing data. Next, the model is validated using K-fold cross-validation on the training data, and the results are shown in Table 2. After that, classification is performed using the MLP. This MLP test uses 5.2 hidden layers and 300 max_iter to achieve good loss results. The loss curve is shown in Figure 6.

The multilayer perceptron model classified the transactions with an accuracy of 0.96. After the classification process, the model was evaluated using a confusion matrix, with the metrics calculated from the classification report. The results are shown in Table 3. From Table 3, the F1-score column gives an accuracy of 0.96, an unweighted average of 0.90, and a weighted average of 0.96. These results indicate that the multilayer perceptron method is effective in classifying data for the Double Track student business groups.
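To make the relationship between the confusion matrix and the reported averages concrete, the sketch below computes per-class precision, recall, and F1-score, and then the accuracy, the unweighted average, and the weighted average of the F1-scores. The confusion matrix is hypothetical and is not the one behind Table 3; the function name is also an assumption.

```python
def classification_metrics(confusion):
    """Compute per-class and averaged metrics from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for c in range(n_classes):
        true_positive = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n_classes))
        support = sum(confusion[c])
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": support}
        )
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total
    # Unweighted average: every class counts equally.
    macro_f1 = sum(m["f1"] for m in per_class) / n_classes
    # Weighted average: each class counts in proportion to its support.
    weighted_f1 = sum(m["f1"] * m["support"] for m in per_class) / total
    return per_class, accuracy, macro_f1, weighted_f1


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
example = [
    [8, 2, 0],
    [1, 4, 0],
    [0, 1, 4],
]
per_class, accuracy, macro_f1, weighted_f1 = classification_metrics(example)
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

When one class is much larger than the others, the weighted average follows that class closely, while the unweighted average is pulled down by weaker results on the small classes. A gap between the two averages is therefore worth checking against the class sizes.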

IV. Conclusion
The present study has produced a promising framework that integrates two separate methodologies, so that clustering and classification are carried out within a single workflow. This framework is useful in situations where predetermined target data is lacking. The model presented in this study is primarily aimed at assisting the East Java Provincial Education Office. Its main objective is to provide insight into the transaction behavior of the SMA Double Track student business groups. These observations can provide a basis for formulating policies on additional capital for student-led businesses, thus enhancing the quality and long-term viability of their entrepreneurial initiatives.
This study shows that the combination of k-means clustering and the multilayer perceptron can identify the transaction levels of the Double Track student business groups with high accuracy. The k-means clustering technique was crucial in producing the target data by categorizing transaction levels into three classes: (0) representing low transactions, (1) representing medium transactions, and (2) representing high transactions. The clustering procedure took into account three features, namely: (1) revenue, (2) spending, and (3) profit.
The classification results obtained with the multilayer perceptron showed a noteworthy accuracy rate. To evaluate the model's overall performance, the training errors were analyzed using K-Fold cross-validation. For future work, both the K-means clustering and the multilayer perceptron models should be improved to make fuller use of their capabilities. Furthermore, it is suggested that model development be expanded to include comparative analyses with other approaches, which will aid in establishing benchmarks for assessing the quality and comprehensiveness of the model. Such an investigation has the potential to improve transaction analysis and policy development for student-led enterprises in the educational sector.

Fig. 1. Research stages

The data was collected from the official website of the SMA Double Track program, www.smadt.id, to be used as material for this research. The data obtained consists of 1547 records with 16 feature attributes: school, district, topic name, topic, income, expenditure, profit, catalog, screenshot, online shop link, Instagram link, product poster, description, chairman's name, chairman's phone number, and chairman's address. Out of these 16 features, only 4 are utilized: topic, income, expenditure, and profit. The dataset is stored in Excel file format to facilitate the calculation process.

PSEUDOCODE 2 : MLP
Input : The features vector of each user
Start with random initial weights (i.e., uniform random in [-5, [...]
[...] is sufficiently small or "Time_Out"
Output : The User ID identification result.
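Since the pseudocode above is incomplete in the source, the following is only a sketch of the part that can be illustrated safely: random weight initialization and a forward pass through a small multilayer perceptron. It assumes three inputs (income, expenditure, profit), hidden layers of 5 and 2 units (one possible reading of the "5.2 hidden layers" setting), and three output classes. The ReLU and softmax activations, the weight range of [-0.5, 0.5], and all function names are assumptions. Training by backpropagation is deliberately left out.

```python
import math
import random


def init_layer(n_in, n_out, rng, scale=0.5):
    """Create weights and biases drawn uniformly from [-scale, scale]."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return weights, biases


def dense(inputs, weights, biases):
    """Affine transform: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, layers):
    """Propagate one feature vector through all layers; return class probabilities."""
    activation = features
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))


rng = random.Random(0)
sizes = [3, 5, 2, 3]  # inputs, two hidden layers (assumed sizes), output classes
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# Hypothetical normalized (income, expenditure, profit) vector.
probabilities = forward([0.2, 0.1, 0.1], layers)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)
```

With untrained random weights the predicted class carries no meaning; the sketch only shows how a feature vector flows through the layers to produce one probability per transaction class.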

Fig. 2. Distribution of data by topic

Fig. 4. Visualization based on transaction class

Figure 4 illustrates the number of data points in each transaction class: there are 1180 data points in class 0 (low), 163 data points in class 1 (medium), and 40 data points in class 2 (high). This shows that class 0 ("low") has a higher number of data points than the other transaction classes. Figure 5 shows the comparison of transaction classes for each topic.

Fig. 5. Comparison of transaction classes in each topic

Figure 5 indicates that class 0 or 'low' transactions are most prominent in topics 9 (Pastry Bakery Processing), 8 (Graphic Design), and 3 (Hijab Bridal Makeup).As for class 1 or 'medium' transactions, they are most abundant in topics 10 (Indonesian Food Making), 9 (Pastry Bakery Processing), and 11 (Snacks and Beverages).On the other hand, class 2 or 'high' transactions are most prevalent in topics 9 (Pastry Bakery Processing), 11 (Snacks and Beverages), and 8 (Graphic Design).

Fig. 6. Graph of the loss curve

Next is the comparison between the actual values and the classification prediction results. The comparison results can be seen in Figure 7.

Fig. 7. Comparison of actual values and predicted values

Table 1. Centroid values based on income, expenditure, and profit

Table 2. K-Fold model validation results

Table 3. Matrix testing results based on class