High Dimensional Data Clustering using Self-Organized Map

As the population grows and e economic development


I. Introduction
Houses have been a promising investment commodity over the last decade. The house price index in Indonesia has also experienced inflation, as has the level of residential property sales [1]. The price is influenced by several factors, such as the interest rate, inflation in home ownership loans, inflation in building material prices, and inflation in workers' minimum wages. Many different types of houses are offered with various features, which can make it hard for prospective buyers to decide.
Three other factors that influence house pricing are physical attributes, accessibility, and developer reputation [2]. Physical attributes are the visible and measurable attributes of a house, such as the land area, building area, number of rooms, number of bathrooms, and the availability of a living room. Accessibility refers to the house location, which determines the ease of access to public facilities such as hospitals, schools, campuses, and markets. Commonly, the closer a house is to many public facilities, the higher its price. Other economic phenomena that can affect house prices are the interest rate, inflation, and the Gross Domestic Product (GDP) [3].
Because many features are considered in determining house prices, housing data are classified as high-dimensional data. In previous studies, neural networks have been used to predict house prices [4][5][6][7][8]. Several regression approaches that use time-series data have also been applied to predict house prices [9][10]. However, neural networks and regression techniques require all feature values to be complete, which makes them less applicable in real conditions, because the information received from prospective buyers is not always the same or complete.
Although a neural network can replace a missing input value through interpolation, the replaced value is assigned under the assumption that it is related to the other variables, which is not always correct in the house pricing case. For example, suppose the first record has a land area of 60 square meters and a building area of 30 square meters, while the second record has a land area of 100 square meters and a building area of 50 square meters. If a third record has a land area of 150 square meters but is missing the building area, interpolation will return 75 square meters as the replacement value, based on the assumption drawn from the two previous records that the building area is half of the land area. Such results are not always correct for housing data, because the building area of a house may exceed the land area if the house has more than one floor. Therefore, the use of interpolation in house price prediction can cause inaccurate results.
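The arithmetic of this example can be reproduced with a minimal Python sketch. The function name linear_fill and the 180 m² two-storey value are hypothetical and serve only as illustration; they are not taken from the dataset.

```python
# Two complete records: (land area, building area) in square meters.
known = [(60.0, 30.0), (100.0, 50.0)]


def linear_fill(land_area, records):
    """Estimate a missing building area from the line through two known records."""
    (x0, y0), (x1, y1) = records
    slope = (y1 - y0) / (x1 - x0)   # 20 / 40 = 0.5
    intercept = y0 - slope * x0     # 30 - 0.5 * 60 = 0
    return slope * land_area + intercept


estimate = linear_fill(150.0, known)
print(estimate)  # 75.0

# Hypothetical two-storey house on the same 150 m^2 plot: the building area
# can exceed the land area, so the filled-in value may be far from the truth.
actual_two_storey = 180.0  # illustrative value, not from the dataset
print(abs(actual_two_storey - estimate))  # 105.0
```

The estimate is only as good as the assumed linear relation between land area and building area, which is exactly the assumption questioned above.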
This study uses a clustering approach to give recommendations for house prices. Clustering can extract the feature values of each cluster, which can then be used as a price recommendation. The clustering process compares all data in the dataset and groups them based on the similarity of their features. It is expected to provide information on price ranges that are in line with the features already known by prospective buyers, which makes the process of predicting house prices more applicable. Moreover, if each cluster can be easily distinguished from the others, the feature values of different clusters will not overlap. This makes it easy to deal with data that contain missing values: the price recommendation can still be made by looking at the values of the known attributes and ignoring the unknown ones.
Several previous studies have applied clustering approaches to prediction tasks. Two-stage clustering has been used to predict rental house prices [11]. The idea of this method is to group the rental house data into clusters based on location using a clustering algorithm, and then to create a prediction model for each cluster using a linear regression neural network. The clustering considers house location because that research assumes that the nearer a house is to many public facilities (landmarks), the higher the rental price. With this hybrid method, effective clusters can be created, although the accuracy of the rent price prediction still needs improvement.
A two-stage clustering method using K-Means and a Fuzzy Inference System has also been used to cluster house data [12]. The data are clustered into four predefined clusters based on house price: cheap, medium, expensive, and very expensive. The clustering was implemented to see how the location of a house affects its price. After the clusters were formed, the centroid feature values of each cluster were obtained and used as initial values to build a fuzzy inference system. That research shows that the fuzzy system cannot predict the same clusters as K-Means, which means that the accuracy of the house price prediction is still low. Similar work was done in [13], which tried to predict house prices using three different methods: fuzzy logic, Artificial Neural Network, and K-NN.
Another clustering method, Fuzzy C-Means, has also been used as part of hybrid methods for prediction. A hybrid of Fuzzy C-Means and a regression technique was used to predict the workload of a new driver [14]. Fuzzy C-Means was used to generate a driver workload model based on the previously generated regression. Meanwhile, [15][16] applied Fuzzy C-Means as a feature extraction method for predicting software faults.
On the other hand, SOM has been applied to classify and label transient signal data [17]. Sequences of stable and transient phases are extracted from time-series signal data obtained from aircraft engines during flights. SOM clusters and labels the transient data by checking the similarity of the patterns. The labeling of the transient signals is reported to perform well in terms of robustness and visualization.
A Generalized SOM (GSOM), an improvement of SOM, has also been studied [18]. The special characteristic of this method is that it can automatically determine the best number of clusters and also the shape of the clusters by using a 1-D neighborhood method. The 1-D neighborhood is represented as a chain of neurons that can automatically disconnect from and reconnect with other neurons.
SOM has also been implemented in the health sector to classify and predict female subjects with unhealthy visceral fat levels in Japan [19]. A map topology is formed from the neurons, where each neuron stores 13 health parameters that are used to detect visceral fat. This map topology is then trained using the SOM algorithm, and each neuron is given a label that represents the visceral fat level. The prediction for test data is made by finding the winning neuron, which is the neuron that stores the feature values closest (most similar) to the data.
Kohonen's Self-Organized Map (SOM) is implemented in this research to cluster high-dimensional housing data. SOM is proposed because it works on topologically arranged neurons, where each neuron has different feature values. SOM also has a neighborhood weight updating mechanism, which causes adjacent neurons to have similar characteristics. This is expected to improve the cluster visualization. This research uses K-Means as a comparison approach. The performance of the two methods is compared to discover the best algorithm for clustering high-dimensional house data.

A. Dataset
The dataset consists of 189 housing records obtained from property exhibitions held in March and August 2017. All records have different physical attribute values and locations, and have valid prices that were determined by the developer and remained valid until December 2017. This research uses [20] to obtain the exact number of public facilities within a 1000-meter radius of the house location. All feature values are normalized to optimize the clustering process. The features that make up this house dataset are shown in Table 1. The complete dataset can be accessed through [21].
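The normalization method is not specified in the text. As an illustration only, the sketch below assumes min-max scaling of each feature to [0, 1]; the function name and the sample records are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, lo, hi in zip(row, lows, highs):
            # A constant column carries no information; map it to 0.0.
            scaled.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical records: (land area in m^2, number of bedrooms).
houses = [(60, 2), (100, 3), (150, 4)]
for row in min_max_normalize(houses):
    print(row)
```

Scaling of this kind keeps a feature with a large numeric range, such as land area, from dominating the Euclidean distance used during clustering.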

B. SOM Clustering
SOM is a type of neural network that is categorized as an unsupervised algorithm. SOM is built from one or more layers of neurons and can be described as a topological map of neurons. In general, the SOM algorithm works by finding the neuron whose weight is most similar to the data, called the winning neuron, and then updating the weights of the surrounding neurons within the neighborhood radius to form clusters of neurons that have similar weights. The applied SOM algorithm follows [22]:
• Initialization. In this first step, the SOM parameters, such as the weight vectors of the neurons, the map size, the learning rate, and the radius of the neighborhood update ($N_c$), are initialized. A two-dimensional rectangular map grid is used in this research, and several map sizes are tested to find the one that gives the best clustering result. Each neuron contains a set of feature values, as described in Table 1. The learning rate represents how fast the algorithm learns in each iteration. The radius of the neighborhood update refers to the number of neurons around the winning neuron that will be updated.
• Obtaining the winning neuron. Each data vector ($x$) in the dataset is compared to the weights of each neuron ($w_i$) in the topological map, and the data similarity ($d$) is calculated using the Euclidean distance, as written in (1). The neuron with the smallest distance to the data is called the winning neuron ($c$).
• Neighborhood weights update. This step makes adjacent neurons have similar weights. The weights are updated using (2) and (3), where $h_{ci}(t)$ equals the learning rate $\alpha(t)$ for all neurons within $N_c$ and $h_{ci}(t) = 0$ for all neurons outside $N_c$. $w_i$ and $w_c$ are the weights of neuron $i$ and the winning neuron $c$. The distance between neuron $i$ and neuron $c$, $\|r_i - r_c\|$, is calculated from the neuron positions in the grid map.
• Stopping criteria. The stopping criterion is determined using (4), where $e$ is the minimum allowable change of the neuron weights between the current iteration ($t$) and the previous iteration ($t-1$).
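The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this research. It assumes random initial weights in [0, 1), a square neighborhood measured on the grid, a constant learning rate, and a stopping rule that compares the largest weight change per pass with e. The function names and the toy data are hypothetical.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def best_matching_unit(x, weights):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda pos: euclidean(x, weights[pos]))


def train_som(data, rows, cols, alpha=0.05, radius=1, e=0.1, max_iter=100, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: one random weight vector per cell of the grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for _ in range(max_iter):
        previous = {pos: w[:] for pos, w in weights.items()}
        for x in data:
            # Winning neuron: smallest Euclidean distance to the data vector.
            win_r, win_c = best_matching_unit(x, weights)
            # Neighborhood update: h = alpha inside the radius, 0 outside.
            for (r, c), w in weights.items():
                if max(abs(r - win_r), abs(c - win_c)) <= radius:
                    for k in range(dim):
                        w[k] += alpha * (x[k] - w[k])
        # Stopping criterion (assumed form): largest weight change below e.
        change = max(euclidean(weights[pos], previous[pos]) for pos in weights)
        if change < e:
            break
    return weights


# Toy data with two features, already scaled to [0, 1].
data = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
som = train_som(data, rows=3, cols=3)
for x in data:
    print(x, "->", best_matching_unit(x, som))
```

After training, counting how many records map to each grid position gives the kind of hit map used to define clusters in the next sub-section.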

C. Defining the Cluster
Several assumptions are used in defining clusters. For convenience, the topological map is illustrated in Figure 1, where each cell represents a neuron and the number written in each cell is the amount of data that best matches the weight of that neuron. The assumptions are as follows:
• A cluster must have at least two matching data in one of its neurons, or a neuron with only one matching datum must have other data in adjacent neurons. In Figure 1a, the cells marked in red do not satisfy the condition to form a cluster.
• If two or more cells are separated by empty cells (not adjacent), each cell is considered a separate cluster (Figure 1b).
• If two or more adjacent cells have matching data, all of the adjacent cells are considered the same cluster (Figure 1c).
Cells without numbers represent neurons that do not match any data in the dataset.
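These rules amount to finding groups of adjacent non-empty cells on the hit map. A minimal sketch is given below. It assumes that diagonal neighbors count as adjacent, which the text does not state, and it uses a hypothetical 4×4 hit map.

```python
from collections import deque


def find_clusters(hits):
    """Group adjacent non-empty cells of a hit-count grid into clusters."""
    rows, cols = len(hits), len(hits[0])
    seen = set()
    clusters = []
    for r in range(rows):
        for c in range(cols):
            if hits[r][c] == 0 or (r, c) in seen:
                continue
            # Breadth-first search over adjacent non-empty cells.
            component, queue = [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                component.append((cr, cc))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and hits[nr][nc] > 0 and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
            # Minimum condition: the group must hold at least two data.
            if sum(hits[pr][pc] for pr, pc in component) >= 2:
                clusters.append(component)
    return clusters


# Hypothetical hit map: each number is the amount of data matched to a neuron.
grid = [
    [2, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [3, 0, 0, 0],
]
print(find_clusters(grid))  # [[(0, 0), (0, 1)], [(3, 0)]]
```

The isolated cell holding a single datum at position (1, 3) is dropped because it fails the minimum condition, while the two adjacent cells in the top row merge into one cluster.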

D. The Measurement of Cluster Validity
To measure the quality of a clustering result, the Silhouette coefficient and the Davies-Bouldin Index are used. The principle of the Silhouette coefficient is that a cluster is good if the distance between members of the same cluster is small, while the distance between clusters is large enough that each cluster can be easily recognized and separated from the others. The Davies-Bouldin Index evaluates clustering results by measuring the ratio between the spread within clusters and the distance between clusters. The Silhouette coefficient is given in (5), while the Davies-Bouldin Index is given in (6) to (8).
In (5), a is the mean intra-cluster distance, whereas b is the mean distance to the nearest neighboring cluster. The value of the Silhouette coefficient lies in the range [-1, 1]. The most effective clustering is obtained when Sil = 1, whereas the worst value is Sil = -1. A value of Sil = 0 indicates that the clusters overlap.
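As an illustration of these definitions, the sketch below computes the Silhouette coefficient as the mean of (b - a) / max(a, b) over all data, which is the commonly used form. It assumes Euclidean distance and at least two clusters; the function names and the four sample points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def silhouette_coefficient(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(euclidean(point, other) for other in members) / len(members)
                for other_label, members in clusters.items()
                if other_label != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))  # compact, separated groups
print(silhouette_coefficient(points, [0, 1, 0, 1]))  # mixed-up labels
```

The first labeling gives a value close to 1, while the second gives a negative value because each point lies closer to the other cluster than to its own.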
In the Davies-Bouldin Index, the spread of each cluster, $S_i$, is calculated using (6), where $T_i$ is the number of members in cluster $i$ ($C_i$) and $A_i$ is the center of cluster $i$. The distance between clusters, $M_{ij}$, is calculated as the Euclidean distance between the centroid of cluster $i$ and the centroid of cluster $j$. The ratio between the spread and the distance, $R_{ij}$, is calculated using (7). Then, the maximum value of the ratio ($R_i$) is used to calculate the Davies-Bouldin Index, which is shown in (8).
Unlike the Silhouette coefficient, the Davies-Bouldin Index (DBI) is non-negative and has no fixed upper bound. A DBI close to 0 indicates that the ratio of data distribution in the clusters is very good, while larger values indicate an increasingly poor ratio.
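The calculation can be sketched as follows. The sketch assumes the common form of the index: the spread of a cluster is the mean Euclidean distance of its members to the cluster center, the ratio for a pair of clusters is the sum of their spreads divided by the distance between their centers, and the index is the average of the largest ratio of each cluster. The exact form of (6) to (8) may differ, and the function names and sample points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def centroid(members):
    return [sum(col) / len(members) for col in zip(*members)]


def davies_bouldin_index(points, labels):
    """Average over clusters of the largest (S_i + S_j) / M_ij ratio."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centers = {label: centroid(members) for label, members in clusters.items()}
    # Spread of a cluster: mean distance of its members to its center.
    spread = {label: sum(euclidean(p, centers[label]) for p in members) / len(members)
              for label, members in clusters.items()}
    total = 0.0
    for i in clusters:
        # Largest ratio of combined spread to center distance for cluster i.
        total += max((spread[i] + spread[j]) / euclidean(centers[i], centers[j])
                     for j in clusters if j != i)
    return total / len(clusters)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(davies_bouldin_index(points, [0, 0, 1, 1]))  # compact, separated groups
print(davies_bouldin_index(points, [0, 1, 0, 1]))  # mixed-up labels, much larger
```

A smaller value indicates clusters that are compact relative to the distance between them.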

A. SOM Parameter Testing
Several SOM parameters are tested in this research. Each parameter value is tested 7 times. The first tested parameter is the radius of the neighborhood update ($N_c$). This parameter determines the area of weight updates for the neurons located around the winning neuron. The closer a neuron is to the winning neuron, the more significant its weight change, so that its weight becomes more similar to the weight of the winning neuron. When testing the neighborhood radius, the other parameters are temporarily set to the following default values: the map size is 15×15, the learning rate is α = 0.05, and the maximum error is e = 0.1. The radius is expressed as a percentage of the map size. For example, if the map size is 15×15 and $N_c$ = 60%, the radius of the neighborhood update is $N_c$ = 9 (9 neurons above, 9 neurons to the left, 9 neurons below, and 9 neurons to the right of the winning neuron). Table 2 shows the result of this parameter test.
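The conversion from a percentage to a number of neurons can be sketched as below. Only the 60% case is given in the text; rounding to the nearest integer for other percentages is an assumption, and the function name is hypothetical.

```python
def neighborhood_radius(map_size, percentage):
    """Convert a radius given as a percentage of the map side into neurons."""
    # Rounding to the nearest integer is an assumption for non-integer results.
    return round(map_size * percentage / 100)


print(neighborhood_radius(15, 60))  # 9
print(neighborhood_radius(15, 67))  # 10
print(neighborhood_radius(15, 80))  # 12
```

Near the edges of the map, part of this neighborhood falls outside the grid, so fewer neurons are updated than the radius alone suggests.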
Table 2 shows that the Silhouette coefficient is negative for every neighborhood radius tested. This negative value can be influenced by the other parameters. The best Silhouette coefficient is obtained when the neighborhood radius is 67% of the map size. The best (smallest) DBI value is also obtained at 67%. The test results show that when $N_c$ = 67%, the resulting clusters have a good ratio in terms of number and inter-cluster distance, but the clustering is still not effective. A neighborhood radius that is too large (80%) makes the cluster boundaries less clear, because the area of neurons whose weights are updated is too broad. This allows the weights of the neurons to look as if they belong to a single cluster. On the other hand, a neighborhood radius that is too small makes the cluster formation process very slow. For the remaining parameter tests, the neighborhood radius is set to 67% of the map size.
The next tested parameter is the learning rate (α). The results are shown in Table 3. Based on the tests, the best Silhouette coefficient is 0.0958, obtained at α = 0.06, while the best DBI is obtained at α = 0.05. For all tested learning rates, the algorithm needs only 3-4 iterations to complete the clustering process, which shows that the learning rate does not affect the number of iterations. Considering the average Silhouette coefficient, the average DBI, the best Silhouette coefficient, and the best DBI value, the next parameter tests use the learning rate α = 0.06.
The next tested parameter is the maximum error (e), which serves as the stopping criterion of the clustering process. The maximum error in SOM testing can be regarded as the threshold for a significant change in neuron weights compared to the weights in the previous iteration. The test results for the stopping criterion are shown in Table 4. The smaller the specified error value, the more similar the map weights of the current iteration must be to those of the previous iteration. Based on the test results, the larger the error value, the fewer iterations are needed for the clustering process. The test results also show that at e = 0.45 and e = 0.5 there is an increase in the Silhouette coefficient and DBI values, both on average and for the best value. This can happen because, with a large e, the clustering process stops while the weight changes are still very significant; at that point, not many data have moved from one neuron to another, so the data are still quite scattered. When the clusters are quite diffuse, the clustering result may obtain a better Silhouette coefficient as well as a better DBI value. Based on the testing, the best stopping criterion is e = 0.5, because it gives the best average Silhouette coefficient and DBI. However, the best single Silhouette coefficient value is obtained when e = 0.4. Thus, the stopping condition is set to e = 0.5 for the next parameter test.
The last tested parameter is the size of the topological map. The test results are shown in Table 5 and indicate that the best map size is 30×30. The 10×10 map shows the worst results because its number of neurons (100) is smaller than the number of training data, which is 189 house records. Thus, one neuron can be matched with many house records, so a good cluster is very difficult to achieve. A 10×10 map also provides very limited room to separate the clusters from one another. As a consequence, the Silhouette coefficient is very small.
After the parameter tests, the best number of clusters obtained using SOM is n = 2, although in some tests the number of clusters reaches n = 4. The visualization of the best clustering result is shown in Figure 2.

B. K-Means Result
This sub-section discusses K-Means, which is implemented as a comparison to the SOM algorithm. Unlike SOM, the number of clusters in K-Means must be specified before testing. The parameters tested in K-Means are the number of clusters and the stopping criterion (e). Table 6 and Table 7 show the test results of the K-Means algorithm based on the Silhouette coefficient and DBI. All values written in the tables are the best values for each test group. Based on the tests, the best Silhouette coefficient is obtained when the number of clusters is specified as n = 6 and e = 0.5. However, the number of clusters actually formed does not exceed n = 3. Almost all of the tested parameter values yield a negative Silhouette coefficient. This indicates that the resulting clusters are not well formed and are still difficult to distinguish from one another. The best number of clusters obtained is 3.
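For reference, a minimal K-Means sketch is given below. It assumes that the first k points are used as initial centroids and that the process stops when the largest centroid shift falls below e; neither detail is stated in the text. The toy data are hypothetical and only show one way in which fewer clusters than requested can end up with members. They are not an explanation of the experimental result.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_means(points, k, e=0.5, max_iter=100):
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                  for p in points]
        shift = 0.0
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if not members:
                continue  # an empty cluster keeps its old centroid
            # Update step: the centroid becomes the mean of its members.
            new = [sum(col) / len(members) for col in zip(*members)]
            shift = max(shift, euclidean(new, centroids[j]))
            centroids[j] = new
        # Stopping criterion (assumed form): largest centroid shift below e.
        if shift < e:
            break
    return labels, centroids


points = [(0, 0), (0, 0), (10, 0), (10, 1)]
labels, centroids = k_means(points, k=3)
print(labels)            # [0, 0, 2, 2]
print(len(set(labels)))  # 2 clusters have members although k = 3
print(centroids[2])      # [10.0, 0.5]
```

The last centroid, [10.0, 0.5], is the mean of two records and does not coincide with any record in the toy data.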

C. SOM and K-Means Comparison
Based on the test results, both SOM and K-Means still have difficulty achieving good clustering results. This is shown by the negative average Silhouette coefficient, which indicates that many data are assigned to the wrong cluster. However, the best Silhouette coefficient achieved by SOM (0.4367) is better than that of K-Means (0.236). In this case, SOM has a better ability to build valid clusters than K-Means. The SOM algorithm represents the data as a two-dimensional topological map in which the data are placed on neurons. As a result, the distance among members of a cluster and the distance between clusters are easier to measure.
In terms of data distribution, SOM shows better performance than K-Means. In SOM, each cluster can be clearly identified by the grid distance between clusters on the map, so the distance between clusters can be calculated easily and clearly. When clustering the data, SOM compares the input data vectors to the weights of the neurons, whereas K-Means compares the input data vectors to the centroid values. The centroid values in K-Means are updated in each iteration with the average feature values of their members. Thus, a centroid does not always correspond to a point in the dataset. This may create difficulties in determining the cluster area and the cluster distribution.
Although SOM shows better performance than K-Means, it still needs improvement for clustering high-dimensional data. This is because SOM determines the winning neuron by calculating the similarity between the data and the neuron weights using the Euclidean distance, in which all features are given the same weight. In high-dimensional data, however, not all features are relevant, so considering all features in the calculation can actually hinder the formation of a valid clustering result. In fact, some features in high-dimensional data can be regarded as noise [23].
Considering the characteristics of the data, SOM can be modified to use a different distance measure, for example the Manhattan distance [24][25][26].
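The effect of changing the distance measure can be seen in a small sketch. The function names and the two weight vectors are hypothetical; they are chosen only so that the two measures select different winning neurons.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def winning_neuron(x, weights, distance):
    """Index of the weight vector closest to x under the given distance."""
    return min(range(len(weights)), key=lambda i: distance(x, weights[i]))


x = (0.0, 0.0)
weights = [(3.0, 3.0), (0.0, 5.0)]
print(winning_neuron(x, weights, euclidean))  # 0 (about 4.24 versus 5.0)
print(winning_neuron(x, weights, manhattan))  # 1 (6.0 versus 5.0)
```

Because the winning neuron drives every weight update, a different distance measure can lead to a different map and therefore to different clusters.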

IV. Conclusion
SOM can be used to cluster housing data and shows better performance than the K-Means algorithm. SOM outperforms K-Means in visualizing the clustering of high-dimensional data, which makes the cluster validity easier to calculate. In addition, SOM forms better clusters, as indicated by better Silhouette coefficient and DBI values. However, SOM still needs improvement to produce better clustering results.

Fig. 1. The illustration of defined cluster; (a) the minimum condition of a cluster; (b) a special cluster; (c) a cluster with two adjacent cells

Table 1. The list of observed attributes of a house data

Table 2. Neighborhood radius testing

Table 3. Learning rate testing

Table 4. Maximum error testing

Table 5. Map size testing

Table 6. K-means clustering result based on the silhouette coefficient