Parallelization of Partitioning Around Medoids (PAM) in K-Medoids Clustering on GPU

Clustering is the task of assigning unlabeled data points into a finite number of clusters. The assignment is usually based on similarity or distance, so data points located in the same cluster are similar to each other. Clustering techniques have been implemented and play important roles in a wide range of application domains, such as image segmentation [1][2][3][4], image clustering [5], bioinformatics [6][7] and data mining [3][8]. Through clustering, the underlying patterns of the data can be revealed. Clustering is unsupervised learning that gives information based on the intrinsic properties of the data when no labels are assigned to the data.

K-medoids clustering is categorized as partitional clustering. K-medoids offers better result when dealing with outliers and arbitrary distance metric also in the situation when the mean or median does not exist within data. However, k-medoids suffers a high computational complexity. Partitioning Around Medoids (PAM) has been developed to improve k-medoids clustering, consists of build and swap steps and uses the entire dataset to find the best potential medoids. Thus, PAM produces better medoids than other algorithms. This research proposes the parallelization of PAM in k-medoids clustering on GPU to reduce computational time at the swap step of PAM. The parallelization scheme utilizes shared memory, reduction algorithm, and optimization of the thread block configuration to maximize the occupancy. Based on the experiment result, the proposed parallelized PAM k-medoids is faster than CPU and Matlab implementation and efficient for large dataset.
There have been studies conducted by researchers to implement the clustering method using parallel computing approaches [13] [14] [15] [16] [17], especially k-medoids clustering, that will be the focus of this research. One of the technologies that used to develop parallel k-medoids clustering is Hadoop-MapReduce [18] [19] [20] [21] [22] [23]. MapReduce consists of map function and reduces function. Map function computes the distances between each data, and the medoids then assign the data to their clusters. Reduce function checks the results of Map function then search for new medoids. This function will return the results to the Map function in the next MapReduce process. In [14], the optimal search of medoids is performed based on the basic properties of triangular geometry. The speed of k-medoids clustering is improved when the validity of the clustering result is maintained [18]. Parallel k-medoids clustering can also be implemented on Graphics Processing Unit (GPU). Several GPU accelerated researchers have developed k-medoids clustering: parallel PAM implementation using CUDA [24] [25], GPU based parallel k-medoids (combined PAM-CLARA) clustering for remote sensing data [26], and GPU accelerated parallel clustering algorithms including k-means clustering, k-medoids clustering, and hierarchical clustering [27].
This work focuses on the development of parallel k-medoids clustering to increase the efficiency of partitioning around medoids (PAM) in k-medoids clustering. Our main contribution is to speed up the computation of the PAM algorithm using a parallelization scheme on GPU. The proposed parallelization scheme can handle large dataset with n number of data without creating n×n table of distance that consume huge memory but still maintain the computation speed. This paper is organized as follows: Section 2 presents the proposed parallelized PAM in k-medoids clustering, Section 3 presents the results and discussion, and the conclusion of this work is described in Section 4.

A. K-Medoids Clustering
K-medoids clustering is a partitional clustering method and is similar to k-means clustering because the goal of both methods is to divide a set of measurements or observations into k subsets or clusters so that the subsets minimize the sum of distances between a measurement and a center of the measurement's cluster. Unlike k-means clustering, which minimizes the distance between data points within a cluster with the mean value of those data points, which called centroid, k-medoids clustering attempts to minimize the distance between data points within a cluster with a representative data point in the same cluster. This representative data point has a minimum total dissimilarities/distances to the other data points in the same cluster. This representative data point is called the medoid. Each generated cluster in k-medoids clustering has a representative data point (medoid). Thus, k-medoids clustering guarantees that the center of a cluster is the most centrally located data point.
Medoids are initialized by selecting k data points arbitrarily. K-medoids clustering iterates until the objective function returns minimum value. The objective function / absolute-error criterion (AEC), E is defined in (1).
where is the data point in the cluster , is the medoid of .
K-medoids function involves several iterative algorithms that minimize the sum of distances from each object to its cluster's medoid, over all of the clusters. A well-known implementation of kmedoids clustering is the Partitioning Around Medoids (PAM) algorithm, which was developed by [28]. It takes two inputs: the number of clusters to generate (known as k) and a dataset D, which contains the data points. It generates k different clusters. The detail method of PAM is: Step. Select k data points arbitrarily from the dataset D as the initial medoids.

2) Swap
Step. Repeat a) Assign each non-medoid data point to the cluster, which has the most similar/nearest medoid. b) Randomly select a non-medoid data point. c) Compute the total cost of swapping that is the difference between the AEC calculated using the current medoid data point and the AEC calculated using the non-medoid data point selected in step b. d) If the AEC calculated using a non-medoid data point is lower than the AEC calculated using the medoid data point, swap the medoid to that selected non-medoid data point. Then, the non-medoid data point is set to be the new medoid of the cluster.
3) Until no more change on clusters.
K-medoids clustering is better than k-means clustering when dealing with outliers [10]. K-medoids is the opposite of k-means clustering, that is sensitive to outliers. The presence of outliers does not influence medoids. K-medoids also useful for categorical clustering data where the mean of the data does not exist within the dataset. However, k-medoids is costlier than k-means clustering due to the iteration that examines every data point.
The computational complexity of k-medoids clustering is ( ( − ) ) where k is the number of clusters, and n is the number of data [29]. For a large value of n and k, the computation is very costly. For a large number of data, the computation time will increase quickly as the number of data grows. Thus, k-medoids clustering is only suitable for small data and suffers inefficiency for big data.

B. CUDA Parallel Computing
GPU (Graphics Processing Unit) is a high-level parallel architecture used to do a fast operation in computer graphics, but now it can be used to perform computation other than graphics, which known as GP-GPU (General Purpose-Graphics Processing Unit) [30]. The well-known general-purpose parallel computing platform and programming model is Compute Unified Device Architecture (CUDA) from NVidia. GPU is highly parallel, multithreaded, has many-core processor and very high memory bandwidth. The difference between how CPU and GPU process the data is shown in Fig. 1 (a) and (b). GPU devotes more transistor to data processing rather than data caching and flow control. GPU is built on array of Streaming Multiprocessors (SM) and it is organized into grids, blocks, and threads. Data-parallel processing maps data elements to parallel processing threads. Fig. 1 (c) shows the parallel processing threads in GPU. A multithreaded program is partitioned into blocks of threads that execute independently from each other.

III. Results and Discussions
The proposed parallelized k-medoids are written using CUDA based on Matlab's PAM implementation [31] [32], and runs on Core-i7 7700K processor, 16 GB of RAM, and NVidia GTX  [30] 1070. The parallel implementation and performance evaluation is explained in the following subsections.

A. Parallelized K-Medoids Clustering
The implementation of parallelized PAM k-medoids is shown in Algorithm 1. It consists of three kernels that compute the gain of medoids to each data within a cluster, computes the gain of nonmedoids to each data, and computes the new medoids. In algorithm 1, the data and initial cluster medoids that randomly picked from data are copied from host to device. The algorithm computes new medoid per cluster and iterates until no medoid is changed. The data partition in the swap step is used to reduce the number of data processed by thread blocks. Algorithm 1. Proposed parallelized PAM k-medoids clustering.
READ parameters of k-medoids and GPU configuration SET random data as initial cluster medoids COPY data and initial medoids from host to device WHILE cluster medoids changed DO FOR EACH cluster DO compute gain of medoid to each data within cluster FOR n = 0 TO number of data partition DO compute gain of non-medoids to each data … in n-th partition END compute new medoid END END COPY cluster medoids and labels from device to host Algorithm 2. Compute the gain of medoids to each data within the cluster. Based on the complexity of k-medoids, the most complex computation is in the process of finding new medoids thus, it will be parallelized. If is the number of data, the PAM algorithm can be optimized using × table of distance that pre-calculated before k-medoids computation executed. However, creating × table of distance requires a large amount of memory. The goal is to optimize the ( − ) using parallelization scheme without creating a table of distance to avoid large memory consumption on GPU.
Algorithm 2 shows the computation of the gain of medoids to each data within the cluster. Gain is the sum of the distance between the medoids and each data in the same cluster. Shared memory is utilized to store the cluster medoids that repeatedly used in the calculation of distance. This will reduce the latency of global memory access to faster-shared memory access. Data are partitioned and processed by the thread blocks. Each thread computes the nearest medoid to each data, assigns the nearest medoids index to cluster label, and sum up the gain using atomic addition from CUDA in shared memory. The total gain that summed from each data then copied to the device memory.
Algorithm 3 shows the computation of the gain of non-medoids to each data. The configuration of blocks and grids is different from Algorithm 2. While is the number of data and is the number of clusters, then Algorithm 3 requires ( − ) data compared to data. Here, Algorithm 2 only requires data compare to data, where is usually a small number. Therefore in Algorithm 2, the thread blocks assigned to handle the outer ( ) loop and serialize the inner ( ) loop while in Algorithm 3, the threads assigned to handle the inner ( ) loop and the blocks assigned to handle the outer ( − ) loop. With this configuration, maximum occupancy can be achieved.
In the inner loop computation, each non-medoids is compared to the entire dataset. Total gain of each non-medoids is computed by adding the gain of medoid to each data within a cluster (from Algorithm 2) with the gain of non-medoids to each data from outer cluster then subtracted by the gain of non-medoids to each data within a cluster. Because the threads handle the inner loop, sum reduction can be used in shared memory to sum up the gain of non-medoids to each data from the outer cluster with the gain of non-medoids to each data within the cluster. Shared memory also utilized to store the cluster medoids to provide faster access in the distance calculation that similar in Algorithm 2. The total gain of each non-medoids then copied to device memory to be used in the computation of new medoids.
Algorithm 4 shows the computation of new medoids. New medoid is computed by finding the maximum gain that greater than zero from the gain of non-medoids in Algorithm 3. One block is used for the kernel to perform max reduction on data. The index of maximum gain indicates that the data which corresponds to that index has the highest potential as the new cluster medoid. The index and the new cluster medoid then copied to the device memory. If the maximum gain of non-medoids is less than zero for all cluster, then k-medoids iteration will stop otherwise the iteration continues.

B. Performance Comparison
The proposed parallelized PAM in k-medoids is tested using KDDCup dataset. The dataset is modified into a smaller set with a various number of data. We use = 5 to cluster 1,000, 2,000, 3,000, 4,000, 5,000, 10,000, and 20,000 data with 41 attributes. The same initial cluster medoids are used, and the algorithms are computed 10 times to get the average computational time. The experiment compares computational time between CPU implementation of PAM k-medoids with the proposed GPU implementation of PAM k-medoids without creating a pre-calculated table of distance. Fig. 2 shows the performance, is measured based on the computational time in milliseconds against the number of data. The red, orange, and green lines present the CPU, Matlab, and the proposed GPU implementation, respectively. Based on Fig. 2, the proposed parallelized PAM k-medoids achieves more than 11 times speed up from the CPU implementation. For the larger dataset, the computational time of CPU implementation rises significantly, which indicates the proposed GPU implementation is efficient in dealing with this problem. Fig. 2 also shows the performance evaluation of the proposed GPU and Matlab implementation. Matlab uses a table of distance in its PAM k-medoids computation. Matlab recommends the implementation of PAM to be used in a small dataset, which less than 3000 data. The speedup gain using this method compared to the CPU implementation is approximately 3 times and more for more massive datasets. By accessing the lookup table, distance computation on each iteration can be simplified to only reading the data, thus reduce the computational time. However, by using a pre-calculated table of distance, space complexity increases significantly for large datasets. The proposed GPU implementation achieves approximately 2~3 times speedup compared to the Matlab implementation. The result is acceptable because the proposed parallelized PAM does not use a precalculated table of distance; thus, the potential implementation for a larger dataset is possible.

IV. Conclusion
In this research, parallelized PAM k-medoids on GPU is proposed. PAM algorithm for k-medoids returns the best medoids compared to the other algorithms because it uses the entire dataset to find the best potential medoids. However, it has a high computational complexity. PAM algorithm usually implemented using a pre-calculated table of distance to avoid repeated distance calculation, which leads to massive memory consumption. The proposed implementation optimizes the distance computation of the PAM algorithm using a parallel scheme without the pre-calculated table of distance. From the experiment, the proposed parallelized PAM k-medoids is faster 2~3 times from Matlab and 11~15 times from CPU implementation. The proposed method can handle large datasets by performing data partition and process the data in parallel. For future work, the proposed method will be improved to handle categorical dataset, increase the performance using multi GPUs, and compared to the other parallel k-medoids clustering libraries.