Parallel Approach of Adaptive Image Thresholding Algorithm on GPU

Image thresholding is used to segment an image into background and foreground using a given threshold. The threshold can be generated using a specific algorithm instead of a pre-defined value obtained from observation or experiment. However, the algorithm involves per-pixel operations, histogram calculation, and an iterative procedure to search for the optimum threshold, which is costly for high-resolution images. In this research, parallel implementations on GPU of three adaptive image thresholding methods, namely Otsu, ISODATA, and minimum cross-entropy, are proposed to reduce their computational time on high-resolution images. The approach uses parallel reduction and parallel prefix sum (scan) techniques to optimize the calculation. The proposed approach was tested on grayscale images of various sizes. The results show that the parallel implementation of the three adaptive image thresholding methods on GPU achieves a 4-6 times speedup compared to the CPU implementation, reducing the computational time significantly and dealing effectively with high-resolution images.

I. Introduction
Segmentation is a process that partitions an image into segments [1]. It changes the image representation into something more meaningful and easier to analyze, e.g., for finding objects and boundaries. One method of image segmentation is image thresholding, which partitions an image into background and foreground using a given threshold. This process is also called binarization because the segmentation result is a binary image in which "0" pixels are background and "1" pixels are foreground.
The threshold value can be determined manually by observation or experiment. In adaptive image thresholding, however, the threshold is generated by a specific algorithm. The algorithm involves per-pixel operations, histogram calculation, and an iterative procedure to search for the optimum threshold. Therefore, it can be costly for a high-resolution image.
Some well-known adaptive image thresholding algorithms are Otsu [2], Iterative Self-Organizing Data Analysis Technique (ISODATA) [3], and minimum cross-entropy (MCET) [4]. The Otsu method searches for the threshold that maximizes the inter-class variance. The ISODATA method iteratively updates the threshold, computed as the average of the two class means, until it changes by less than a given tolerance or the maximum number of iterations is reached. The MCET method searches for the optimal threshold by calculating the cross-entropy for all possible thresholds and selecting the one with the minimum cross-entropy. These methods have been used in many image processing applications [5][6][7][8][9][10][11][12] to perform automatic image segmentation.
In image processing, real-time performance is often necessary, especially when processing video streams or high-resolution images. High-resolution images are a common product of satellite, aerial, biometric, and medical imaging, and are often used in verification and segmentation processes. To achieve real-time performance, it is crucial to analyze an algorithm's complexity to know where it should be optimized. High-Performance Computing (HPC) technology allows an algorithm to be parallelized on a Graphics Processing Unit (GPU), and parallel computation can speed up the iterative and serial procedures in an algorithm.
Researchers have proposed parallel adaptive image thresholding methods for image segmentation. Kanungo et al. [13] proposed a parallel genetic algorithm-based adaptive thresholding for image segmentation in uneven lighting conditions. Sandeli and Batouche [14] proposed multilevel image thresholding based on a parallel generalized island model (GIM). Nafaji et al. [15] used parallel local adaptive thresholding for the binarization of documents. Upadhyay et al. [16] proposed an adaptive thresholding approach for image segmentation on GPU. All of them gained a significant speedup in computational time over the serial implementation. This research proposes a parallel implementation on GPU of three adaptive image thresholding methods: Otsu, ISODATA, and MCET. Our contribution lies in the parallel approach to these adaptive image thresholding methods on GPU, which reduces their computational time on high-resolution images. This paper is organized as follows: Section II presents the proposed approach of parallel adaptive image thresholding methods, Section III presents the results and discussion, and Section IV presents the conclusion of this work.

II. Method
Adaptive image thresholding is a method to segment images using a threshold generated by a specific algorithm. The purpose of the algorithm is to obtain an optimal threshold for segmentation. In this research, three well-known adaptive image thresholding algorithms, namely Otsu, ISODATA, and MCET, are parallelized to improve their performance on high-resolution images.

A. Otsu Method
The Otsu method was proposed in [2] to perform automatic thresholding on a grayscale image. It searches for the threshold that maximizes the inter-class variance. The steps to apply the Otsu threshold are described below:
a) The image is converted into a normalized gray-level histogram using (1), which is treated as a probability distribution, where $n_i$ is the number of pixels at the $i$-th gray level, $N$ is the total number of pixels, and $p_i$ is the probability of the $i$-th gray level.
$p_i = n_i / N$ (1)
b) Suppose the pixels are divided into two classes (commonly background and foreground). For all possible thresholds $k = 1 \ldots L$, the probability of class occurrence $\omega(k)$, the class mean level $\mu(k)$, and the inter-class variance $\sigma_B^2(k)$ can be calculated using (2), (3), and (4), respectively. Here, $\omega(k)$ and $\mu(k)$ are the zeroth-order and first-order cumulative moments of the histogram, and $\mu_T = \sum_{i=1}^{L} i \cdot p_i$ is the total mean level of the image.
$\omega(k) = \sum_{i=1}^{k} p_i$ (2)
$\mu(k) = \sum_{i=1}^{k} i \cdot p_i$ (3)
$\sigma_B^2(k) = \frac{[\mu_T \, \omega(k) - \mu(k)]^2}{\omega(k) \, [1 - \omega(k)]}$ (4)
c) Select the threshold $k^*$ that maximizes $\sigma_B^2$ using (5). This threshold is the optimal threshold.
$\sigma_B^2(k^*) = \max_{1 \le k \le L} \sigma_B^2(k)$ (5)
If $L$ is the number of gray levels and $N$ is the number of pixels in the image, the computational complexity of the Otsu method for grayscale image segmentation is given by the following operations:
a) Histogram initialization and histogram computation have a computational complexity of $O(L)$ and $O(N)$, respectively.
b) Searching for the optimum threshold by maximizing the inter-class variance has a computational complexity of $O(L)$.
c) Applying the Otsu threshold to the image has a computational complexity of $O(N)$.
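As a rough illustration of steps a) to c), the following C++ sketch computes the Otsu threshold sequentially on the CPU. It is not the implementation evaluated in this paper: the function name `otsu_threshold` is hypothetical, gray levels are indexed 0 to 255 rather than 1 to L, and thresholds that leave one class empty are skipped instead of being guarded with a small constant.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the paper): sequential Otsu threshold for
// 8-bit pixels. Pixels with gray level <= k form the first class.
int otsu_threshold(const std::vector<std::uint8_t>& pixels) {
    assert(!pixels.empty());
    constexpr int L = 256;
    std::array<long long, L> n{};                 // histogram counts n_i
    for (std::uint8_t v : pixels) ++n[v];
    const double N = static_cast<double>(pixels.size());
    const long long total = static_cast<long long>(pixels.size());

    double mu_T = 0.0;                            // total mean level
    for (int i = 0; i < L; ++i) mu_T += i * (n[i] / N);   // p_i = n_i / N, eq. (1)

    long long count = 0;                          // pixels with level <= k
    double mu = 0.0;                              // first-order cumulative moment, eq. (3)
    double best = -1.0;
    int best_k = 0;
    for (int k = 0; k < L; ++k) {
        count += n[k];
        mu += k * (n[k] / N);
        if (count == 0 || count == total) continue;       // one class is empty
        const double omega = count / N;                   // eq. (2)
        const double num = mu_T * omega - mu;
        const double sigma = num * num / (omega * (1.0 - omega));  // eq. (4)
        if (sigma > best) { best = sigma; best_k = k; }            // eq. (5)
    }
    return best_k;
}
```

The single pass over the 256 bins reflects the $O(L)$ search cost stated above; the $O(N)$ cost is confined to building the histogram.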

B. ISODATA Algorithm
The Iterative Self-Organizing Data Analysis Technique (ISODATA) was proposed in [3] to compute a global image threshold. The method uses an iterative procedure to update the threshold. Image segmentation using the ISODATA algorithm is described as follows:
a) Compute the gray-level histogram of the image.
b) Create initial segments by splitting the histogram into background and foreground segments using an initial threshold value $T_0$.
c) Calculate the mean of the background pixels $\mu_b$ and the mean of the foreground pixels $\mu_f$.
d) Calculate a new threshold by averaging the two means using (6).
$T = (\mu_b + \mu_f) / 2$ (6)
e) Repeat steps c and d until the change in the threshold value is less than a given tolerance or the maximum number of iterations is reached.
The computational complexity of the ISODATA method for grayscale image segmentation, where $L$ is the number of gray levels and $N$ is the number of pixels in the image, is given by the following operations:
a) Histogram initialization and histogram computation have a computational complexity of $O(L)$ and $O(N)$, respectively.
b) Updating the threshold until its change is less than a given tolerance or the maximum number of iterations is reached has a computational complexity of $O(I)$, where $I$ is the number of iterations required by the algorithm.
c) Applying the ISODATA threshold to the image has a computational complexity of $O(N)$.
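The iteration described in the ISODATA steps above can be sketched sequentially in C++ as follows. This is an illustrative sketch under assumptions rather than the paper's kernel: the name `isodata_threshold` and the `max_iter` parameter are hypothetical, the initial threshold is taken as the global mean gray level, thresholds are rounded to integers, and the loop stops when the threshold no longer changes.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the paper): sequential ISODATA threshold for
// 8-bit pixels. Levels <= T are background, levels > T are foreground.
int isodata_threshold(const std::vector<std::uint8_t>& pixels, int max_iter = 100) {
    assert(!pixels.empty());
    std::array<double, 256> h{};                  // gray-level histogram
    for (std::uint8_t v : pixels) h[v] += 1.0;

    double total = 0.0, weighted = 0.0;
    for (int i = 0; i < 256; ++i) { total += h[i]; weighted += i * h[i]; }

    // Assumed initial threshold: the rounded global mean.
    int T = static_cast<int>(std::floor(weighted / total + 0.5));
    for (int it = 0; it < max_iter; ++it) {
        double nb = 0.0, sb = 0.0;                // background count and weighted sum
        for (int i = 0; i <= T; ++i) { nb += h[i]; sb += i * h[i]; }
        const double nf = total - nb, sf = weighted - sb;
        // If one class is empty, fall back to the mean of the other class.
        const double mean_b = nb > 0.0 ? sb / nb : sf / nf;
        const double mean_f = nf > 0.0 ? sf / nf : mean_b;
        const int T_new = static_cast<int>(std::floor((mean_b + mean_f) / 2.0 + 0.5));
        if (T_new == T) break;                    // threshold has converged
        T = T_new;
    }
    return T;
}
```

Each iteration here rescans the histogram; precomputing cumulative sums, as the GPU version in this paper does, removes that inner loop.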

C. Minimum Cross-Entropy method
The minimum cross-entropy (MCET) method was proposed in [4] to select an optimal threshold. The method searches for the optimal threshold by calculating the cross-entropy for all possible thresholds and selecting the one with the minimum cross-entropy. The procedure to apply the minimum cross-entropy method for image segmentation is described below:
a) Compute the normalized gray-level histogram of the image using (7), where $n_i$ is the number of pixels at gray level $i$, $N$ is the total number of pixels, and $p_i$ is the probability of gray level $i$.
$p_i = n_i / N$ (7)
b) Initialize the entropy of the gray-level histogram using (8), where $g_{min}$ and $g_{max}$ are the minimum and maximum gray-level intensities.
$H = \sum_{i=g_{min}}^{g_{max}} i \cdot p_i \log(i)$ (8)
c) Suppose the pixels are divided into two classes, background and foreground, by a threshold $t$. If the mean of the pixel distribution below the threshold (background) is $\mu_b(t)$ and the mean of the pixel distribution above the threshold (foreground) is $\mu_f(t)$, then for all possible thresholds $t = g_{min} \ldots g_{max}$, calculate the cross-entropy of the pixel distribution below and above the threshold using (9).
$D(t) = H - \sum_{i=g_{min}}^{t} i \cdot p_i \log \mu_b(t) - \sum_{i=t+1}^{g_{max}} i \cdot p_i \log \mu_f(t)$ (9)
d) Select the optimal threshold $t^*$ corresponding to the minimum of the cross-entropy using (10).
$t^* = \arg\min_{t} D(t)$ (10)
If $L$ is the number of gray levels and $N$ is the number of pixels in the image, the computational complexity of the MCET method for grayscale image segmentation is given by the following operations:
a) Histogram initialization and histogram computation have a computational complexity of $O(L)$ and $O(N)$, respectively.
b) Selecting the minimum cross-entropy from all possible thresholds has a computational complexity of $O(L^2)$.
c) Applying the MCET threshold to the image has a computational complexity of $O(N)$.
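The MCET procedure above can be sketched in C++ as follows. This is a hedged illustration, not the paper's code: the name `mcet_threshold` is hypothetical, gray levels are shifted to 1..256 so that the logarithm is always defined, unnormalized counts are used (scaling by $N$ does not change which threshold is minimal), and cumulative sums replace the direct $O(L^2)$ evaluation, so each candidate threshold costs constant time here.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not from the paper): sequential minimum cross-entropy
// threshold for 8-bit pixels. Levels <= t are background.
int mcet_threshold(const std::vector<std::uint8_t>& pixels) {
    assert(!pixels.empty());
    constexpr int L = 256;
    std::array<long long, L> n{};
    for (std::uint8_t v : pixels) ++n[v];

    // Cumulative zeroth-order (count) and first-order (level * count) sums.
    std::array<long long, L> c0{}, c1{};
    long long a0 = 0, a1 = 0;
    double H = 0.0;                               // sum of g * n_g * log(g)
    for (int i = 0; i < L; ++i) {
        const long long g = i + 1;                // assumed 1-based gray level
        a0 += n[i];
        a1 += g * n[i];
        c0[i] = a0;
        c1[i] = a1;
        H += static_cast<double>(g * n[i]) * std::log(static_cast<double>(g));
    }

    double best = std::numeric_limits<double>::infinity();
    int best_t = 0;
    for (int t = 0; t < L; ++t) {
        const long long nb = c0[t], nf = c0[L - 1] - c0[t];
        if (nb == 0 || nf == 0) continue;         // one class is empty
        const double sb = static_cast<double>(c1[t]);
        const double sf = static_cast<double>(c1[L - 1] - c1[t]);
        // Cross-entropy with class means mu_b = sb / nb and mu_f = sf / nf.
        const double D = H - sb * std::log(sb / nb) - sf * std::log(sf / nf);
        if (D < best) { best = D; best_t = t; }
    }
    return best_t;
}
```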

D. Parallel Computing on GPU
A GPU (Graphics Processing Unit) is a highly parallel architecture designed for fast computer graphics operations. It can now also be used for non-graphics workloads, which is known as GP-GPU (General Purpose-Graphics Processing Unit) computing [17]. A well-known general-purpose parallel computing platform and programming model is Compute Unified Device Architecture (CUDA) from NVIDIA.
A GPU is a highly parallel, multithreaded, many-core processor with very high memory bandwidth. The difference between how the CPU and the GPU process data is shown in Figure 1(a) and Figure 1(b). The GPU devotes more transistors to data processing than to caching and flow control. It is built on an array of streaming multiprocessors (SMs), and its threads are organized into grids and blocks.
Data-parallel processing maps data elements to parallel processing threads. Figure 1(c) shows the parallel processing threads in the GPU. A multithreaded program is partitioned into blocks of threads that execute independently of each other. Therefore, on a GPU the computation of the adaptive image thresholding algorithms is processed in parallel, reducing the computational time.
The adaptive image thresholding methods involve histogram calculation, cumulative sums, and searching for the minimum or maximum value of an array. Using the GPU's parallel architecture, these operations can be optimized with the parallel reduction and parallel prefix sum (scan) algorithms.

1) Parallel Reduction Algorithm
A parallel reduction algorithm can optimize the computation of an array's sum, minimum, or maximum value. For a histogram with $n$ bins held in shared memory, the first iteration uses $n/2$ threads in parallel: each thread sums (sum reduction) or compares (min or max reduction) one bin in the first half with the corresponding bin in the second half. The number of active elements is halved at every iteration until a single result remains, so the computational complexity is $O(\log n)$ parallel steps. When the remaining elements fit within a single thread warp, the loop can be unrolled to reduce loop overhead. The illustration of the parallel sum reduction algorithm is shown in Figure 2.
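The halving pattern described above can be emulated on the CPU with the short C++ sketch below. It only illustrates the access pattern: the name `tree_sum` is hypothetical, each iteration of the inner loop stands in for one GPU thread, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): emulates a shared-memory tree
// reduction. The outer loop corresponds to the log2(n) synchronized steps,
// the inner loop to the threads that are active in each step.
double tree_sum(std::vector<double> data) {
    assert(!data.empty() && (data.size() & (data.size() - 1)) == 0);
    for (std::size_t s = data.size() / 2; s > 0; s >>= 1) {
        for (std::size_t t = 0; t < s; ++t) {
            data[t] += data[t + s];               // first half += second half
        }
        // On the GPU, the threads of the block would synchronize here.
    }
    return data[0];
}
```

Within one step, each emulated thread writes only to index `t` and reads only from index `t + s`, so running the inner loop sequentially gives the same result as running it in parallel.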

2) Parallel Prefix Sum (Scan) Algorithm
A parallel prefix sum (scan) algorithm can be used to calculate the cumulative sum of the histogram in shared memory. For a histogram with $n$ bins, the procedure of the parallel prefix sum (scan) algorithm is described as follows:
a) Up-sweep (reduction) phase: add every bin to the bin on its right at the current stride, doubling the stride at each step, so that partial sums are built up as a balanced tree. This step has a computational complexity of $O(\log n)$. The illustration of the up-sweep (reduction) phase is shown in Figure 3.
b) Set the last bin in the histogram to zero.
c) Down-sweep phase: traverse the tree back from the last bin, halving the stride at each step. At each step, every right bin passes its value to the bin on its left at the current stride and receives the sum of its own value and the previous value of that left bin. This step also has a computational complexity of $O(\log n)$. The illustration of the down-sweep phase is shown in Figure 4.
The parallel prefix sum (scan) algorithm has a computational complexity of $O(2 \log n)$, with $O(\log n)$ steps each in the up-sweep phase and the down-sweep phase.
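The three phases of the scan can be traced with the following sequential C++ sketch. It is an emulation for illustration only: the name `blelloch_exclusive_scan` is hypothetical, the input length is assumed to be a power of two, and the result is an exclusive scan (each output is the sum of all earlier inputs), which is what zeroing the last element before the down-sweep produces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): work-efficient exclusive scan
// with an up-sweep, a zeroed last element, and a down-sweep.
std::vector<double> blelloch_exclusive_scan(std::vector<double> a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);          // power-of-two length assumed
    // Up-sweep: build partial sums with a doubling stride.
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            a[i] += a[i - stride];
        }
    }
    a[n - 1] = 0.0;                               // set the last element to zero
    // Down-sweep: push sums back down with a halving stride.
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const double left = a[i - stride];
            a[i - stride] = a[i];                 // left bin takes the right bin's value
            a[i] += left;                         // right bin takes the sum
        }
    }
    return a;
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 the sketch returns 0, 3, 4, 11, 11, 15, 16, 22. An inclusive cumulative histogram can be obtained by shifting this result left by one position and appending the total.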

III. Result and Discussion
The computational time of the adaptive image thresholding algorithms on GPU was tested on the FVC2004 (Fingerprint Verification Competition) dataset [20]. The dataset consists of fingerprint images. Selected images in the dataset were resized to various sizes using the bi-cubic interpolation method. The proposed approach is built using C++ with the CUDA library and runs on an Intel Core i7-7700HQ 2.8 GHz processor with 16 GB of RAM and an NVIDIA GeForce GTX 1050. The GPU has the Pascal architecture with five streaming multiprocessors and compute capability 6.1.

A. Adaptive Image Thresholding Implementation
In this research, three adaptive image thresholding algorithms are implemented on GPU: Otsu, ISODATA, and MCET. The parallel approach of the three methods is the same except for the step that finds the optimum threshold for binarization. First, the image data is copied from the host to the device memory. Several kernels are then used to compute the histogram, the probability histogram, and the cumulative histogram, to find the optimal threshold, and to apply the threshold to the image. Finally, the resulting binary image is copied back from the device memory to the host. The implementation of the Otsu, ISODATA, and MCET methods on GPU is shown in Algorithm 1.
As shown in Algorithm 1, the parallel approach of the adaptive image thresholding methods uses several kernels that each perform a specific operation. This keeps each computation on the streaming multiprocessors short and increases their availability. The number of threads per block and blocks per grid can be configured for each kernel to run it effectively. The approach is also suitable for error handling because each kernel execution can be monitored.
    SWITCH method
    CASE OTSU
        threshold ← find the threshold that maximizes the inter-class variance from the cumulative sum histogram
    CASE ISODATA
        threshold ← update the threshold until its change is less than a given tolerance or the maximum number of iterations is reached
    CASE MCET
        above-threshold and below-threshold means ← compute above-threshold and below-threshold means from the cumulative sum histogram
        cross-entropy histogram ← compute the cross-entropy histogram from the above-threshold and below-threshold means
        threshold ← compute the index of the minimum cross-entropy from the cross-entropy histogram
    END SWITCH
    binary image ← apply threshold to image data
    COPY binary image from device (GPU) to host (CPU)
Fig. 4. Illustration of down-sweep phase [19]
The highest computational complexity is $O(N)$, which lies in the histogram computation and image thresholding steps. The parallel implementation of these steps reduces the computational time because the work is distributed across all threads used for the computation and processed concurrently. The parallel approach of histogram computation on GPU is shown in Algorithm 2.
Histogram computation uses the atomic addition function from CUDA and utilizes shared memory to store a partial histogram per thread block, which limits the contention on each addition to the number of threads in a block. The partial histograms in shared memory are then merged in parallel into the histogram in global memory. This operation also uses atomic addition, with contention limited to the number of blocks in the grid.
Without partial histogram computation in shared memory, the histogram computation is likely to have long queues and be forced into serial computation, because every pixel requires an atomic addition to one specific bin of the single histogram in global memory. For the gray-level histogram, the number of histogram bins is fixed at 256, so the queue length is proportional to the amount of data and depends on its distribution in the image. With partial histogram computation, the queue is reduced to the number of threads and blocks used.
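The two-level accumulation described above can be emulated on the CPU as in the C++ sketch below. It is only an analogy under assumptions: the name `block_histogram` and the `block_size` parameter are hypothetical, each chunk of `block_size` pixels stands in for one thread block, the local array stands in for shared memory, and the plain increments stand in for the atomic additions that a CUDA kernel would use.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the paper): per-block partial histograms
// that are merged into one global 256-bin histogram.
std::array<unsigned, 256> block_histogram(const std::vector<std::uint8_t>& pixels,
                                          std::size_t block_size) {
    assert(block_size > 0);
    std::array<unsigned, 256> global{};
    for (std::size_t start = 0; start < pixels.size(); start += block_size) {
        std::array<unsigned, 256> partial{};      // stands in for shared memory
        const std::size_t end = std::min(start + block_size, pixels.size());
        for (std::size_t i = start; i < end; ++i) {
            ++partial[pixels[i]];                 // atomic add on shared memory in CUDA
        }
        for (std::size_t bin = 0; bin < 256; ++bin) {
            global[bin] += partial[bin];          // atomic add on global memory in CUDA
        }
    }
    return global;
}
```

In this arrangement each global bin receives one update per block instead of one update per pixel, which is the reduction in queue length that the text describes.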

After the histogram is obtained, the probability histograms are computed by simple element-wise operations. The Otsu method uses the zeroth-order probability histogram, computed by dividing the value in every bin by the total number of pixels, and the first-order probability histogram, computed by multiplying the zeroth-order histogram by the corresponding gray level. The MCET method uses an entropy histogram, computed by multiplying the first-order histogram by the logarithm of the gray level. The ISODATA method uses the zeroth-order and first-order histograms. The kernel configuration is one block with 256 threads to calculate the 256-bin histograms.
The computation of the cumulative sum of the histogram uses a parallel prefix sum (scan) algorithm, which reduces the computational complexity from $O(n)$ to $O(2 \log n)$ parallel steps. A bank conflict occurs when two or more threads in a warp access different addresses in the same shared-memory bank, which forces the accesses to be serialized. To avoid bank conflicts, the kernel uses a thread block with half as many threads as histogram bins, so that each thread handles two bins, and adds padding offsets to the shared-memory indices. The computation of the cumulative sum of a histogram is shown in Algorithm 3.
    // load data to shared memory
    ALLOCATE shared memory (smem) to store the cumulative sum of histogram
    the (p + offset1) th index of smem cumulative sum of histogram ← the p th index of probability histogram
    the (q + offset2) th index of smem cumulative sum of histogram ← the q th index of probability histogram
    SYNCHRONIZE the threads
    // up-sweep (reduction) phase
    FOR d = n >> 1 TO d > 0 DO
        SYNCHRONIZE the threads
        IF t < d THEN
            p ← offset * (2 * t + 1) - 1
            q ← offset * (2 * t + 2) - 1
            p ← p + (p >> 4)
            q ← q + (q >> 4)
            the q th index of smem cumulative sum of histogram ← the q th index of smem cumulative sum of histogram + the p th index of smem cumulative sum of histogram
        END IF
        offset ← offset * 2
        d ← d >> 1
    END FOR
    // set the last element to zero
    IF t = 0 THEN
        the (n - 1) th index of cumulative sum of histogram ← the (n - 1 + ((n - 1) >> 4)) th index of smem cumulative sum of histogram
        the (n - 1 + ((n - 1) >> 4)) th index of smem cumulative sum of histogram ← 0
    END IF
    // down-sweep phase
    FOR d = 1 TO d < n DO
        offset ← offset >> 1
        SYNCHRONIZE the threads
        IF t < d THEN
            p ← offset * (2 * t + 1) - 1
            q ← offset * (2 * t + 2) - 1
            p ← p + (p >> 4)
            q ← q + (q >> 4)
            temp value ← the p th index of smem cumulative sum of histogram
            the p th index of smem cumulative sum of histogram ← the q th index of smem cumulative sum of histogram
            the q th index of smem cumulative sum of histogram ← the q th index of smem cumulative sum of histogram + temp value
        END IF
        d ← d * 2
    END FOR
    SYNCHRONIZE the threads
    // copy data from shared memory to global memory
    the p th index of cumulative sum of histogram ← the (p + 1 + ((p + 1) >> 4)) th index of smem cumulative sum of histogram
    IF q < n - 1 THEN
        the q th index of cumulative sum of histogram ← the (q + 1 + ((q + 1) >> 4)) th index of smem cumulative sum of histogram
    END IF
END FUNCTION
The computation to find the optimal threshold from the cumulative sum of the histogram is different for each method. However, because all methods are based on a histogram, a block with 256 threads is used to match the number of histogram bins. For the Otsu method, the threshold that maximizes the inter-class variance is found by using a parallel reduction algorithm to find the index of the maximum inter-class variance. Algorithm 4 shows the computation of the inter-class variance on GPU.

FUNCTION compute the inter-class variances
    READ cumulative sum of 0th order and 1st order probability histogram
    t ← threadIdx.x
    n ← the number of histogram bins
    // load data to shared memory
    ALLOCATE shared memory (smem) to store the cumulative sum of 0th and 1st order probability histogram
    smem cumulative sum of 0th order histogram ← cumulative sum of 0th order probability histogram
    smem cumulative sum of 1st order histogram ← cumulative sum of 1st order probability histogram
    smem value ← 0
    smem index ← t
    SYNCHRONIZE the threads
    // compute inter-class variances
    numerator ← square of (the (n - 1) th index of smem cumulative sum of 1st order histogram * the t th index of smem cumulative sum of 0th order histogram - the t th index of smem cumulative sum of 1st order histogram)
    denominator ← the t th index of smem cumulative sum of 0th order histogram * (1 - the t th index of smem cumulative sum of 0th order histogram) + EPSILON
    the t th index of smem value ← numerator / denominator
    SYNCHRONIZE the threads
    // find the index of maximum value of inter-class variance using parallel reduction algorithm
    FOR s = blockDim.x / 2 TO s > 0 DO
        IF t < s AND the (t + s) th index of smem value > the t th index of smem value THEN
            the t th index of smem index ← the (t + s) th index of smem index
            the t th index of smem value ← the (t + s) th index of smem value
        END IF
        SYNCHRONIZE the threads
        s ← s >> 1
    END FOR
    // get the index of maximum value and copy to global memory
    IF t = 0 THEN
        threshold ← the 0 th index of smem index
    END IF
END FUNCTION
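The index-carrying reduction at the end of Algorithm 4 can be emulated with the C++ sketch below. It is a sequential stand-in for the kernel loop, with assumptions: the name `tree_argmax` is hypothetical, the inner loop plays the role of the threads, and the input length is a power of two (256 bins in this paper).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the paper): tree reduction that keeps the
// index of the winning element alongside its value.
std::size_t tree_argmax(std::vector<double> value) {
    const std::size_t n = value.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<std::size_t> index(n);
    std::iota(index.begin(), index.end(), std::size_t{0});   // smem index <- t
    for (std::size_t s = n / 2; s > 0; s >>= 1) {
        for (std::size_t t = 0; t < s; ++t) {
            if (value[t + s] > value[t]) {        // keep the larger value and its index
                value[t] = value[t + s];
                index[t] = index[t + s];
            }
        }
    }
    return index[0];
}
```

Replacing `>` with `<` gives the corresponding minimum search. With the strict comparison, ties are resolved by the reduction order rather than by the lowest index.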
At each iteration of the ISODATA method, the threads compute the means of the data below and above the threshold, compute the new threshold, and compare it with the previous threshold. If the difference between the thresholds is less than a given tolerance or the maximum number of iterations is reached, the optimum threshold is obtained. Algorithm 5 shows the ISODATA computation on GPU.
    // load data to shared memory
    ALLOCATE shared memory (smem) to store the cumulative sum of 0th order and 1st order histogram
    smem cumulative sum of 0th order histogram ← cumulative sum of 0th order histogram
    smem cumulative sum of 1st order histogram ← cumulative sum of 1st order histogram
    smem means below-threshold ← 0
    smem means above-threshold ← 0
    smem value ← 0
    SYNCHRONIZE the threads
    // compute all possible means below-threshold and above-threshold
    IF t < n - 1 THEN
        the t th index of smem means below-threshold ← floor ((the t th index of smem cumulative sum of 1st order histogram / (the t th index of smem cumulative sum of 0th order histogram + EPSILON)) + 0.5)
        numerator ← the (n - 1) th index of smem cumulative sum of 1st order histogram - the (t + 1) th index of smem cumulative sum of 1st order histogram
        denominator ← the (n - 1) th index of smem cumulative sum of 0th order histogram - the (t + 1) th index of smem cumulative sum of 0th order histogram + EPSILON
        the t th index of smem means above-threshold ← floor ((numerator / denominator) + 0.5)
    END IF
    SYNCHRONIZE the threads
    // compute the average inter-class means
    the t th index of smem value ← floor (((the t th index of smem means below-threshold + the t th index of smem means above-threshold) / 2) + 0.5)
    SYNCHRONIZE the threads
    // compute the difference between the current threshold and the previous threshold
    IF t = 0 THEN
        iteration ← 0
        difference ← 1
        T ← floor ((the (n - 1) th index of cumulative sum of 1st order histogram / (the (n - 1) th index of cumulative sum of 0th order histogram + EPSILON)) + 0.5)
        WHILE difference > 0 AND iteration < maximum number of iterations DO
            threshold ← the T th index of smem value
            difference ← absolute of (threshold - T)
            T ← threshold
            iteration ← iteration + 1
        END WHILE
    END IF
    SYNCHRONIZE the threads
END FUNCTION
The cross-entropy computation uses a parallel sum reduction algorithm to compute the sums of the above-threshold and below-threshold entropy from the histogram. The sums are used to compute the cross-entropy histogram. To evaluate all possible thresholds in parallel, the configuration uses a grid of 256 blocks with 256 threads each: each block handles one candidate threshold and performs the parallel sum reduction for that threshold. Algorithm 6 shows the cross-entropy computation on GPU.
Algorithm 6. The computation of cross-entropy on GPU.
GPU CONFIGURATION
    block ← 256
    grid ← 256
FUNCTION compute the cross-entropy
    READ 0th order probability histogram, cumulative sum of 0th order and 1st order probability histogram
    b ← blockIdx.x
    t ← threadIdx.x
    n ← the number of histogram bins
    // load data to shared memory
    ALLOCATE shared memory (smem) to store the sum and entropy below-threshold and above-threshold
    data below-threshold ← the b th index of cumulative sum of 1st order probability histogram / (the b th index of cumulative sum of 0th order probability histogram + EPSILON)
    data above-threshold ← (the (n - 1) th index of cumulative sum of 1st order probability histogram - the b th index of cumulative sum of 1st order probability histogram) / (the (n - 1) th index of cumulative sum of 0th order probability histogram - the b th index of cumulative sum of 0th order probability histogram + EPSILON)
    the t th index of smem below-threshold entropy ← 0
    the t th index of smem above-threshold entropy ← 0
    SYNCHRONIZE the threads
    // compute entropy above-threshold and below-threshold
    IF t > b AND data above-threshold > 0 THEN
        the t th index of smem above-threshold entropy ← (t + 1) * the t th index of 0th order probability histogram * log of (data above-threshold)
    END IF
    IF t <= b AND data below-threshold > 0 THEN
        the t th index of smem below-threshold entropy ← (t + 1) * the t th index of 0th order probability histogram * log of (data below-threshold)
    END IF
    SYNCHRONIZE the threads
    // perform parallel sum reduction
    FOR s = blockDim.x / 2 TO s > 0 DO
        IF t < s THEN
            the t th index of smem above-threshold entropy ← the t th index of smem above-threshold entropy + the (t + s) th index of smem above-threshold entropy
            the t th index of smem below-threshold entropy ← the t th index of smem below-threshold entropy + the (t + s) th index of smem below-threshold entropy
        END IF
        SYNCHRONIZE the threads
        s ← s >> 1
    END FOR
    // compute cross-entropy
    IF t = 0 THEN
        the b th index of cross-entropy histogram ← global entropy - the 0 th index of smem above-threshold entropy - the 0 th index of smem below-threshold entropy
    END IF
END FUNCTION
Finding the index of minimum cross-entropy can be done using a parallel reduction algorithm that compares half of the histogram bins with the other half of the histogram bins. The number of histogram bins is reduced for every iteration. Algorithm 7 shows the computation to find the index of minimum cross-entropy on GPU.
Algorithm 7. The computation to find the index of minimum cross-entropy on GPU.
GPU CONFIGURATION
    block ← 256
    grid ← 1
FUNCTION find the index of minimum cross-entropy
    READ cross-entropy histogram
    t ← threadIdx.x
    // load data to shared memory
    ALLOCATE shared memory (smem) to store the cross-entropy histogram
    smem cross-entropy histogram ← cross-entropy histogram
    smem index ← t
    SYNCHRONIZE the threads
    // find index of minimum value using reduction
    FOR s = blockDim.x / 2 TO s > 0 DO
        IF t < s AND the (t + s) th index of smem cross-entropy histogram < the t th index of smem cross-entropy histogram THEN
            the t th index of smem cross-entropy histogram ← the (t + s) th index of smem cross-entropy histogram
            the t th index of smem index ← the (t + s) th index of smem index
        END IF
        SYNCHRONIZE the threads
        s ← s >> 1
    END FOR
    // copy the result to global memory
    IF t = 0 THEN
        threshold ← the 0 th index of smem index
    END IF
END FUNCTION
The image thresholding step is parallelized using thread-level parallelism on GPU. The approach is practical because the operation on each pixel is independent. The result of image thresholding is a binary image with "1" for pixels above the threshold and "0" for pixels below the threshold. Algorithm 8 shows the implementation of image thresholding on GPU.
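The per-pixel operation can be sketched in C++ as below. This is a sequential stand-in in which each loop iteration corresponds to one GPU thread. The name `apply_threshold` is hypothetical, and mapping a pixel that is exactly equal to the threshold to "0" is an assumption, since the text only specifies the cases above and below the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the paper): binarize 8-bit pixels.
// Assumption: pixels equal to the threshold are treated as background (0).
std::vector<std::uint8_t> apply_threshold(const std::vector<std::uint8_t>& pixels,
                                          int threshold) {
    assert(threshold >= 0 && threshold <= 255);
    std::vector<std::uint8_t> binary(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        // Every pixel is independent, so each iteration could run as its own thread.
        binary[i] = pixels[i] > threshold ? 1 : 0;
    }
    return binary;
}
```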

B. Adaptive Image Thresholding Result
The parallel adaptive image thresholding methods were tested on selected images from the FVC2004 (Fingerprint Verification Competition) dataset [20]. The result of the adaptive image thresholding implementation is a binary image, as shown in Figure 5, where (a) is the fingerprint image, (b) is the binary image generated by the Otsu method with threshold = 154, (c) is the binary image generated by the ISODATA method with threshold = 156, and (d) is the binary image generated by the MCET method with threshold = 123. As shown in Figure 5, the methods produce different optimal thresholds because each method uses a different criterion to search for the optimum threshold.

C. Computational Time Evaluation
The test was conducted on selected images from the FVC2004 (Fingerprint Verification Competition) dataset [20]. The images were resized to generate various image sizes, namely 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096. The purpose of this experiment is to measure the computational time of the proposed parallel approach of adaptive image thresholding methods when dealing with a large amount of data (pixels).
The computational time evaluation on CPU and GPU is shown in Figure 6, where (a) is the Otsu method, (b) is the ISODATA method, and (c) is the MCET method. The proposed parallel approach on GPU gains a speedup of 4-6 times over the CPU implementation of the adaptive image thresholding methods. The performance gain increases significantly when dealing with larger data. The result shows that the parallel approach of the adaptive image thresholding methods on GPU allows image segmentation to be processed in real time, even when dealing with high-resolution images.

IV. Conclusion
Image processing applications that perform segmentation usually take high-resolution images, such as satellite, aerial, biometric, or medical images, as input. A segmentation method that involves per-pixel operations and an iterative procedure can be costly when handling the large number of pixels in a high-resolution image. Therefore, this research proposed a parallel approach of adaptive image thresholding algorithms, namely Otsu, ISODATA, and minimum cross-entropy, on GPU to deal with high-resolution images. The experiment was conducted on selected fingerprint images taken from the FVC2004 (Fingerprint Verification Competition) dataset. In the experiment with various image resolutions, the GPU implementation shows a 4-6 times speedup over the CPU implementation. The performance gain increases significantly when dealing with larger image resolutions. This result shows that the parallel approach allows image segmentation to be processed in real time, even when dealing with high-resolution images. The contribution of this work is the analysis showing which parts of the adaptive image thresholding algorithms can be optimized using the parallel approach to produce a significant speedup in computational time on high-resolution images. In future work, the proposed parallel approaches will be further optimized using multiple GPUs and applied to more complex cases such as the segmentation of aerial or medical images.