Face Images Classification using VGG-CNN

ABSTRACT

Image classification is a fundamental problem in computer vision. In facial recognition, image classification can speed up the training process and significantly improve accuracy. Deep learning methods are commonly used in facial recognition; one of them is the Convolutional Neural Network (CNN) method, which achieves high accuracy. This study combines a CNN for facial recognition with VGG for the classification process. The process begins by inputting the face image. Then, a feature-extractor preprocessing method is used for transfer learning. This study uses the VGG-face model as an optimization of transfer learning with a pre-trained model architecture. Specifically, the features extracted from an image are numeric vectors, which the model uses to describe specific features of the image. The face images are divided into two parts: 17% test data and 83% training data. The results show that the validation accuracy (val_accuracy), loss, and validation loss (val_loss) values are excellent. The best training results come from digital-camera images with the modified classification layers: val_accuracy is very high (99.84%) and not far from the accuracy value (94.69%). This slight difference indicates a good model, since too large a difference would cause underfitting; conversely, an accuracy value higher than the validation accuracy would cause overfitting. Likewise, for the losses, val_loss is 0.69% and loss is 10.41%.
This is an open access article under the CC BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Keywords:

I. Introduction
This study aims to classify facial images using the CNN method. The pretrained model used is the VGG-face model [8], whose strong results were obtained with 16-19 weight layers. The classification modeling in this study modifies the last layer of the CNN.

II. Method
In this study, several processes are carried out to achieve the expected result: data collection, feature-map extraction, classification modeling, and validation testing of the results. This study uses the KomNet dataset [13], consisting of 36,600 face images with a size of 224×224 pixels. In computer vision, transfer learning is commonly realized through a pretrained model; a typical implementation imports and uses models from existing libraries.
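As a minimal sketch of importing a pretrained model from an existing library, the snippet below loads a stock VGG16 convolutional base from Keras as a stand-in for the paper's VGG-face model (VGG-face shares the 16-layer VGG design but ships through third-party packages). The `weights=None` argument is used here only to avoid a download; in practice pretrained weights would be loaded.

```python
# Transfer-learning setup: load a VGG-style convolutional base from the
# Keras model zoo. VGG16 stands in for the paper's VGG-face model here.
from tensorflow.keras.applications import VGG16

# include_top=False drops the original classifier so the convolutional
# layers can serve as a reusable feature extractor; the input shape matches
# the paper's 224x224-pixel face images.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers during transfer learning

print(base.output_shape)  # feature maps of shape (None, 7, 7, 512)
```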
The next step is to create a new convolutional neural network (CNN) model for multiclass image classification. This image classification model is generated through a transfer learning approach based on a CNN pre-trained model [14]. In general, CNNs have proved superior in a variety of computer vision tasks [15], and Convolutional Networks (ConvNets) have shown excellent performance in handwritten digit classification and face detection [16]. Figure 1 outlines the CNN processes in the system. The process begins by inputting the face image. The method used for transfer learning is the feature-extractor preprocessor. This study uses an optimization model of transfer learning with a pre-trained model architecture, the VGG-face model. The features extracted from an image are numeric vectors, which the model uses to describe specific features of the image. The VGG-face model was selected because it is well suited to producing facial feature extraction [17]; the feature extractor has the 16-layer VGG-face architecture. After the VGG-face model is applied, its last layers are modified to achieve the best result. Figure 2 presents the VGG-face architecture, in which the last three layers are the classification layers to be modified. The first-layer features are general and the last-layer features are specific, so there must be a transition from general to specific somewhere in the network [18]. The pre-trained strategy leaves the initial layers untouched and trains only the final layers to avoid overfitting [19]: the initial layers perform convolution or feature extraction, while the last layers perform classification. The last three layers of the VGG-face model are the fully connected layers before the output layer. These layers provide a rich set of features for describing an input image and useful input when training a new image classification model.
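The layer replacement described above can be sketched as follows: keep the VGG convolutional base frozen and attach a new multiclass head in place of the final fully connected layers. `NUM_CLASSES` and the 256-unit width are hypothetical placeholders (the paper does not state them), and stock VGG16 with `weights=None` again stands in for the pretrained VGG-face model.

```python
# Replace the last (fully connected) layers of a VGG-style base with a new
# multiclass classification head, as in the paper's modified architecture.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 10  # hypothetical placeholder for the number of face identities

base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # the general early layers stay frozen

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),             # new fully connected layer
    layers.Dense(NUM_CLASSES, activation="softmax"),  # new output layer
])
model.build((None, 224, 224, 3))
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the base and training only the new head follows the pre-trained strategy cited above: the general convolutional features are reused, and only the task-specific classifier is learned.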
After the pre-trained model from the last VGG-face layer is loaded, the next step is to create the training and test data. This consists of five stages. The first stage loads each image in the train or test folder with a target size of 224 (224×224 pixels). The second stage converts the image into an array. The third stage feeds the result into the last VGG-face layer. The fourth stage enters the results into the training data array or the test data array. The last stage repeats stage one until all face images in the train or test folder have been read. After the model is built, the next step is training for 100 epochs. This process produces the weight values, which are stored in a file in h5 format. Tests are performed to validate the results by training on facial images from several devices, and the test results for each device are displayed in graphical form.
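The five stages above can be sketched as a single loop, with the frozen VGG base acting as the feature extractor. The folder paths, the `model` variable, and the saved filename in the trailing comment are hypothetical placeholders.

```python
# Five-stage array-building loop: load image -> array -> VGG features ->
# append to the data array -> repeat for every image in the folder.
import numpy as np
from tensorflow.keras.utils import load_img, img_to_array

def build_feature_array(extractor, image_paths):
    """Run each face image through the feature extractor; return stacked features."""
    feats = []
    for path in image_paths:                             # stage 5: repeat for every image
        img = load_img(path, target_size=(224, 224))     # stage 1: load at 224x224 pixels
        arr = img_to_array(img)[np.newaxis]              # stage 2: image -> array batch
        feats.append(extractor.predict(arr, verbose=0))  # stage 3: last-layer VGG features
    return np.vstack(feats)                              # stage 4: combined data array

# Training sketch (100 epochs, weights saved in h5 format), assuming a
# compiled classifier `model` and label arrays y_train / y_test exist:
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=100)
# model.save_weights("vggface_classifier.h5")
```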

III. Results and Discussions
The use of massive data is necessary to produce an ideal result. The pre-trained model is a converted model provided by TensorFlow or Keras and can be used directly from the VGG-face Keras library. After the model is built, the next step is the training process with 100 epochs; this limit keeps the iteration over the large amount of data, which takes a long time, within one training session. A weakness of deep learning methods is the long training process on a server computer, which can be overcome using Graphics Processing Unit (GPU) technology [9][20]. This study uses the GPU provided by Google Colab for the training process. The result of the training process is the weight values, which are stored in an h5-format file. The training results at epoch 100 for the three sources of face images are presented in Table 1. In the training process, the facial images are divided into two parts: 17% as test data and 83% as training data. The accuracy, val_accuracy, loss, and val_loss values are impressive. The best training result comes from the digital-camera images with the modified classification layers: val_accuracy is very high (99.84%) and not far from the accuracy value (94.69%). This small difference indicates a good model, since too large a difference would cause underfitting, while an accuracy value higher than the validation accuracy would cause overfitting. Moreover, val_loss is very low (0.69%) and the loss value is 10.41%. This error or loss value is the smallest among the devices, which means that the model is ideal and suitable for prediction.
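The 83%/17% train/test split described above can be sketched with scikit-learn's `train_test_split` on placeholder data (in the paper the inputs would be the extracted face features and identity labels):

```python
# Divide the data into 83% training and 17% test, as in the paper's setup.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 512))  # 100 dummy feature vectors
labels = rng.integers(0, 3, size=100)   # 3 hypothetical identities

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.17, random_state=0)

print(len(x_train), len(x_test))  # prints: 83 17
```

A fixed `random_state` makes the split reproducible across training runs.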
The training results from start to finish are presented graphically. Figure 3 shows the facial image training results over 100 epochs with the modified classification. The graph shows that the model (with the last three layers modified) is good and well fitted, since the differences between the accuracy values are insignificant; likewise, the difference between val_loss and loss is relatively small, and the values stay close.
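Curves like those in Figure 3 can be produced from a Keras training history as sketched below. The `history` values here are synthetic placeholders that merely end at the paper's reported figures (94.69% accuracy, 99.84% val_accuracy, 10.41% loss, 0.69% val_loss); a real run would use `model.fit(...).history` over 100 epochs.

```python
# Plot accuracy/val_accuracy and loss/val_loss side by side, as in Figure 3.
import matplotlib
matplotlib.use("Agg")  # render without a display (e.g. on a server or Colab)
import matplotlib.pyplot as plt

history = {  # stand-in for model.fit(...).history
    "accuracy":     [0.50, 0.80, 0.9469],
    "val_accuracy": [0.60, 0.90, 0.9984],
    "loss":         [1.20, 0.40, 0.1041],
    "val_loss":     [0.90, 0.20, 0.0069],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history["accuracy"], label="accuracy")
ax1.plot(history["val_accuracy"], label="val_accuracy")
ax1.set_xlabel("epoch")
ax1.legend()
ax2.plot(history["loss"], label="loss")
ax2.plot(history["val_loss"], label="val_loss")
ax2.set_xlabel("epoch")
ax2.legend()
fig.savefig("training_curves.png")
```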

IV. Conclusion
This study applied a pre-trained model using the VGG-face architecture, modifying the last three layers, i.e., the classification section. The model achieves very high accuracy, and the resulting loss is very low, indicating that the model is good and ideal for prediction. The image data for training were obtained from three sources; among them, the best source is the digital camera, with accuracy = 94.69% and loss = 10.41%. Therefore, further research should focus on the quality of camera image sources to optimally improve classification performance.