Backpropagation Neural Network with Combination of Activation Functions for Inbound Traffic Prediction

Predicting network traffic is crucial for preventing congestion and achieving superior quality of network services. This research uses backpropagation to predict the inbound traffic level in order to understand and determine internet usage. The architecture consists of one input layer, two hidden layers, and one output layer.

The architecture of backpropagation neural networks remains a popular topic in neural network research, for instance in optimizing the number of hidden layers [16] [17] and, in the sensor domain, in modeling backpropagation based on a genetic algorithm (GA) by specifying the number of hidden-layer neurons [18]. Despite its slow training, a backpropagation neural network is easy to use and to design according to the input characteristics, whether the inputs are univariate or multivariate [19]. Thus, backpropagation is proposed to predict inbound traffic in order to understand and determine internet usage through the network. Three different activation functions are implemented, i.e., the sigmoid, ReLU, and tanh functions. They are applied both in single form and in combination, making up nine permutation models to optimize the weights between layers.
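The nine permutation models follow directly from assigning one of the three activation functions to each of the two hidden layers. A minimal Python sketch of this enumeration (the model names here are illustrative, not the paper's notation):

```python
from itertools import product

# Each of the two hidden layers takes one of three activation functions,
# yielding 3 x 3 = 9 permutation models.
activations = ["sigmoid", "relu", "tanh"]
models = [f"{first}-{second}" for first, second in product(activations, repeat=2)]
print(models)
```

The three "pure" models are those where both layers use the same function (sigmoid-sigmoid, relu-relu, tanh-tanh); the remaining six are the mixed-order combinations.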

A. Backpropagation Neural Network
The neural network is a reliable nonlinear technique for modeling a wide range of applications due to its architectural flexibility. A neural network architecture can have two or more layers. The neural network applied in this study is backward propagation of errors, or backpropagation, a supervised learning algorithm for multilayer perceptron neural networks. Backpropagation is essentially a gradient-descent method that optimizes the weights connecting adjacent layers among the input layer, hidden layer(s), and output layer. With the optimized weights, the errors between the observed data and the predictions can be minimized. Figure 1 shows a neural network with two hidden layers. This study uses this architecture since networks with two hidden layers are superior to those with one hidden layer [20]. The architecture is a dense network, meaning that each unit (node) in a layer is connected to all units in the neighboring layers. Each connection is associated with a weight (w) reflecting the strength of the connection between the units. Given the inputs x1, x2, …, xn, the hidden unit value (h) is determined by applying a weighted sum of all inputs plus a bias, as written in (1), while the output unit (y) is defined by (2) [21].
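The weighted-sum-plus-bias computation in (1) and (2) can be sketched as a small forward pass. The weights and inputs below are purely illustrative, assumed for the example rather than taken from the paper:

```python
import math

def dense_forward(inputs, weights, biases, activation):
    """One dense layer: for each unit, a weighted sum of all inputs
    plus a bias, passed through the activation function."""
    return [
        activation(sum(w * x for w, x in zip(unit_w, inputs)) + b)
        for unit_w, b in zip(weights, biases)
    ]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Toy pass through a 2-input -> 2-hidden -> 1-output dense network
# (weights chosen arbitrarily for illustration).
x = [0.5, 0.8]
h = dense_forward(x, [[0.1, 0.4], [0.3, -0.2]], [0.0, 0.1], sigmoid)  # eq. (1)
y = dense_forward(h, [[0.6, -0.5]], [0.05], sigmoid)                  # eq. (2)
print(h, y)
```

A second hidden layer, as in the Figure 1 architecture, would simply be another `dense_forward` call between `h` and `y`.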
In a backpropagation neural network, the signal first propagates forward from the input layer to the output layer through the hidden layers. After that, the error is calculated and propagates in the opposite direction, from the output layer back to the input layer through the hidden layers. Through this iterative training process, the neural network achieves the optimal weights and thresholds that reduce the error to the desired level [15]. The weight parameter is updated with the rate of change as in (3), where y denotes the observed data while ŷ is the predicted value. Furthermore, an activation function is applied to map the values to a nonlinear output (Figure 2). The sigmoid function, commonly used mainly for forecasting probability-based output, is expressed in (4). ReLU is another widely used function; it is nearly linear and preserves the properties of linear models, which makes it easy to optimize with the gradient-descent method [22], as given by (5). Besides these, the hyperbolic tangent, known as the tanh function, is a zero-centered function that provides better training performance for multilayer neural networks, formulated as in (6).
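The three activation functions in (4)–(6) have well-known closed forms: sigmoid(x) = 1/(1+e^(−x)), ReLU(x) = max(0, x), and tanh(x). A minimal Python sketch of these functions, together with the derivatives that the backward pass uses when applying the chain rule:

```python
import math

# Eq. (4): sigmoid squashes any real value into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Eq. (5): ReLU is the identity for positive inputs, zero otherwise.
def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0

# Eq. (6): tanh is zero-centered, mapping into (-1, 1).
def tanh(x):
    return math.tanh(x)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

print(sigmoid(0.0), relu(-2.0), tanh(0.0))  # 0.5 0.0 0.0
```

The zero-centered range of tanh and the non-saturating positive side of ReLU are what the discussion above attributes their training advantages to.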

B. Experimental Design
This paper collected time-series network inbound traffic data from a backbone network using CACTI and a traffic controller deployed at Mulawarman University, Indonesia. The series consisted of weekday inbound traffic recorded daily, ranging from 27 August 2019 to 17 February 2021.
Traffic data measured in bits/second were then normalized to a scale of 0 to 1 to avoid large magnitudes in the BPNN process. The study uses the first 80% of the series as training data and the remaining 20% as testing data. Figure 3 illustrates the research flow implemented in this paper, in which an activation function is applied to each hidden layer. Three activation functions were used, i.e., the sigmoid, ReLU, and tanh functions, designed in single form and in combination, making up nine permutation models in terms of order, i.e., pure sigmoid; pure ReLU; pure tanh; sigmoid-ReLU; sigmoid-tanh; ReLU-sigmoid; ReLU-tanh; tanh-sigmoid; and tanh-ReLU.
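The preprocessing described above (min-max normalization to [0, 1] followed by a chronological 80/20 split) can be sketched as follows; the traffic values are hypothetical placeholders, not the measured series:

```python
def min_max_normalize(series):
    """Scale traffic values (bits/second) to the range [0, 1]."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def train_test_split(series, train_frac=0.8):
    """Take the first 80% for training and the rest for testing,
    preserving the time order of the series (no shuffling)."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

traffic = [120e6, 250e6, 90e6, 310e6, 180e6]  # hypothetical bits/second
normalized = min_max_normalize(traffic)
train, test = train_test_split(normalized)
print(normalized, len(train), len(test))
```

Keeping the split chronological matters for time-series data: shuffling before splitting would leak future traffic into the training set.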
Furthermore, Table 1 lists the BPNN settings utilized. In order to conduct a comparative analysis of the learning rate, this paper was designed with three learning rates, i.e., 0.1, 0.5, and 0.9, reflecting a low, middle, and high rate, respectively.

C. Accuracy Metrics
For accuracy comparison, the mean square error (MSE) and root mean square error (RMSE) were used, as expressed in (7) and (8) respectively, where y denotes the observed data while ŷ is the predicted value. The smaller the value, the smaller the error.
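The two metrics in (7) and (8) follow their standard definitions: MSE is the mean of the squared residuals and RMSE is its square root. A minimal sketch:

```python
import math

def mse(y, y_hat):
    """Mean square error between observed (y) and predicted (ŷ) values, eq. (7)."""
    return sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean square error, eq. (8): the square root of the MSE."""
    return math.sqrt(mse(y, y_hat))

# Toy illustration on three observations (values are arbitrary).
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(mse(observed, predicted), rmse(observed, predicted))
```

Because RMSE is a monotone transform of MSE, both rank the nine models identically; RMSE is simply on the same scale as the (normalized) traffic values.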

III. Results and Discussions
Nine permutation models of activation functions were used in this paper with three different learning rates. Table 2 and Table 3 show the MSE and RMSE values for each model and learning rate based on the simulations performed. Overall, in terms of single activation functions, although the pure sigmoid function provided the smallest RMSE, the RMSE values obtained from the three models were not significantly different.
The results of the single-form activation functions are shown in Figure 4, Figure 5, and Figure 6 for sigmoid, ReLU, and tanh, respectively. The sigmoid-sigmoid and tanh-tanh configurations could not recognize the higher traffic patterns. In contrast, ReLU-ReLU performed better in terms of pattern recognition; although its RMSE was not the smallest, it was still smaller than that of the tanh function. When it comes to combining two different activation functions in the architecture, sigmoid could not recognize the high pattern properly, whether in single form or in combination, as presented in Figure 4, Figure 7, and Figure 8, unless it was mixed with ReLU, with ReLU placed in the first order, as shown in Figure 9.
ReLU remained powerful whether combined with the sigmoid or tanh function in the architecture, as long as it was implemented in the first order. The combinations with ReLU in the first order can be seen in Figure 9 and Figure 10.
In terms of order, the performance of the tanh function is similar to that of ReLU: the accuracy did not change significantly, whether in single form or in combination, as long as tanh was placed first and followed by another activation function, as illustrated in Figure 11 and Figure 12.

IV. Conclusion
A backpropagation neural network is applied to predict inbound traffic, designed with one input layer, two hidden layers, and one output layer, using three activation functions, i.e., the sigmoid function, the rectified linear unit (ReLU), and the hyperbolic tangent (tanh) function. The design uses both single and combination forms, producing nine permutations, with three learning rates, i.e., 0.1, 0.5, and 0.9, representing a low, middle, and high rate, respectively.
Based on the results, ReLU recognizes the inbound traffic pattern better than the sigmoid and tanh functions under the same architectures and parameters used in the analysis. Hence, an interesting conclusion is that, regarding the use of two different activation functions in a BPNN architecture, the selection of the first-order activation function is crucial for obtaining a superior prediction result, and the ReLU function is recommended in the initial order to catch the high pattern in the data. In addition, for predicting upper traffic utilization, the combination of a high learning rate with pure ReLU, ReLU-sigmoid, or ReLU-tanh is more suitable and recommended.
As for future work, it is recommended to optimize the architecture and parameters, particularly the number of neurons in the hidden layers and the learning rate. Nevertheless, overfitting and convergence problems may be encountered in the process, so the architecture, the order of activation functions, and the parameter values should be determined carefully.