Comparison of Indonesian Imports Forecasting by Limited Period using SARIMA Method

Indonesia is a country with rapid economic growth. Good economic growth is one of the national benchmarks capable of giving welfare to its people. Economic growth in Indonesia, especially for international trade in exports and imports, is one of the largest. Import is an activity to enter goods from another country into the Indonesian customs area. Import has three types of materials that are often needed by Indonesian society such as, consumption goods, raw materials, and capital goods.

The development of Indonesia's imports fluctuate over years. Inability to anticipate such rapid changes can cause economic slump due to inappropriate policy. For instance, recent years imports in rice led to the extermination of rice reserves. The reason is to maintain the market price of rice in Indonesia. To overcome these changes, forecasting the amount of imports should assist the Government in determining the optimum policy. This can be done by utilizing an algorithm to forecast time series data, in this case the amount of imports in the next few months with a high degree of accuracy. This study uses data obtained from the official website of the Indonesian Ministry of Trade. Then, Seasonal Autoregressive Integrated Moving Average (SARIMA) method is applied to forecast the imports. This method is suitable for the interconnected dependent variables, as well as in forecasting seasonal data patterns. The results of the experiment showed that 6-period forecast is the most accurate results compared to forecasting by 16 and 24 periods. The research resulted in the best model, that is ARIMA (0, 1, 3)(0, 1, 1) 12 produces forecasting with a MAPE value of 7.210 % or an accuracy rate of 92.790 %. By applying this imports forecast model, the government can have a forward strategic plans such as selectively imports products and carefully decide the amount of the incoming products to Indonesia. Hence, it could maintain or improve the economic condition where local businesses can grow confidently.
These inconsistent import developments can be anticipated by forecasting imports in the future periods. By using the assistance of a forecasting method, the result of forecasting can then be used by the Government as a consideration material to take a new policy or step in reducing the outcome of imports so that the economy in Indonesia is better. The main thing to note in forecasting is the level of accuracy of the methods done.
Several research to forecast import result has been that is, Forecasting Iron Ore Import and Consumption of China Using Grey Model Optimized by Particle Swarm Optimization Algorithm [1]. This research concluded that proposed hybrid-model performs better than the results obtained by a single method such as basic GM(1,1), PSO-GM(1,1), or rolling GM(1,1). The PSO-rolling GM(1,1) approach to modeling iron ore imports and consumption in China is both reliable and efficient. The prediction accuracies of the proposed model for imports and consumption have reached 3.2 % and 2.3 % respectively. Then some research that has been done for forecasting using the ARIMA method has been carried out, among others Identifying an Appropriate Forecasting Model For Forecasting Total Import of Bangladesh [2]. The research produced the best model, which is ARIMA(0,1,1)(1,0,0) 12 with an MSE value of 15747374 and MAPE value of 22.97802 %. Next Forecasting International Tourism Demand in Malaysia Using Box Jenkins SARIMA Application [3]. The research produced the best model, which is ARIMA(1,0,1) model with RMSE value is 0.2914, MAE value is 0.2075 and MAPE value of 1.4319 %.
SARIMA is a development of ARIMA models that have seasonal patterns in their data. ARIMA is one of the forecasting models that fully ignores independent variables and uses dependent variables where data is interconnected. The advantages of the ARIMA method are being able to produce highly accurate forecasting in forecasting short-term, flexible and can represent a wide range of time series characters occurring in the short term, and can analyze random, trending, and seasonal data situations. Especially for data that has seasonal patterns such as Indonesian import data, the exact method is Seasonal Autoregressive Integrated Moving Average (SARIMA).
Based on the problem in the import trade, this research will use SARIMA method to forecast Indonesian imports. The SARIMA method is chosen because it is capable of predicting time series data and generating high levels of accuracy for short-term forecasting. So by using SARIMA method is expected to produce good forecasting and become a step to the development of innovation and the establishment of a strategic plan in determining the ledge to reduce the outcome of imports.

II. Materials and Methods
The research is divided into 5 main phases (shown in Figure 1), namely data collection, preprocessing, model candidate determination, model assessment and evaluation, and best model determination.

A. Data Collection
The dataset used in this research is sourced from the official website kemendag.go.id. The Website of Ministry of Trade of the Republic of Indonesia (KEMENDAGRI) is a site that provides various information about trading in Indonesia, such as the development of exports and imports, trade balance, foreign exchange rate against rupiah, inflation, and other trading activities. It contains 211 Indonesia's import data from 1996 until July 2019. The dataset has 5 attributes, which is Year, Total, Consumption Goods, Raw Material Support, and Capital Goods.

B. Preprocessing 1) Attribute Removal
Attribute Removal isa trivial process by eliminating or removing unused attributes for the forecasting process. The original data consists of time-series occasions of imports in Indonesia. However, forecasting import only requires the time attribute as the independent variable (x-axis) and import amount as the target output (y-axis), the remaining attributes were removed from the dataset. The removal process was done manually via spreadsheet application. In addition, the resulting dataset was converted to dd-mm-yyyy format. Then, the order of the dataset was reversed to an increasing time order.

2) Stationary Test
Stationarity of a data means that the statistical attributes of the time-series data has not change over time. One can illustrate that there is a constant progression of the graph. It is similar to a linear model, but not a constant one. As time progresses, the linear function constantly changes. It has a constant slope, a value representing the rate of change. So, time series with seasonal occasions or trends are not stationary. In contrast, a stationary time series contains no foreseen patterns in the long-term. In this case, forecasting becomes impossible because wherever any point one observed, there exist relatively the same values.
A stationary test is performed to determine whether the data is stationary or not [4]. Stationary tests can be performed in two ways. The first way is by viewing the graph of dataset, if the graph fits to a straight line or the average of a chart is close to zero then the dataset is already stationary. The second way is to see the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) plots on the dataset. The ACF plot is used to measure the comparison between time series data and time lag. PACF plot is used to measure the amount between a variable and the time lag. If the plot of ACF and PACF display a change in the value between lag which is evident in the form of cut off and dies down then the dataset is stationary.

3) Differencing
Differencing is a technique to make the time-series data stationary as a requirement for the SARIMA model. So, differencing only applies for the non-stationary time-series data. Differencing removes the series dependencies on time, this includes structures like trends and seasonality. A non stationary time series would not be suitable to be forecasted. Therefore, differencing is done by calculating the change or difference between the subsequent observations. The value of difference obtained is checked back whether it is stationary, otherwise, it will repeat the process. Equation (1) shows a formulate for the differenciation process between Yt and Yt-1 [5].
= − (1) More sequence differences are calculated in the same way. For example, the second sequence difference (d = 2) is only expanded to include the second lag of the series, as follows [5].
The number of differencing processes that have been done will determine the order of the coefficient d which is then used to determine the candidate models such as Auto Regressive (AR) and Moving Average (MA).

C. SARIMA
Seasonal Autoregressive Integrated Moving Average (SARIMA) is a development of the ARIMA model that has seasonal patterns in their data. Seasonal patterns are patterns that experience a loop at each season, such as Weekly, Monthly, quarterly, yearly, and so on. ARIMA is a method developed by George Box and Gwilyn Jenkins in 1970 and commonly referred to as the Box-Jenkins Method [6] [7]. ARIMA is one of the models used in time-series forecasting and its accuracy is recognisable for the short-term forecasting. ARIMA is a forecasting model that fully ignores independent variables and uses dependent variables where data is interconnected and has some assumptions that must be fulfilled such as autocorrelation, trend, or seasonal [8]. ARIMA uses its previous data values to produce accurate short-term forecasting. In the SARIMA model (p,d,q)(P,D,Q)S, parameters p and P indicates non-seasonal AR values and seasonal AR values, the parameters q and Q indicates non-seasonal MA values and seasonal MA values, and parameter d indicates the differencing process in non-seasonal data for D in seasonal data [9]. The ARIMA method is divided into 4 groups, namely AR, MA, ARMA and ARIMA. Inserting a SARIMA model into data involves the following things four-step recurring cycle: (a) Identification of SARIMA structure (p, D, Q) (P, D, Q); (b) estimate the unknown parameters; (c) Perform tests on residual estimates; (d) Forecasting future results based on known data [10]. The SARIMA method is defined as data that has a repeating pattern within a fixed period of time. Since there are seasonal patterns, the models used by the mathematical ARIMA are ARIMA (p, D, Q) (P, D, Q) S with the formulae models (3) [11].

D. Model Candidate Determination
The SARIMA model in the study has three orders namely p, d, and q for non-seasonal data while the three orders P, D, and Q are for seasonal data and the S order for the frequency of data used. The determination of SARIMA order candidates can be done by analyzing the plot of Autocorrelation function (ACF) and Partial Autocorrelation Function (PACF). The ACF Plot is used to measure the correlation between time series data and time-lag. The ACF Plot is used to indicate an Autoregressive (AR) or order (p, p) value. The PACF Plot is used to measure the amount of correlation between variables and time-lag after removing the linear dependency that is at the bottom lag. The PACF Plot is used to indicate the value of Moving Average (MA) or order (q, Q). The order value (d, D) is determined by the number of differencing processes performed in stationary data changes. As for the S order is determined by looking at the frequency of data used, weekly, monthly, yearly and others.
The value of the order can be seen from the results of the plot ACF and the plot of PACF with the existence of dies down and cut off. The dies down pattern occur when the data decreases to close to a value of 0 slowly. While the cut off pattern occurs when the data is approaching a value of 0 at the initial lag or visible patterns of images have drastically decreased. Determining order value based on ACF and PACF plot conditions can be seen in Table 1.
After the ACF and PACF plotting on each dataset, a white noise test is performed to determine if there is a residual between lag with the Ljung-Box model. If the resulting p-value is greater than α = 0.05 then the value meets the criteria of the white noise test. The Ljung-Box formula as follows [12].
After the ACF and PACF plotting on each dataset, a white noise test is performed to determine if there is a residual between lag with the Ljung-Box model. If the resulting p-value is greater than α = 0.05 then the value meets the criteria of the white noise test. The Ljung-Box formula as follows [13].
The smallest AIC value is a candidate for the selected SARIMA model for forecasting process.

E. Testing Model for Prediction and Evaluation
After obtaining SARIMA model candidates, the next step is to test each models. The testing process is divided into two stages: forecasting and evaluation. The models built in this experiment forecast imports in different periods: 6 (six) periods, 12 periods and 24 periods. Then, the evaluation calculates the error rate of each forecasting model using Mean Absolute Percentage Error (MAPE) [14]. MAPE is an alternative method used to measure the accuracy level of a forecasting model in a percentage unit (fraction). MAPE is an average of the overall percentage of error results from actual data and forecasting data. A low MAPE value indicates the resulting value is approaching its actual value. The MAPE formulae is shown in (6) [15].
The test will be conducted using trial and error method, import dataset amounting to 211 data will be divided into training dataset and testing dataset (shown in Table 2).

F. Best Model Determination
MAPE scores of all forecasting model candidates act as the selection criteria of the forecasting model. The best model is the one with low MAPE scores (error rate) or high accuracy.

A. Preprocessing 1) Attribute Removal
At this stage, only two out of five attributes are used: the year and the total. While the Consumption Goods, Raw Material Support, Capital Goods attributes are dismissed in forecasting process. Such removal process is trivial but the choice of attributes for selection was based on the forecasting target: the amount of yearly imports.
Regarding the year attribute, reordering of the year was done to ensure it is consistent with the time-series x-axis. In addition, the year attribute needs reformatting from mm-yyyy to dd-mm-yyyy. A peek of final import dataset can be seen in Table 3.

2) Stationary Test
A stationary test can be done in two ways, first by looking at the original data graph plot or viewing the graphic plot of the ACF data. FIgure 2 indicates that the data is not stationary because the graph do not fit to a straight line. From the image, it appears that the ACF shows a value that exceeds the line at the initial lags and decreases very slowly. From both tests, it can be ensured that the data has not been stationary.

3) Differencing
The next step is to differencing the data by using the diff () function of timeseries package in R. Differencing process is done once and Figure 3 shows the resulting ACF plot. From the graph in Figure 3, the import dataset is now in a stationary form. The ACF plot indicates the presence of significant changes showed by the boxed values but there is a repetition of seasonal patterns or patterns occurring. On the ACF plot, the seasonal pattern occurs nearly by the increment of 12, so the S value used is S = 12. Then, by re-differencing the seasonal lag to determine the candidate order model on the seasonal pattern. From the data graph and the ACF/PACF plot in Figure 4, the data has been changed to stationary. The data graph shows a straight chart at a value of 0 in the middle. It shows that the data is stationary and the ACF plot is also subjected to significant changes and does not exceed the line limit.

B. Model Candidate Determination
The next stage is the determination of the candidate order model p, q, P and Q via observation to the ACF and PACF plots. Based on Figure 4, the dataset graph shows a straight chart at a value of 0 in the middle. It shows the data is stationary and the ACF plot has also undergone significant changes. Determining candidate order models that do not have a seasonal pattern is done by looking at the initial lag (lag 1, 2, 3, and so on) while determining the candidate model on the data that has a seasonal pattern seen at lag 12, 24 and 36. Meanwhile, both seasonal and non-seasonal data only needs one differencing process, thus, the D value is 1.

1) White Noise Test
Assuming white noise is met when it meets the criteria, i.e. when p-value resulting from the Ljung-Box process is greater than α = 0.05 then the value meets the criteria [16]. The import dataset has a pvalue value that can be seen in Table 4. Both datasets have a p-value that exceeds α = 0.05, so white noise assumptions are fulfilled. To display the p-value value of Ljung-box using the box test() function in the Rstudio application.

2) Akaike's Information Criterion (AIC)
The best models are models that have the smallest AIC value of all existing model candidates [17]. Comparison table of values on each candidate model can be seen in Table 5. From both stages of selection of the best models, it can be concluded that the best ARIMA models are ARIMA (0,1,3)(0,1,1) 12 , because each dataset has the smallest AIC value. Table 4. Import dataset of ARIMA model candidates' Ljung-box value

ARIMA Model
Ljung-Box

C. Testing Model for Forecasting
Testing was conducted with the training and test data specified in Table 2. The testing process is divided into two, i.e. forecasting and calculates error rate forecasting results. Forecasting is conducted to obtain forecasting of import results in several periods. After the forecasting results, then the calculation of error rate of each forecasting using MAPE. As for the calculation of MAPE using (10) and assisted by the MAPE() function of package MLmetrics on RStudio applications.

D. Evaluation of Forecasting Results
After testing each model for prediction and getting the predicted result for the current period, the next step is to calculate the prediction error rate using MAPE. The final result obtained for each model can be seen in Table 6 for 6-period forecast model, Table 7 for 12-period forecast model, and Table 8 for 24-period forecast model The result of forecasting on the import dataset with 6 periods resulted in ARIMA(0,1,3)(0,1,1) 12 as the best model with MAPE value of 7.516 % or an accuracy rate of 92.79 %. The result of forecasting on the import dataset with 12 periods resulted in a different model with the previous period of 6 periods. In this period resulted in the best two models because it produces the same MAPE model ARIMA(1,1,0)(0,1,1) 12 and ARIMA(2,1,0)(0,1,1) 12 with MAPE value of 16.029 % or an accuracy rate of 83.971 %.
From these experiments, SARIMA is superior to short-term forecasting, this result is consistent with previous research [18] stating that the more periods produced from the dataset can lead to a higher accuracy of forecasting. Therefore, forecasting import in 6 periods (month) produced more set for forecasting, thus, a better accuracy