Using autoregressive integrated moving average models in the analysis and forecasting of mobile network traffic data

The development of prediction models for mobile networks has been increasing in recent years, in response to ever-increasing volumes of customer traffic and the need to understand the characteristics of traffic patterns. This study evaluates the forecasting performance of Autoregressive Integrated Moving Average (ARIMA) models using empirical data measured from a live High Speed Downlink Packet Access (HSDPA) network of a telecommunication operator with coverage in the northern part of Ghana. To determine the best ARIMA model, a number of statistical analyses and tests were carried out. The models with the minimum information criteria values were selected: ARIMA (2,1,3), ARIMA (0,1,2) and ARIMA (1,1,1). A comparison of the actual and predicted traffic data shows that ARIMA (0,1,2) is the best model.


INTRODUCTION
The increasing demand for bandwidth due to the introduction of packet data services in telephone networks has challenged operators to upgrade their core networks with high capacity network solutions.
The prediction of network traffic plays an important role in the design, optimization and management of modern telecommunications networks (Yu et al., 2013; Klevecka and Lelis, 2009). With the introduction of High Speed Downlink Packet Access (HSDPA) (Anon, 2004), the available bandwidth has taken yet another leap and today's users can achieve theoretical upload/download speeds of up to 4.2 Mbit/s or 13.1 Mbit/s.
HSDPA is commonly known as 3.5G, as it offers no substantial upgrade to the feature set of Wideband Code Division Multiple Access (WCDMA), but improves on the speed of data transmission to enhance those services (Anon, 2004). WCDMA is a mobile technology that improves upon the capabilities of the GSM networks deployed around the world. It is generally referred to as 3G, or 3rd generation, technology, and it adds newer services such as video calling to the traditional voice call and text messaging features that are already standardised. Prior to the introduction of HSDPA, WCDMA networks were only capable of reaching speeds of 384 kbps. Although this might be sufficient for most services, customers always want faster speeds, especially when browsing the internet or downloading files. HSDPA allowed speeds above 384 kbps, the most notable of which are 3.6 Mbps and 7.2 Mbps, which telecommunications companies often advertise. HSDPA is capable of reaching much higher speeds depending on the type of modulation employed. The theoretical maximum speed of HSDPA is up to 84 Mbps (Anon, 2004).
In general, most time series data are found to exhibit non-stationary characteristics, and therefore a non-seasonal autoregressive integrated moving average (ARIMA) model is widely applied in forecasting such data (Namin and Namin, 2018; Ratnayaka et al., 2015).
One basic premise of ARIMA is its ability to model patterns, or combinations of patterns, that recur in time series data; determining and deducing these patterns can therefore help in forecasting (Marek et al., 2008; Haviluddin and Alfred, 2014).
ARIMA models have been described as the most appropriate type for examining these trends because of their ability to adjust to well-balanced statistical assumptions (Ratnayaka et al., 2014; Lee and Tong, 2012). In addition, ARIMA models are described as the most efficient and applicable for experimental data analysis when the assumptions of normality, linearity and stationarity are considered (Chen, 1994).
ARIMA models retain high predictive power with minimal error (Iqbal and Naveed, 2016). They are quite flexible in that they can represent several different types of time series; they also forecast accurately over a short period of time and are easy to implement (Khashei and Bijari, 2011).
A traditional prediction modelling technique that has enjoyed massive usage is ARIMA which is selected for this study owing to the non-stationary characteristics of the collected HSDPA traffic data set (Ratnayaka et al., 2014;Lee and Tong, 2012;Chen, 1994).
Research on the prediction of HSDPA traffic has been growing in recent years. For example, Lawal et al. (2016) applied a neural network ensemble to investigate the traffic pattern of HSDPA using 690 data points made up of aggregated hourly measurements and concluded that the proposed model predicts well. Abdulkarim and Lawal (2017) also applied the cooperative neural network method to develop a forecasting model for HSDPA traffic and user throughput data. Tan et al. (2008) examined the 3G network capacity and the throughput and delay performance of IP-based applications over 3G and HSDPA networks. Khan (2017) experimented with single site traffic data and countrywide traffic data from a 3G cellular network in Pakistan and concluded that ARIMA models perform better when the countrywide scenarios are considered, while exponential smoothing models give better performance for a single site scenario.
Previous studies on the statistical characteristics of 3G mobile traffic data have identified the presence of self-similarity (Yadav, 2011) and long range dependence (Zhou et al., 2006), and therefore suggested that linear models cannot give appropriate predictions. For instance, a study by Yu et al. (2013) identified multifractal characteristics in a 3G mobile network and applied a combined ARMA and FARIMA model to forecast the traffic. The results indicated that ARMA and FARIMA models are not capable of predicting 3G downlink traffic due to the inherent self-similarity and multifractal characteristics. Svoboda et al. (2008) applied four different methods, namely linear regression, exponential regression, and the two more sophisticated ARMA and Dynamic Harmonic Regression (DHR) models, to forecast packet switched traffic from live 3G networks. The results showed that the more sophisticated ARMA and DHR models deliver better performance.
In contrast, this study seeks to evaluate the performance of ARIMA models by using empirical data from an HSDPA network operator which covers the northern sector of Ghana.

RELATED WORKS
Several studies have been conducted on the prediction of traffic patterns in 3G networks (Zhang and Cuthberth, 2008). Tso et al. (2010) presented an empirical study on the performance of mobile HSPA networks in Hong Kong. In contrast, Paul et al. (2011) studied traffic dynamics in cellular data networks and focused purely on data traffic in the context of resource usage and subscriber behaviour. Yao et al. (2008) evaluated mobile bandwidth predictability for HSDPA networks using information theory and entropy, and concluded that the bandwidth uncertainty reduces drastically when observations from past trips are used to predict bandwidth. Mäder and Staehle (2006) studied the received interference in HSDPA and proposed a semi-analytical approach to calculate the spatial distribution of the received interference. Their method analysed the interference generated in the own cell and in the surrounding cells, including the load-dependent interference coming from the dedicated channel users.
In Svoboda et al. (2008), a large-scale cell based measurement analysis of user behaviour in a live operational HSDPA network was presented. Their research work concentrated on the statistical properties of users in cells in order to refine network planning procedures and to provide realistic traffic models for simulations of cellular packet-oriented networks. Their results suggested four different models, which show that conventional simulation settings can lead to an overestimation of performance.
Laner et al. (2012) measured 3G uplink delay in an operational HSPA network and showed that the average delay is strongly dependent on the packet size. They found that last mile delay constitutes a large fraction of measured delays.
In addition, El Bouchti and Haqiq (2011) studied, analytically and numerically, various performance parameters (loss probability, average delay of packets, average number of packets in the buffer) of control in HSDPA networks.
In Yerima and Al-Begain (2009), a dynamic buffer management scheme for HSDPA multimedia sessions with aggregated real-time and non-real-time flows was proposed. The end-to-end performance impact of the scheme was evaluated via extensive HSDPA simulations with an example multimedia session comprising a real-time streaming flow concurrent with a TCP-based non-real-time flow. Their results demonstrated that the scheme can guarantee the end-to-end quality of service (QoS) of the real-time streaming flow, whilst simultaneously protecting the non-real-time flow from starvation, resulting in improved end-to-end throughput performance.
In Elmokashfi et al. (2012), active measurements covering 90 voting locations over a period of 6 months were conducted to determine long-term delay using the data connections of 3 Norwegian 3G networks. They observed that the delay characteristics of the different operators are very different, and that operator-specific network design and configurations are the most important factors for delays. Their study, however, concentrated on the round-trip time (RTT) of 3G traffic data.

METHODOLOGY
This section gives a brief description of the methods applied in developing the model for the HSDPA daily traffic.

ARIMA (p, d, q) model for HSDPA daily traffic
An ARIMA model takes historical data and consists of three parts: an autoregressive (AR) process which maintains memory of past events, an integrated (I) process which makes the data stationary for ease of forecasting, and a moving average (MA) process which maintains memory of past forecast errors.

Autoregressive model
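The autoregressive model of order p with zero mean, denoted as AR(p), is given in Equation 1 (Yuan et al., 2016):
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t \qquad (1)$$
where $x_t$ is the time series, $\phi_1, \phi_2, \dots, \phi_p$ ($\phi_p \neq 0$) are model parameters and $\varepsilon_t$ is white noise with $\varepsilon_t \sim N(0, \sigma^2)$. The autoregressive operator is given in Equation 2 (Shumway and Stoffer, 2006):
$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p \qquad (2)$$
where B is the backward operator.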

Moving average model
In moving average (MA) modelling, lagged forecast errors are used to improve the most recent forecast. An MA(1) term uses the forecast error of the most recent period, while an MA(2) term relies on the forecast errors of the two most recent periods; the same applies to higher order terms (Iqbal and Naveed, 2016).
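The moving average model of order q, MA(q), is provided in Equation 3:
$$x_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \qquad (3)$$
where $x_t$ is the time series, $q$ is the number of lags in the moving average, $\theta_1, \theta_2, \dots, \theta_q$ ($\theta_q \neq 0$) are model parameters and $\varepsilon_t$ is white noise with $\varepsilon_t \sim N(0, \sigma^2)$. The non-seasonal moving average operator is given in Equation 4 (Shumway and Stoffer, 2006):
$$\theta(B) = 1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q \qquad (4)$$
where B is the backward operator.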

Autoregressive moving average model
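The combination of the non-seasonal autoregressive and moving average models is called the autoregressive moving average (ARMA) model. The ARMA model is given in Equation 5 (Ratnayaka et al., 2015):
$$x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \qquad (5)$$
Otherwise written as:
$$x_t - \phi_1 x_{t-1} - \dots - \phi_p x_{t-p} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} \qquad (6)$$
The ARMA (p, q) model can also be written as in Equation 7 (Shumway and Stoffer, 2006):
$$\phi(B) x_t = \theta(B) \varepsilon_t \qquad (7)$$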
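The ARMA recursion can be illustrated with a short simulation. The following is a minimal sketch rather than the procedure used in the study: the function name `simulate_arma`, the coefficient values and the choice of setting pre-sample values to zero are all assumptions made for illustration, and the moving average terms are taken to enter with a plus sign.

```python
import random


def simulate_arma(phi, theta, n, sigma=1.0, seed=0):
    """Simulate n points of a zero-mean ARMA(p, q) process.

    phi   -- AR coefficients [phi_1, ..., phi_p]
    theta -- MA coefficients [theta_1, ..., theta_q] (plus-sign convention)
    Values before the start of the sample are treated as zero (an assumption).
    """
    rng = random.Random(seed)
    x, eps = [], []
    for t in range(n):
        e_t = rng.gauss(0.0, sigma)  # white noise innovation
        # AR part: weighted sum of the p most recent observations
        ar_part = sum(phi[i] * x[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        # MA part: weighted sum of the q most recent innovations
        ma_part = sum(theta[j] * eps[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        x.append(ar_part + e_t + ma_part)
        eps.append(e_t)
    return x, eps


# Illustrative coefficients only; they are not estimates from the HSDPA data.
series, shocks = simulate_arma(phi=[0.5], theta=[0.3], n=200, seed=42)
print(len(series), series[:3])
```

Setting `theta=[]` reduces the sketch to a pure AR(p) process and `phi=[]` to a pure MA(q) process.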

First order difference equation
In developing the model for the HSDPA traffic data, the series must be stationary. This study applied the differencing approach using Equation 8:
$$\nabla x_t = x_t - x_{t-1} = (1 - B) x_t \qquad (8)$$
To confirm whether or not the series is stationary, unit root tests were conducted using the Augmented Dickey-Fuller (ADF) test (Bhandari et al., 2018), the Phillips-Perron (PP) test (Phillips and Perron, 1988) and the Elliot-Rothenberg-Stock (ERS) test (Elliot et al., 1996), respectively.
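First order differencing replaces each observation with its change from the previous one. The sketch below illustrates this in plain Python; the helper names `difference` and `undifference` and the sample values are hypothetical and are not drawn from the HSDPA data set. The unit root tests themselves are not reproduced here.

```python
def difference(series, d=1):
    """Apply the first order difference x_t - x_{t-1} to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


def undifference(diffs, first_value):
    """Invert one round of differencing, given the first original value."""
    out = [first_value]
    for step in diffs:
        out.append(out[-1] + step)
    return out


traffic = [10, 12, 15, 15, 11]  # made-up values for illustration
d1 = difference(traffic)
print(d1)
print(undifference(d1, traffic[0]) == traffic)
```

Each round of differencing shortens the series by one observation, which is why forecasts made on the differenced scale have to be cumulated back onto the last observed level.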

Autoregressive integrated moving average
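The ARIMA (p, d, q) model can be written mathematically as in Equation 9:
$$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d x_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t \qquad (9)$$
which is also written as (Bhandari et al., 2018):
$$\phi(B)(1 - B)^d x_t = \theta(B)\varepsilon_t \qquad (10)$$
where $\phi(B)$ and $\theta(B)$ are the $p$th and $q$th degree polynomials given in Equations 2 and 4, B is the backward shift operator, d is the order of differencing, $x_t$ is the time series and $\varepsilon_t$ is the innovation of the original time series, and p, d and q are the orders of the non-seasonal AR, differencing and MA components, respectively.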
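As a worked illustration (not taken from the paper), consider the ARIMA (0,1,2) specification. The expansion below assumes no constant term and the sign convention $\theta(B) = 1 + \theta_1 B + \theta_2 B^2$, with $x_t$ the series, $\varepsilon_t$ the white noise innovation and $\hat{x}_{t+h}$ the forecast made at time $t$ for horizon $h$. Under these assumptions the future innovations have zero expectation, which gives the forecast recursion directly.

```latex
\begin{align*}
(1 - B)\,x_t &= (1 + \theta_1 B + \theta_2 B^2)\,\varepsilon_t \\
x_t &= x_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} \\
\hat{x}_{t+1} &= x_t + \theta_1 \varepsilon_t + \theta_2 \varepsilon_{t-1} \\
\hat{x}_{t+2} &= \hat{x}_{t+1} + \theta_2 \varepsilon_t \\
\hat{x}_{t+h} &= \hat{x}_{t+2}, \qquad h \ge 3
\end{align*}
```

The point forecast of such a model therefore becomes flat after two steps, which is consistent with the general observation that ARIMA models are best suited to short horizons.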
The next step in the modelling process is to determine the order of the AR and MA in model specification by applying the autocorrelation function (ACF) and partial autocorrelation function (PACF) approach as shown in Equations 11 and 12.
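The estimated ACF is given as (Shuona and Biqing, 2014):
$$\hat{\rho}_k = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2} \qquad (11)$$
where $\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t$. The PACF is given as (Shuona and Biqing, 2014):
$$\hat{\phi}_{k,k} = \frac{\hat{\rho}_k - \sum_{j=1}^{k-1}\hat{\phi}_{k-1,j}\,\hat{\rho}_{k-j}}{1 - \sum_{j=1}^{k-1}\hat{\phi}_{k-1,j}\,\hat{\rho}_j} \qquad (12)$$
where $\hat{\phi}_{k,j} = \hat{\phi}_{k-1,j} - \hat{\phi}_{k,k}\,\hat{\phi}_{k-1,k-j}$ for $j = 1, 2, \dots, k-1$.
The next phase is the parameter estimation of the model in Equation 9. In this study the least squares estimation approach was adopted and the best models were selected using t-statistics and p-values. Diagnostic checks for adequacy were then carried out on the selected model by analysing the residual plots, and finally the model was used for forecasting.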
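The sample ACF and PACF can be computed directly from their definitions. The sketch below is an illustration under stated assumptions, not the EViews or R routine used in the study: it uses the standard sample autocorrelation with the full-sample variance in the denominator, and the Durbin-Levinson recursion for the partial autocorrelations. The function names and the toy input are hypothetical.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def sample_pacf(x, max_lag):
    """Partial autocorrelations phi_kk for k = 0 .. max_lag (Durbin-Levinson)."""
    rho = sample_acf(x, max_lag)
    pacf = [1.0]
    prev = []  # holds phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j - 1] * rho[k - j] for j in range(1, k))
        den = 1.0 - sum(prev[j - 1] * rho[j] for j in range(1, k))
        phi_kk = num / den
        # Update the lower-order coefficients: phi_{k,j} for j = 1 .. k-1
        curr = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
        curr.append(phi_kk)
        pacf.append(phi_kk)
        prev = curr
    return pacf


toy = [1, 2, 3, 4, 5]  # toy input, not HSDPA data
print(sample_acf(toy, 2))
print(sample_pacf(toy, 2))
```

In order identification, a PACF that cuts off after lag p points towards an AR(p) component, while an ACF that cuts off after lag q points towards an MA(q) component.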

Data collection
The data for this study is a primary source of daily HSDPA traffic collected from a telecommunications network operator at the Radio Network Controller (RNC) level covering the northern sector of Ghana. The number of NodeBs used in the experiment was 191. The data was collected daily from March 2015 to February 2017. The EViews and R software packages were employed for the analysis.
Figure 1 exhibits non-stationarity in the HSDPA daily traffic data. Figure 2 shows the histogram and the statistical description of the HSDPA daily traffic data. The data is positively skewed with a maximum value of 540684.2 kbps and a minimum value of 27988.12 kbps. The ACF plot, PACF plot and Q-statistic values of the HSDPA traffic for 36 lags are shown in Figure 3. The ACF plot exhibits a slow exponential decay, while the PACF plot dies down at lag 3 with no significant spikes thereafter. From the plots and Q-statistic values, it is clear that the series is not stationary and therefore must be differenced.
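The Q-statistic reported alongside a correlogram can be sketched as follows. This assumes the Ljung-Box form, $Q = n(n+2)\sum_{k=1}^{m} \hat{\rho}_k^2 / (n-k)$; the paper does not state which variant was used, so the choice of formula, the function name and the toy input are assumptions.

```python
def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic using sample autocorrelations up to max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k
        rho_k = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


toy = [1, 2, 3, 4, 5]  # toy input, not HSDPA data
print(ljung_box_q(toy, 2))
```

A large Q relative to the chi-square distribution with `max_lag` degrees of freedom indicates that the autocorrelations are jointly different from zero, which for a raw series is consistent with the slowly decaying ACF described above.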
From Table 1, the ADF test statistic is less than the critical value; therefore the data is not stationary.
From the Phillips-Perron method in Table 2, the test with a constant at the 1% level does not clearly indicate stationarity or non-stationarity, with the absolute values of both the critical value and the test statistic being approximately equal to 3.40. However, at the 5% and 10% levels, the test statistic is greater than the critical value, which suggests stationarity. From the analysis of trend stationarity with the PP test, it is observed that the traffic data is not stationary at the 1, 5 and 10% levels. The test results for both constant and trend at the 1, 5 and 10% levels showed that the traffic data is stationary since the test statistic is greater than the critical values. Because the PP test produced conflicting results, the ERS unit root test was conducted for constant and for both constant and trend. From Table 3, the ERS test confirmed that the traffic data is not stationary for the constant case. The null hypothesis of a unit root was accepted at the 1% level, while it was rejected at the 5 and 10% levels.
Based on the results obtained from the stationarity tests, the data was differenced using Equation 8; the result, shown in Figure 4, suggests stationarity.
From Table 4, the ADF test statistic is greater than the critical values at all three levels; therefore the data can be confirmed to be stationary.
The stationarity test results using the PP method also confirmed that the traffic data is stationary, since the test statistic values are greater than the critical values, as illustrated in Table 5.
To further confirm the result, the ERS approach was applied to the differenced traffic data and, from Table 6, the data is shown to be stationary.

Parameter estimation
Using the method of least squares estimation, the parameters for the selected models are shown in Tables 7, 8 and 9. From the analysis, ARIMA (2,1,3) is not statistically significant, since the absolute t-value for AR(1) is less than 2; it is therefore rejected. However, ARIMA (0,1,2) and ARIMA (1,1,1) were found to be statistically significant, with absolute t-statistic values greater than 2.
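The significance rule used here can be written out explicitly. In the sketch below the coefficient and standard error pairs are hypothetical and are not the values in Tables 7, 8 and 9; the threshold of 2 is the rule of thumb stated in the text, which roughly corresponds to a 5% two-sided test in large samples.

```python
def t_statistic(estimate, std_error):
    """Ratio of a coefficient estimate to its standard error."""
    return estimate / std_error


def is_significant(estimate, std_error, threshold=2.0):
    """Rule of thumb: a coefficient is kept when |t| exceeds the threshold."""
    return abs(t_statistic(estimate, std_error)) > threshold


# Hypothetical (coefficient, standard error) pairs for illustration only.
example = {"MA(1)": (-0.60, 0.04), "MA(2)": (-0.05, 0.04)}
for name, (coef, se) in example.items():
    print(name, round(t_statistic(coef, se), 2), is_significant(coef, se))
```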
From Table 10, the differences in the AIC, AICc and BIC values of the three models are negligible; however, using the principle of parsimony, the ARIMA (0,1,2) model is selected.
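The selection step can be sketched in code. The formulas below are the textbook definitions of AIC, AICc and BIC in terms of the maximised log-likelihood; the exact parameter count used by EViews or R (for example, whether the innovation variance is counted) may differ, and the log-likelihood values, the sample size and the tolerance `tol` are hypothetical. The helper `pick_parsimonious` is one possible reading of "principle of parsimony": among models whose AIC is within `tol` of the best, prefer the one with the fewest parameters, breaking ties by AIC.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return AIC, AICc and BIC for a model with k parameters and n observations."""
    aic = -2.0 * log_likelihood + 2.0 * k
    aicc = aic + (2.0 * k * (k + 1)) / (n - k - 1)
    bic = -2.0 * log_likelihood + k * math.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def pick_parsimonious(candidates, n, tol=2.0):
    """candidates maps a model name to (log_likelihood, number_of_parameters)."""
    scored = {
        name: (information_criteria(ll, k, n)["AIC"], k)
        for name, (ll, k) in candidates.items()
    }
    best_aic = min(aic for aic, _ in scored.values())
    # Keep models whose AIC is close to the best, then sort by (k, AIC)
    close = {name: (k, aic) for name, (aic, k) in scored.items() if aic - best_aic <= tol}
    return min(close, key=lambda name: close[name])


# Hypothetical log-likelihoods; not the values behind Table 10.
models = {
    "ARIMA(2,1,3)": (-100.0, 5),
    "ARIMA(0,1,2)": (-103.2, 2),
    "ARIMA(1,1,1)": (-103.5, 2),
}
print(pick_parsimonious(models, n=730))
```

Note that ARIMA (0,1,2) and ARIMA (1,1,1) have the same number of coefficients, so parsimony alone separates them only from ARIMA (2,1,3); the tie between the two smaller models has to be broken by the criteria values or by forecast accuracy.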

CONCLUSION
The study has developed ARIMA models to forecast the daily HSDPA network traffic with coverage in the northern sector of Ghana. In achieving these results, the models with the minimum information criteria values were selected: ARIMA (2,1,3), ARIMA (0,1,2) and ARIMA (1,1,1). The best model selected out of the three competing models is ARIMA (0,1,2). The model was validated with actual HSDPA daily traffic data and the forecasting results indicated a minor deviation. The model was also used to predict out-of-sample traffic data. The approach proposed in this study could be helpful to mobile operators in planning and maintaining their HSDPA networks.

Figure 2. Histogram and statistics of HSDPA daily traffic data.

Figure 4. The first order differenced daily HSDPA traffic data plot.

Figure 6 shows the standardised residuals. The standardised residuals show zero mean and few outliers. The ACF plot in Figure 7 exhibits no evidence of significant correlation in the residuals at any positive lag except at lags 7 and 28. The correlations at these two lags are, however, not significant; as such, the model is adequate.

Figure 8. Graph of actual against forecast data.

Figure 9. Out-of-sample forecast of traffic data.

Table 1. Test for stationarity of HSDPA daily traffic data using ADF approach.

Table 2. Test for stationarity of HSDPA daily traffic data using Phillips-Perron approach.

Table 3. Test for stationarity of HSDPA daily traffic data using Elliot-Rothenberg-Stock (ERS) approach.

Table 4. Test for stationarity of differenced HSDPA traffic data using ADF approach.

Table 5. Test for stationarity of differenced HSDPA traffic data using Phillips-Perron approach.

Table 10. AIC, AICc and BIC of suggested models.