Sometimes it is important for a venture to peak into the future to understand how much volume foot traffic or how many units of products they can expect to sell. This could help better prepare for the future by scaling up sales and customer services team to adequately prepare for likely sales over the course of the next few years/months and ensuring there are demand side resources to handle projected growth.
Like our trusted guide, Mr. Knuckles, in the above GIF, we can use historical data points to help us understand not just the present, but the future- helping guide us into the unknown. When we try and make such projections based only on time and values this is known as a time series analysis. We first look for the general trend in our time series to understand whether there has been an increase or decrease in values over time.
Secondly, we look at cyclical fluctuations. You can think of these as patterns of changes that are part of the business cycle. The duration of the business cycle and its nature is dependent on the industry of the business.
We thirdly evaluate the seasonal trends in our time series, these are oscillations that occur daily, month or year- depending on the nature of the forecast. If we are forecasting sales we may for example expect less sales in winter and for sales peaking in summer, these are the seasonal trends of our sales cycle and will typically be repeated from year to year. We would, of course, ideally want the values to be higher and higher each year so you can hit those fancy year-on-year growth numbers and brag about them to other entrepreneurs.
Lastly when the seasonality, trend and cyclical fluctuations are removed from our time series, we are left with the residual effects or irregularities. You can think of these as unpredictable events resulting in outlier highs or lows. By their nature these are unsystematic and have no clear patterns but need to be factored into our forecasting model.
For this analysis I chose a dataset on Kaggle on customer flows between 2012 and 2018. Customer flow in this context refers to the number of customers that walked into a department store in China (the store opened in 2010) per day over the course of the defined period. My aim is to forecast passenger flows over the next 3 years. This would presumably help the store best prepare for the next 3 years.
I chose to use a regression model — Autoregressive Integrated Moving Average (ARIMA) because of the relative simplicity of the model, its flexibility and its ability to perform relatively well in forecasting tasks.
What is ARIMA
Let’s start off by breaking apart the acronym to its individual components, AR, I and MA. The AR part simply speaks about autocorrelation– using the past to explain current data. We assume that customer flows are dependent, atleast to a certain extent, on previous customer flows, the traffic you got the previous day will be somewhat similar to the customer traffic you receive today. With autocorrelation we are trying to find out how many lags/periods of time best predict our current numbers. We call this a lag. So for example, if we find that the optimal lag is 4 days looking at the values of the previous 4 days best tells us what values to expect today.
The ‘I’ refers to using differencing to attain stationarity. Stationarity means the statistical properties of the time series such as the median, variance and correlation remain stationary/constant over time. It is more difficult to carry out forecasting if everything is different tomorrow. It is important for us to find common statistical properties to best look into the future. Differencing refers to the number of times we need to difference (subtract an observation from an observation at the previous time step) the time series against itself to attain stationarity.
When we try and forecast a value using a forecasting model, there will be a delta between our prediction and the actual value. The MA part can be thought of as a collection of these error terms. We want to incorporate these error terms to factor in random fluctuations/irregularities when making our forecast.
Now here comes the fun part- implementation. From my research, R seemed to be best suited for implementing time-series forecasting so I chose to this for my forecasting task.
#uploading data series
passflow = read.csv("~/Downloads/passflow-w.csv", header = TRUE)
#ensuring that dates are in the right format
passflow <- passflow %>%
#grouping values in data series by month
cleanpass <- passflow %>%
colnames(cleanpass) <- "dates"
#filtering out the last month
filter(dates < as.Date('2018-09-01'))
After loading the relevant libraries and my dataset, I grouped the time series by month, removing the need to look at daily and weekly trends (e.g if customer flow tends to peak towards month end etc). I then filtered out dates after the 1st of September 2018 as this was the last observed month. I assumed that this would not be valuable for my time series analysis as this was an incomplete month.
#convert to time series
cleanpasstime <- ts(cleanpass,frequency =12,start=2012)
#plotting seasonal means/medians
Using the ts function, I converted the numerical values in my dataset to a time series, specifying monthly frequency and the commencement date. This returns a time series of monthly values from 2012 to 2018.
Using a boxplot for visualisation and the cycle I created a visualisation showing thenmonthly trends in my time series.
Average customer flow peaks in January, falling to its lows in June and progressively rises up during the last half of the year. Without further context about the nature of the business it would be difficult making any further assumptions about what these trends are caused by. It does however show us that there are seasonality trends that need to be factored into my time series.
#plotting yearly trends
I then created a visualisation of yearly trends to understand whether there is an overall downward or upward trend in customer flow by looking at mean traffic values over the course of each year.
Testing for stationarity
To test for stationarity I used the Augmented Dickey-Fuller test (ADF). This is a unit root test that is typically used for testing the stationarity of a time series. I’m not going to claim to have a deep understanding of the computations carried out in this test, but will run through a high level explanation of the results.
#Augmented Dickey-Fuller Test to test for stationarity
lapply(cleanpasstime, function(x) adf.test(x[!is.na(x)],
#### RESULTS ####
Augmented Dickey-Fuller Test
Dickey-Fuller = -3.9389, Lag order = 4, p-value = 0.01676
alternative hypothesis: stationary
With ADF the more negative the test statistic the stronger the case is for the rejection of the null hypothesis that the data is non stationary. Since the p-value is under 5% the ADF test assumes stationary. However it is important to note that since I chose the default value of the lag order, which in this case is 4, the ADF is actually testing whether the information of the past 4 months best explains current values. This does not however fully factor in seasonality which may be cyclical beyond the 4 month lag period. We can visualise our time-series again to understand what the business cycle looks like.
I could opt to use the Kwiatkowski-Phillips-Schmidt-Shin test (KPSS) to test whether my series is stationary around a mean or linear trend. The null hypothesis in this case would indicate that my series is stationary around a mean or linear trends. However, I did not, as I assumed that the time series appeared to be stationary- I am however open to revisiting this assumption.
Training and Test split
I want to be able to test the accuracy of my predictions, in order to do this, I created a training and test data. Test data will be used to gauge how well predictions made by the model I trained with ARIMA perform against observed customer flow.
#training and test
train_pass = window(cleanpasstime,start = c(2012,1),end=c(2018,5))
test_pass = window(cleanpasstime,start=c(2018,6),end=c(2018,8))
Forecasting with ARIMA
Three parameters need to be tuned to find the best ARIMA fit- p (AR),q(MA),d(I). Since this can be a time consuming, I instead opted to use auto.arima. This function returns the best ARIMA parameters.
#finding best values for arima model
vals <- auto.arima(train_pass,stepwise=FALSE,approximation=FALSE,seasonal =TRUE)
#### RESULTS ####
According to auto.arima, the best parameters for my model are (1,1,1) and (2,0,0) to capture seasonal parameters.
#fitting model and forecasting
fit <- arima(train_pass,c(1,1,1),seasonal=list(order=c(2,0,0),period=12))
#plotting predictions agains observations
After adding the prescribed parameters I created a forecast for 3 years into the future, I then proceeded to plot these values, also making sure to plot my test observations to get a rough idea of the accuracy of this model.
Testing model accuracy
The accuracy function is available to diagnose the fit of the ARIMA model by printing out multiple measures of accuracy.
index <- c(3,5)
arima_forecasts = forecast(vals,h=3*12)
### RESULTS ####
Training set 36850.42 6.219505
Test set 13896.83 2.201053
To assess model fit I chose to use the mean absolute error (MAE), and the mean percentage error (MPE).
MAE computes the difference between predicted values and observed values over the overall number of observations.
While MAPE carries out the same calculation but instead returns this as a percentage.
For both these values the lower the values the better the model fit.
As detailed in the accuracy table MAE indicates that the average magnitude of errors in my forecasting model is 13896 customers while an MAPE of 8.6 indicates that on average my predictions are off by about 8% on my test series.
I can compare this with a seasonal naive forecast to assess whether the ARIMA is an improvement over naive forecasting. Naive forecasting approach produces forecasts equivalent to the last observed value. Seasonal naive goes a step further by factoring in seasonality. Each prediction becomes equivalent to the last observed value of the same season.
naive <- snaive(train_pass, h=3*12)
#### RESULTS ####
Training set 52328.46 2.895538
Test set 53218.33 8.611081
This visibly performs worse on my test series further validating the goodness of fit of my ARIMA model.
An interesting alternative to my forecasting problem is using an open source package created by Facebook that makes the task of forecasting more accessible and easier to carry out. For more detail the white paper is widely available for further reading.
The great part of fbprophet is that it automatically detects change points in your time series and allows you to factor in hourly, daily, weekly, monthly, yearly trends and even allows you to factor inchanges of trends during pre-defined holidays to make you forecast more accurate.
p <- prophet(passflow,yearly.seasonality = TRUE,seasonality.mode = 'multiplicative')
fut_vals <- make_future_dataframe(p,periods=3*365)
forecasts <- predict(p,fut_vals)
After a few trials, I got the best model from turning on yearly seasonality and setting seasonality mode as multiplicative. This improved the accuracy of my model because seasonality grows with the trend in my time series, it is not a constant additive.
The forecasted value reveals both the trends, the upper and lower bound forecasts (lowest and highest forecasted values).
I also opted to plot the trend, daily and yearly seasonality to understand underlying trends and other components in the model.
From these charts it appears as though customer traffic starts picking up from Thursday leading up to the weekend. Customer traffic is at its peak during weekends- presumably because people are more likely to frequent this store when they not at work or at school. From a month to month basis we see upward oscillations around December-January period and the lowest customer flow around the June-July period. It appears as though the store generally gets more traffic in the first and last quarter of the year.
The great part about FBprophet is the ease of cross validation. The package comes with a function for easy cross validation allowing you to set the ‘test’ time frame you will use to evaluate the model’s accuracy.
df <- cross_validation(p,initial=2300,horizon=90,units='days')
index_cv <- c(4,5)
cv_metrics <- performance_metrics(df)
#### RESULTS ####
Using the MAE and MAPE as metric for evaluation this model appears to perform better than my ARIMA model.