Time Series R with duplicate Items for daily forecast - r

I would like some guidance on how to plot daily data and forecast it in R.
There are few purchases on Saturdays and Sundays in this data, and some weekdays have no purchases at all, which is an obstacle for the analysis.
I have around 300 rows with various item names. Items are repeated within the column, but with different dates.
For example, I bought exactly 1 soap 3 times in a week: on Monday, Wednesday and Sunday.
This is the example data table :
My trouble so far is that forecasting manually in other statistical software takes me a long time, so I am trying to learn R from scratch to see whether it can save time. The table above has been loaded into R, and the date has been converted from factor to Date class using as.Date(data$Date).
I usually use exponential smoothing, since purchases are still low and items are sometimes out of stock, so the historical data does not show much of a pattern. The goal of this analysis is a daily forecast of purchases per item, so that we know when to order an item.

First please consider adding a reproducible example for a more substantial answer. Look at the most upvoted question with tag R for a how-to.
EDIT: I think this is what you want before creating the ts:
data.agg <- aggregate(data$purchase, by = list(data$date, data$item), FUN = sum)
If your data is not yet of class 'ts' you can create a time-series object with the ts() command. From the ?ts page:
ts(data = NA, start = 1, end = numeric(), frequency = 1,
deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )
as.ts(x, ...)
Generally you could use the HoltWinters function for exponential smoothing like so:
data.hw <- HoltWinters(data)
data.predict <- predict(data.hw, n.ahead = x) # for x = units of time ahead you would like to predict
See also ?HoltWinters for more info on the function
Reproducible Example for aggregate:
data <- data.frame(date = c(1, 2, 1, 2, 1, 1), item = c('b','b','a','a', 'a', 'a'), purchase = c(5,15, 23, 7, 12, 11))
data.agg <- aggregate(data$purchase, by = list(data$date, data$item), FUN = sum)
Reproducible Example for HoltWinters:
library(AER)
data("UKNonDurables")
nd <- window((log(UKNonDurables)), end = c(1970, 4))
tsp(nd)
hw <- HoltWinters(nd)
pred <- predict(hw, n.ahead = 35)
pred
plot(hw, pred, ylim = range(log(UKNonDurables)))
lines(log(UKNonDurables))
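For the question's setting (repeated items, days with no purchases at all), a minimal base-R sketch of the whole path might look like the following. The data are simulated, and the names `purchases`, `daily` and `full` are made up for illustration. Simple exponential smoothing is obtained here by switching off the trend and seasonal terms of HoltWinters, which is one reasonable choice for sparse data, not the only one.

```r
set.seed(42)

# Hypothetical purchase log: one row per purchase, so dates repeat
days <- seq(as.Date("2023-01-02"), by = "day", length.out = 56)
wday <- as.POSIXlt(days)$wday                # 0 = Sunday, 6 = Saturday
prob <- ifelse(wday %in% c(0, 6), 0.2, 1)    # fewer weekend purchases
purchases <- data.frame(
  date     = sample(days, 120, replace = TRUE, prob = prob),
  item     = "soap",
  purchase = 1
)

# 1. Collapse duplicates: one row per date and item
daily <- aggregate(purchase ~ date + item, data = purchases, FUN = sum)

# 2. Add the days without any purchase as explicit zeros
all_days <- data.frame(date = seq(min(daily$date), max(daily$date), by = "day"))
full <- merge(all_days, daily[, c("date", "purchase")], all.x = TRUE)
full$purchase[is.na(full$purchase)] <- 0

# 3. Daily series with a weekly cycle, then simple exponential smoothing
y   <- ts(full$purchase, frequency = 7)
fit <- HoltWinters(y, beta = FALSE, gamma = FALSE)
fc  <- predict(fit, n.ahead = 7)             # forecast for the next 7 days

plot(fit, fc)
```

With several items, the same steps could be applied per item, for example by splitting `daily` on `item` first.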

Related

Time series daily data modeling

I am looking to forecast my time series. I have daily data for the period 2021-Jan-1 to 2022-Jul-1.
So I have a column of observations for each day.
What I have tried so far:
d1=zoo(data, seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1))
tsdata <- ts(d1, frequency = 365)
ddata <- decompose(tsdata, "multiplicative")
I get the following error here:
Error in decompose(tsdata, "multiplicative") :
time series has no or less than 2 periods
From what I have read, this seems to be because I do not have two full years. Is that correct? I have tried doing it weekly as well:
series <- ts(data, frequency = 52, start = c(2021, 1))
but I get the same issue.
How can I decompose the series without extending my dataset to two years, since I do not have that much data?
Also, when I actually try to forecast it, the forecast is not good enough:
[Image: plot with forecast]
My data somewhat resembles a bell curve during that period. Is there a better-fitting time series model I can apply instead?
A weekly frequency for daily data should have frequency = 7, not 52. It's possible that this fix to your code will produce a model with a seasonal term.
I don't think you'll be able to produce a time series model with annual seasonality with less than 2 years of data.
You can either produce a model with only weekly seasonality (I expect this is what most folks would recommend), or if you truly believe in the annual seasonal pattern exhibited in your data, your "forecast" can be a seasonal naive forecast that is simply last year's value for that particular day. I wouldn't recommend this, because it just seems risky, and I don't really see the same trajectory in your screenshot over 2022 that's apparent in 2021.
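To see why the yearly frequency fails while the weekly one works, it can help to compare the series length with two full periods. The sketch below uses simulated positive values over the same date range as the question, not the asker's data, and the object names are only illustrative.

```r
set.seed(1)
n <- length(seq(as.Date("2021-01-01"), as.Date("2022-07-01"), by = "day"))
x <- abs(rnorm(n)) + 1                  # simulated, strictly positive values

yearly <- ts(x, frequency = 365)        # needs at least 2 * 365 observations
weekly <- ts(x, frequency = 7)          # needs at least 2 * 7 observations

n >= 2 * frequency(yearly)              # FALSE, so decompose() will refuse
n >= 2 * frequency(weekly)              # TRUE

# Capture the error message instead of stopping the script
msg <- tryCatch(decompose(yearly, "multiplicative"),
                error = function(e) conditionMessage(e))
msg

# The weekly version decomposes without complaint
dec <- decompose(weekly, "multiplicative")
plot(dec)
```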
decompose requires at least two full cycles, and a full cycle must span 1 time unit. The ts class can't use the Date class anyway. To use frequency 7 we must use times 1/7th apart, such as 1, 1+1/7, 1+2/7, etc., so that 1 cycle (7 days) covers 1 unit. Then just label the plot appropriately rather than using those times on the X axis. In the code below use %Y in place of %y if the years start in 19?? and end in 20?? so that tapply maintains the order.
# test data
set.seed(123)
s <- seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1)
data <- rnorm(length(s))
tsdata <- ts(data, freq = 7)
ddata <- decompose(tsdata, "multiplicative")
plot(ddata, xaxt = "n")
m <- tapply(time(tsdata), format(s, "%y/%m"), head, 1)
axis(1, m, names(m))

Time series prediction with and without NAs (ARIMA and Forecast package) in R

This is my first question on stack overflow.
Situation: I have 2 time series. Both series have the same values but the second series has 5 NAs at the start. Hence, first series has 105 observations, where 2nd series has 110 observations. I have fitted an ARIMA(0,1,0) using the Arima function to both series separately. And then I used the forecast package to predict 10 steps to the future.
Issue: Even though the ARIMA coefficients for both series are the same, the projections (10 steps) appear to be different. I am uncertain why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting a seed, creating the index manually, and using auto.arima for the model fitting. However, none of these steps helped me reconcile the difference.
I have added a picture to show what I see. Please note I have hidden the middle part of the series so that you can see its start and end. The yellow highlighted cells are the projection outputs from the forecast package. I manually added the index as years after extracting the results from R.
[Image: projected and base time series in Excel]
library(forecast) # provides Arima() and forecast()
Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)
#ARIMA with NA
Simple_Arima <- Arima(
ts(Rates$Rates1),
order = c(0,1,0),
include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
#ARIMA Without NA
Rates2 <- as.data.frame(Rates$Rates2)
##Remove the final spaces from the CSV
Rates2 <- Rates2[-c(106,107,108,109,110),]
Simple_Arima2 <- Arima(
ts(Rates2),
order = c(0,1,0),
include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
The link to data is here, CSV format
Could you share your data and code such that others can see if there is any issue with it?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)
library(xts)
set.seed(123)
ts1 <- arima.sim(model = list(order = c(0, 1, 0)), n = 104) # ARIMA(0,1,0): returns n + 1 = 105 values
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE

How to forecast time series for many products, including a seasonality factor in R

Following my previous question with one product to forecast, let's say I have 5 products to forecast as in the following data:
units_vector <- c(89496264,81820040,80960072,109164545,96226255,96270421,95694992,117509717,105134778,0)
library(data.table)
dt <- data.table('time' = rep(c(1:10),3),
'units'= c(units_vector,units_vector+runif(n = 10, max = 1000000),units_vector+runif(n = 10, max = 1000000)),
'product' = c(rep("A", 10),rep("B", 10),rep("C", 10))
)
I would like to forecast the units for time = 10 for all products.
I can see that at time = 4*k, where k = 1, 2, ..., there is a big increase in units, and I would like to include that as a seasonality factor.
How could I do that in R? Maybe with prophet? Any other library or approach will also do.
You do the same: fit the model y = a*season + b, with b = p(product) depending on the product and a equal for all products (if you assume the seasonality is the same for all products).
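One way to read this answer is as a linear model with a shared seasonal dummy and a per-product intercept. The sketch below is an assumption about what that could look like with base R's lm. The `season` column and the use of a data.frame instead of data.table are choices made here for illustration, and the model ignores any trend over time.

```r
set.seed(1)
units_vector <- c(89496264, 81820040, 80960072, 109164545, 96226255,
                  96270421, 95694992, 117509717, 105134778, 0)
df <- data.frame(
  time    = rep(1:10, 3),
  units   = c(units_vector,
              units_vector + runif(10, max = 1e6),
              units_vector + runif(10, max = 1e6)),
  product = rep(c("A", "B", "C"), each = 10)
)

# Seasonal dummy: 1 when time is a multiple of 4, otherwise 0
df$season <- as.integer(df$time %% 4 == 0)

# Fit on times 1..9; time 10 is the period to forecast
train <- df[df$time < 10, ]
fit <- lm(units ~ season + product, data = train)

# One shared seasonal effect (a) and one intercept shift per product (b)
coef(fit)

# Forecast time 10 for every product
new <- df[df$time == 10, c("time", "season", "product")]
new$forecast <- predict(fit, newdata = new)
new
```

If the seasonal jump were expected to differ by product, an interaction term (`season * product`) would be the natural extension.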

How to forecast using ragged edge data in a MIDAS model using the MIDASR package?

I am trying to generate a 1-step-ahead forecast of a quarterly variable using a monthly variable with the midasr package. The trouble I am having is that I can only estimate a MIDAS model when the number of monthly observations in the sample is exactly 3 times as much the number of quarterly observations.
How can I forecast in the midasr package when the number of monthly observations is not an exact multiple of the quarterly observations (e.g. when I have a new monthly data point that I want to use to update the forecast)?
As an example, suppose I run the following code to generate a 1-step-ahead forecast when I have (n) quarterly observations and (3*n) monthly observations:
#first I create the quarterly and monthly variables
n <- 20
qrt <- rnorm(n)
mth <- rnorm(3*n)
#I convert the data to time series format
qrt <- ts(qrt, start = c(2009, 1), frequency = 4)
mth <- ts(mth, start = c(2009, 1), frequency = 12)
#now I estimate the midas model and generate a 1-step ahead forecast
library(midasr)
reg <- midas_r(qrt ~ mls(qrt, 1, 1) + mls(mth, 3:6, m = 3, nealmon), start = list(mth = c(1, 1, -1)))
forecast(reg, newdata = list(qrt = c(NA), mth =c(NA, NA, NA)))
This code works fine. Now suppose I have a new monthly data point that I want to include, so that the new monthly data is:
nmth <- rnorm(3*n +1)
I tried running the following code to estimate the new model:
reg <- midas_r(qrt ~ mls(qrt, 1, 1) + mls(nmth, 2:7, m = 3, nealmon), start = list(mth = c(1, 1, -1))) #I now use 2 lags instead 3 with the new monthly data
However I get an error message saying: 'Error in mls(nmth, 2:7, m = 3, nealmon) : Incomplete high frequency data'
I could not find anything online on how to deal with this problem.
A while ago I dealt with a similar question. If I remember correctly, you first need to estimate the model on the old dataset with a reduced lag, so instead of using lags 3:6 you should use lags 2:6:
reg <- midas_r(qrt ~ mls(qrt, 1, 1) + mls(mth, 2:6, m = 3, nealmon), start = list(mth = c(1, 1, -1)))
Then suppose you observe a new value of the higher-frequency data, new_value:
new_value <- rnorm(1)
Then you can use this newly observed value to forecast the lower-frequency variable as follows:
forecast(reg, newdata = list(mth = c(new_value, rep(NA, 2))))

Interpolate missing values in a time series with a seasonal cycle

I have a time series for which I want to intelligently interpolate the missing values. The value at a particular time is influenced by a multi-day trend, as well as its position in the daily cycle.
Here is an example in which the tenth observation is missing from myzoo
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- zoo(obs, index)
myzoo[10] <- NA
If I had to implement this, I'd use some kind of weighted mean of close times on nearby days, or add a value for the day to a function line fitted to the larger trend, but I hope there already exist some package or functions that apply to this situation?
EDIT: Modified the code slightly to clarify my problem. There are na.* methods that interpolate from nearest neighbors, but in this case they do not recognize that the missing value is at the time that is the lowest value of the day. Maybe the solution is to reshape the data to wide format and then interpolate, but I wouldn't like to completely disregard the contiguous values from the same day. It is worth noting that diff(myzoo, lag = 4) returns a vector of 10's. The solution may lie with some combination of reshape, na.spline, and diff.inv, but I just can't figure it out.
Here are three approaches that don't work:
EDIT2. Image produced using the following code.
myzoo <- zoo(obs, index)
myzoo[10] <- NA # knock out the missing point
plot(myzoo, type="o", pch=16) # plot solid line
points(na.approx(myzoo)[10], col = "red")
points(na.locf(myzoo)[10], col = "blue")
points(na.spline(myzoo)[10], col = "green")
myzoo[10] <- 31 # replace the missing point
lines(myzoo, type = "o", lty=3, pch=16) # dashed line over the gap
legend(x = "topleft",
legend = c("na.spline", "na.locf", "na.approx"),
col=c("green","blue","red"), pch = 1)
Try this:
x <- ts(myzoo, frequency = 4)
fit <- ts(rowSums(tsSmooth(StructTS(x))[,-2]))
tsp(fit) <- tsp(x)
plot(x)
lines(fit,col=2)
The idea is to use a basic structural model for the time series, which handles the missing value fine using a Kalman filter. Then a Kalman smooth is used to estimate each point in the time series, including any omitted.
I had to convert your zoo object to a ts object with frequency 4 in order to use StructTS. You may want to change the fitted values back to zoo again.
In this case, I think you want a seasonality correction in the ARIMA model. There's not enough data here to fit the seasonal model, but this should get you started.
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- myzoo.orig <- zoo(obs, index)
myzoo[10] <- NA
myzoo.fixed <- na.locf(myzoo)
myarima.resid <- arima(myzoo.fixed, order = c(3, 0, 3), seasonal = list(order = c(0, 0, 0), period = 4))$residuals
myzoo.reallyfixed <- myzoo.fixed
myzoo.reallyfixed[10] <- myzoo.fixed[10] + myarima.resid[10]
plot(myzoo.reallyfixed)
points(myzoo.orig)
In my tests the ARMA(3, 3) is really close, but that's just luck. With a longer time series you should be able to calibrate the seasonal correction to give you good predictions. A good prior on the underlying mechanisms for both the signal and the seasonal correction would help you get better out-of-sample performance.
forecast::na.interp is a good approach. From the documentation
Uses linear interpolation for non-seasonal series and a periodic stl decomposition with seasonal series to replace missing values.
library(forecast)
fit <- na.interp(myzoo)
fit[10] # 32.5, vs. 31.0 actual and 32.0 from Rob Hyndman's answer
This paper evaluates several interpolation methods against real time series, and finds that na.interp is both accurate and efficient:
From the R implementations tested in this paper, na.interp from the forecast package and na.StructTS from the zoo package showed the best overall results.
The na.interp function is also not that much slower than na.approx [the fastest method], so the loess decomposition seems not to be very demanding in terms of computing time.
Also worth noting that Rob Hyndman wrote the forecast package, and included na.interp after providing his answer to this question. It's likely that na.interp is an improvement upon this approach, even though it performed worse in this instance (probably due to specifying the period in StructTS, where na.interp figures it out).
Package imputeTS has a method for Kalman Smoothing on the state space representation of an ARIMA model - which might be a good solution for this problem.
library(imputeTS)
na_kalman(myzoo, model = "auto.arima")
It also works directly with zoo time series objects. If you think you can do better than "auto.arima", you can also pass your own ARIMA model to this function, like this:
library(imputeTS)
usermodel <- arima(myts, order = c(1, 0, 1))$model
na_kalman(myts, model = usermodel)
Here myts is the zoo object converted to ts, which you have to do in this case since arima() only accepts ts.
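The "reshape to wide format and then interpolate" idea from the question can also be sketched in base R alone. This is only an illustration on the question's toy values, with a plain numeric vector in place of the zoo object. It interpolates each time-of-day slot across days and deliberately ignores the neighbouring values from the same day, which the asker wanted to keep in play, so treat it as a baseline rather than a full solution.

```r
# Rebuild the toy series from the question as a plain vector
dayvals  <- (1:4) * 10
timevals <- c(3, 1, 2, 4)
obs <- rep(dayvals, each = 4) + rep(timevals, times = 4)
obs[10] <- NA                      # knock out the tenth observation

# Wide layout: rows = time-of-day slot, columns = day
wide <- matrix(obs, nrow = 4)

# Linear interpolation within each slot, across days
filled <- t(apply(wide, 1, function(slot) {
  ok <- !is.na(slot)
  approx(which(ok), slot[ok], xout = seq_along(slot), rule = 2)$y
}))

# Back to the original long order
imputed <- as.vector(filled)
imputed[10]                        # 31 for this toy series
```

Because the toy series is exactly linear across days within each slot, this recovers the true value here; real data would not be that cooperative.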

Resources