forecast() in R repeats predictions - r

I'm trying to build a forecast to predict future values of a keyword from Google Trends data.
My data is the daily indexed search volume from Jan 1 to Jun 30, 2020 for a keyword, saved in a CSV file:
Date | Keyword
2020-01-01 | 55
2020-01-02 | 79
2020-01-03 | 29
...
2020-06-29 | 19
2020-06-30 | 32
My R code seems to work okay until it generates the forecasts.
library(forecast)
data <- read.csv("<file path>.csv", header=TRUE)
#build time series data
#start and end periods of observed data
inds <- seq(as.Date("2020-01-01"), as.Date("2020-06-30"), by = "day")
#the frequency = 7 days (i.e. week)
sts <- ts(data$Keyword, start = c(2020, as.numeric(format(inds[1], "%j"))), frequency = 7)
#generate the forecast
model.ets <- ets(sts, model = "ANA")
fc.ets <- forecast(model.ets, h = 60)
plot(fc.ets)
The problem I'm having is that the forecast simply repeats the same pattern (doesn't seem to take into account the error, trend and/or seasonality to adjust the predictions).
I think I need to adjust the forecast() function but not sure how to do it.

Here we have a daily series spanning less than a year that appears to show weekly seasonality. As described in section 2.1 (ts objects) of https://otexts.com/fpp2/ts-objects.html, the frequency given to the ts object below is 52.18, which is 365.25/7, the average number of weeks in a year (allowing for leap years). ets() can't handle data with a frequency greater than 24 unless it is combined with STL (Seasonal and Trend decomposition using Loess), so I would recommend exploring other models. The point forecasts from STL + ETS(A,Ad,N) [2nd best model] look the most realistic, but its prediction intervals are much wider than those of the TBATS(1, {0,0}, 0.92, {<52.18, 6>}) model [best model]. Please see and play around with the code below:
ts_ausAirBnb <- ts(ausAirBnb$airbnb_australia_, start = min(ausAirBnb$day), frequency = 52.18)
plot(decompose(ts_ausAirBnb))
snaivefit <- snaive(ts_ausAirBnb)
snaivefcast <- forecast(snaivefit, h = 60)
aafit <- auto.arima(ts_ausAirBnb)
aafcast <- forecast(aafit, h = 60)
stlffit <- stlf(ts_ausAirBnb, h = 60)
stlfcast <- forecast(stlffit, h = 60)
stlmfit <- stlm(ts_ausAirBnb)
stlmfcast <- forecast(stlmfit, h = 60)
tbatsfit <- tbats(ts_ausAirBnb)
tbatsfcast <- forecast(tbatsfit, h = 60)
nnetfit <- nnetar(ts_ausAirBnb)
nnetfcast <- forecast(nnetfit, h = 60)
autoplot(snaivefcast)
autoplot(aafcast)
autoplot(stlfcast)
autoplot(stlffit)
autoplot(stlmfcast)
autoplot(tbatsfcast)
autoplot(nnetfcast)
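As for why the original forecast repeats: an ETS(A,N,A) model has an additive seasonal component and no trend, so its point forecasts are the last estimated level plus the seasonal index for each day of the week, and that pattern necessarily repeats every 7 steps. The sketch below illustrates this on simulated data. It uses base R's HoltWinters() with beta = FALSE as a stand-in for ets(model = "ANA") (an assumption, chosen so the example needs no extra packages), and the smoothing parameters 0.2 and 0.1 are arbitrary.

```r
# Simulated daily series with a weekly pattern (not the Google Trends data)
set.seed(1)
n <- 182
weekly <- rep_len(c(10, 5, 0, -5, -10, 0, 0), n)
y <- ts(50 + weekly + rnorm(n, sd = 3), frequency = 7)

# Additive seasonality, no trend; alpha and gamma fixed at arbitrary values
fit <- HoltWinters(y, alpha = 0.2, beta = FALSE, gamma = 0.1)

# Point forecasts for the next two weeks
p <- as.numeric(predict(fit, n.ahead = 14))
week1 <- p[1:7]
week2 <- p[8:14]

# The second week of forecasts is a copy of the first
print(isTRUE(all.equal(week1, week2)))
```

So a repeating point forecast is what this model class is expected to produce; the uncertainty shows up in the prediction intervals, which widen with the horizon.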

Related

Fit double logistic function to a time series

For the following time series data:
#1. dates of 15 day frequency:
dates = seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by=15) #96 times observation
#2. water content in crops corresponding to the times given.
water <- c(0.5702722, 0.5631781, 0.5560839, 0.5555985, 0.5519783, 0.5463459,
0.5511598, 0.546652, 0.5361545, 0.530012, 0.5360571, 0.5396569,
0.5683526, 0.6031535, 0.6417821, 0.671358, 0.7015542, 0.7177007,
0.7103561, 0.7036985, 0.6958607, 0.6775161, 0.6545367, 0.6380155,
0.6113306, 0.5846186, 0.5561815, 0.5251135, 0.5085149, 0.495352,
0.485819, 0.4730029, 0.4686458, 0.4616468, 0.4613918, 0.4615532,
0.4827496, 0.5149105, 0.5447824, 0.5776764, 0.6090217, 0.6297454,
0.6399422, 0.6428941, 0.6586344, 0.6507473, 0.6290631, 0.6011123,
0.5744375, 0.5313527, 0.5008027, 0.4770338, 0.4564025, 0.4464508,
0.4309046, 0.4351668, 0.4490393, 0.4701232, 0.4911582, 0.5162941,
0.5490387, 0.5737573, 0.6031149, 0.6400073, 0.6770058, 0.7048311,
0.7255012, 0.739107, 0.7338938, 0.7265202, 0.6940718, 0.6757214,
0.6460862, 0.6163091, 0.5743775, 0.5450822, 0.5057753, 0.4715266,
0.4469859, 0.4303232, 0.4187793, 0.4119401, 0.4201316, 0.426369,
0.4419331, 0.4757525, 0.5070846, 0.5248457, 0.5607567, 0.5859825,
0.6107531, 0.6201754, 0.6356589, 0.6336177, 0.6275579, 0.6214981)
I want to fit a double-logistic function curve to the data.
I found some examples and packages that can be of help,
https://greenbrown.r-forge.r-project.org/man/FitDoubleLogElmore.html
and an example here - Indexes overlap error when using dplyr to run a function.
However, the examples given only consider annual time series.
I have tried to fit the function as:
x <- ts(water, start = c(2016,17), end = c(2020, 16), frequency = 24)
smooth.water = FitDoubleLogBeck(x, weighting = T, hessian = F, plot = T, ninit = 10)
plot(water)
plot(smooth.water$predicted)
plot(water- smooth.water$predicted)
However, this function does not seem to fit the entire time series. How can I run the function to fit the entire time series? Also, I noticed that the output differs from run to run, and I am not sure why.
FitDoubleLogBeck can only deal with one year of data, so you need to analyze the data year by year. To do that, take a one-year window and fit each year separately.
As for the different results on different runs: the algorithm chooses its initial parameters randomly. The graph of a double logistic curve is bell-shaped. However, you are applying the algorithm to "sine"-like data while it expects a "bell". It then treats the water data as a cloud of points, so the results are meaningless and very sensitive to the initial parameter settings.
Code:
library(greenbrown) # provides FitDoubleLogBeck
set.seed(123)
par(mfrow = c(1, 3))
# water vector taken from question above
x <- ts(water, start = c(2016,17), end = c(2020, 16), frequency = 24)
res <- sapply((2017:2019), function(year) {
x2 <- as.vector(window(x, start=c(year, 1), end=c(year, 24)))
smooth.water2 = FitDoubleLogBeck(x2, weighting = T, hessian = F, plot = T, ninit = 10)
title(main = year)
c(year = year, smooth.water2$params)
})
t(res)
Output:
year mn mx sos rsp eos rau
[1,] 2017 -0.7709318 0.17234293 16.324163 -0.6133117 6.750885 -0.7618376
[2,] 2018 -0.8900971 0.09398673 7.529345 0.6701200 17.319465 0.8277409
[3,] 2019 -4.7669470 -0.34648434 15.930455 -0.2570877 10.690043 -0.2267284
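To make the "bell" shape mentioned above concrete, here is a minimal sketch of one common form of a double logistic curve, built from a rising and a falling logistic. The parameter names follow the output columns above (mn, mx, sos, rsp, eos, rau), but the exact formula is an assumption and may not match the parameterization FitDoubleLogBeck uses internally. The parameter values are made up for illustration.

```r
# One common double logistic form (assumed, not taken from greenbrown's source):
# a rise around `sos` with rate `rsp`, then a fall around `eos` with rate `rau`
double_logistic <- function(t, mn, mx, sos, rsp, eos, rau) {
  rise <- 1 / (1 + exp(-rsp * (t - sos)))
  fall <- 1 / (1 + exp(rau * (t - eos)))
  mn + (mx - mn) * (rise + fall - 1)
}

# 24 time steps per year, as in the ts object above; values are illustrative
t_steps <- 1:24
curve_vals <- double_logistic(t_steps, mn = 0.45, mx = 0.70,
                              sos = 8, rsp = 1.5, eos = 17, rau = 1.5)

# Low at the start, a plateau near mx in mid-season, low again at the end
plot(t_steps, curve_vals, type = "b", xlab = "time step", ylab = "fitted value")
```

A one-year window that starts at a trough and ends at the next trough has this shape; a window that cuts through the peak does not, which is one reason the fitted parameters above look unstable.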

R: monthly expanding regression using daily data

I want to perform an expanding regression at monthly frequency using daily data. The model is:
ret = \beta_0 + \beta_1 X + \varepsilon
Sample data and my attempt:
library(zoo)
library(dplyr) # for %>% and mutate
df = data.frame(date = seq(as.Date('2011-01-01'),as.Date('2011-03-31'),by = 1), ret = rnorm(90,0,1), X = rnorm(90,0,1))
roll = function(data, n = nrow(data)) {
rollapplyr(1:n, 1:n, function(x) coef(lm(ret ~ X, data, subset = x))[[2]])
}
output = df %>%
mutate(coefficient = roll(data.frame(ret, X)))
The code above runs expanding regression by row and I could extract just the last value in each month to get the coefficients for that month (i.e., in this example, I only need coefficients estimated on Jan-31, Feb-28 and Mar-31).
However, I need to apply this code to a large dataset, and to save time I only want the regressions to run at the last day of each month in the expanding style (i.e., not run regression every day). I'd appreciate if someone can help point out a way to improve the code here.
Create a function coef2 that, given a yearmon object, computes the second regression coefficient (the slope) using all dates up to and including that year and month.
library(zoo)
coef2 <- function(ym, data) {
coef(lm(ret ~ X, data, subset = as.yearmon(date) <= ym))[[2]]
}
yearmonth <- unique(as.yearmon(df$date))
data.frame(yearmonth, slope = sapply(yearmonth, coef2, data = df))
## yearmonth slope
## 1 Jan 2011 0.2208940
## 2 Feb 2011 0.1792896
## 3 Mar 2011 0.1180308
If performance is an issue, you could try this alternative version of coef2, which avoids the use of lm:
coef2 <- function(ym, data) {
with(subset(data, as.yearmon(date) <= ym), cov(ret, X) / var(X))
}
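The shortcut works because, in a regression with a single predictor, the least-squares slope equals cov(ret, X) / var(X). The sketch below checks this on simulated data. It is a base R variant (an assumption made to keep it self-contained): months are identified with format(date, "%Y-%m") instead of zoo's as.yearmon, which compares correctly as text because the year comes first.

```r
# Simulated data in the same layout as the question
set.seed(42)
df <- data.frame(date = seq(as.Date("2011-01-01"), as.Date("2011-03-31"), by = 1),
                 ret = rnorm(90), X = rnorm(90))

# "2011-01", "2011-02", ... sort correctly as strings
month_id <- format(df$date, "%Y-%m")
months <- unique(month_id)

# Expanding window: each month uses all rows up to the end of that month
slope_lm <- sapply(months, function(m) {
  coef(lm(ret ~ X, data = df[month_id <= m, ]))[[2]]
})
slope_cov <- sapply(months, function(m) {
  with(df[month_id <= m, ], cov(ret, X) / var(X))
})

# Both approaches give the same three slopes
print(isTRUE(all.equal(slope_lm, slope_cov)))
```

Only one fit is run per month, so the cost grows with the number of months rather than the number of days.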

Trying to forecast the next 24hrs temperature using UCI repository datasets in R programming

Hi I accessed the datasets from the UCI repository http://archive.ics.uci.edu/ml/datasets/Air+Quality
I am trying to predict the next 24 hours of temperature. Below is the code I have written.
filling the missing values by NA
library(plyr)
AirQualityUCI[AirQualityUCI==-200.0]<-NA
Replacing the NA by mean of each columns
for(i in 1:ncol(AirQualityUCI)){
AirQualityUCI[is.na(AirQualityUCI[,i]),i] <- mean(AirQualityUCI[,i], na.rm = TRUE)
}
plot time series
plot(AirQualityUCI$T, type = "l")
How do I set the frequency in hours and predict the temperature for the next 24 hours?
Tempts <- ts(AirQualityUCI)
Temprforecasts <- HoltWinters(Tempts, beta=FALSE, gamma=FALSE)
library(forecast)
accuracy(Temprforecasts,24)
Getting the below error
Error in attr(x, "tsp") <- value :
invalid time series parameters specified
library(readxl)
AirQualityUCI <- read_excel("AirQualityUCI.xlsx")
library(plyr)
AirQualityUCI[AirQualityUCI==-200.0]<-NA
#First, limit to the one column you are interested in (make sure data is sorted by time variable before doing this)
library(data.table)
temp <- setDT(AirQualityUCI)[,c("T")]
#Replace NA with mean
temp$T <- ifelse(is.na(temp$T), mean(temp$T, na.rm=TRUE), temp$T)
#Create time series object...in this case freq = 365 * 24 (hours in year)
Tempts <- ts(temp, frequency = 365*24)
#Model
Temprforecasts <- HoltWinters(Tempts, beta = FALSE, gamma = FALSE)
#Generate next 24 hours forecast
library(forecast)
output.forecast <- forecast(Temprforecasts, h = 24) # call the generic; it dispatches on the HoltWinters fit
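One thing to keep in mind about the model above: with beta = FALSE and gamma = FALSE, HoltWinters() reduces to simple exponential smoothing, so all 24 forecasts are the same value. The sketch below shows this on simulated hourly data (not the UCI file) and contrasts it with a seasonal fit. It sets frequency = 24 so that one period is one day, which is an assumption that differs from the 365*24 used above, and the alpha and gamma values in the seasonal fit are arbitrary.

```r
# Simulated hourly temperatures for 30 days with a daily cycle
set.seed(7)
hours <- 0:(24 * 30 - 1)
temp_sim <- 15 + 5 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 1)
temp_ts <- ts(temp_sim, frequency = 24) # one period = one day

# Simple exponential smoothing: no trend, no seasonality
ses_fit <- HoltWinters(temp_ts, beta = FALSE, gamma = FALSE)
ses_pred <- as.numeric(predict(ses_fit, n.ahead = 24))

# Additive daily seasonality, still no trend (fixed, arbitrary parameters)
seasonal_fit <- HoltWinters(temp_ts, alpha = 0.1, beta = FALSE, gamma = 0.1)
seasonal_pred <- as.numeric(predict(seasonal_fit, n.ahead = 24))

# Flat line versus a forecast that follows the daily cycle
print(diff(range(ses_pred)))
print(diff(range(seasonal_pred)))
```

If the goal is an hour-by-hour temperature curve for the next day, a model with a seasonal component is probably the better starting point.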

Fitting ARIMA model to multiple time series and storing forecast into a matrix

As it is large I can't dput it here. But suppose the realmatrix is a "mts" with non-trivial values
realmatrix <- matrix(NA, ncol = 100, nrow = 138)
In fact it stores 100 time series with length (rows) = 138 (from Jan 2005 to June 2016).
I want to store the Arima forecasts (12 months ahead: that is, from July 2016 to June 2017) in another matrix farimamatrix (which should have 12 rows and 100 columns), via the following loop:
farimamatrix <- matrix(NA, nrow = 12, ncol = 100)
m <- k <- list()
for (i in 1:100) {
try(m[[i]] <- Arima(realmatrix[,i], order = c(0,1,0), seasonal = c(1,0,1)))
k[[i]] <- forecast.Arima(m[[i]], h=12)
farimamatrix[,i] <- fitted(k[[i]])
}
But I am getting the following message:
Error in farimamatrix[, i] <- fitted(k[[i]]) :
incorrect number of subscripts on matrix
What's wrong? Thanks in advance.
Edited (24/10): updated / corrected under Zheyuan's answer and previous problem gone
Original data:
tsdata <-
structure(c(28220L, 27699L, 28445L, 29207L, 28482L, 28326L, 28322L,
28611L, 29187L, 29145L, 29288L, 29352L, 28881L, 29383L, 29898L,
29888L, 28925L, 29069L, 29114L, 29886L, 29917L, 30144L, 30531L,
30494L, 30700L, 30325L, 31313L, 32031L, 31383L, 30767L, 30500L,
31181L, 31736L, 32136L, 32654L, 32305L, 31856L, 31731L, 32119L,
31953L, 32300L, 31743L, 32150L, 33014L, 32964L, 33674L, 33410L,
31559L, 30667L, 30495L, 31978L, 32043L, 30945L, 30715L, 31325L,
32262L, 32717L, 33420L, 33617L, 34123L, 33362L, 33731L, 35118L,
35027L, 34298L, 34171L, 33851L, 34715L, 35184L, 35190L, 35079L,
35958L, 35875L, 35446L, 36352L, 36050L, 35567L, 35161L, 35419L,
36337L, 36967L, 36745L, 36370L, 36744L, 36303L, 36899L, 38621L,
37994L, 36809L, 36527L, 35916L, 37178L, 37661L, 37794L, 38642L,
37763L, 38367L, 38006L, 38442L, 38654L, 38345L, 37628L, 37698L,
38613L, 38525L, 39389L, 39920L, 39556L, 40280L, 41653L, 40269L,
39592L, 39100L, 37726L, 37867L, 38551L, 38895L, 40100L, 40950L,
39838L, 40643L, 40611L, 39611L, 39445L, 38059L, 37131L, 36697L,
37746L, 37733L, 39188L, 39127L, 38554L, 38219L, 38497L, 39165L,
40077L, 38370L, 37174L), .Dim = c(138L, 1L), .Dimnames = list(
NULL, "Data"), .Tsp = c(2005, 2016.41666666667, 12), class = "ts")
Code
library("forecast")
z <- stl(tsdata[, "Data"], s.window="periodic")
t <- z$time.series[,"trend"]
s <- z$time.series[,"seasonal"]
e <- z$time.series[,"remainder"]
# error matrix
ematrix <- matrix(rnorm(138 * 100, sd = 100), nrow = 138)
# generating a ts class error matrix
ematrixts <- ts(ematrix, start=c(2005,1), freq=12)
# combining the trend + season + error matrix into a real matrix
realmatrix <- t + s + ematrixts
# creating a (forecast) arima matrix
farimamatrix <- matrix(NA, ncol = 100, nrow = 12)
m <- k <- vector("list", length = 100)
for (i in 1:100) {
try(m[[i]] <- Arima(realmatrix[,i], order = c(0,1,0), seasonal = c(1,0,1)))
print(i)
k[[i]] <- forecast.Arima(m[[i]], h = 12)
farimamatrix[,i] <- k[[i]]$mean
}
# ts.plot(farimamatrix[,1:100],col = c(rep("gray",100),rep("red",1)))
The loop seems to work, but breaks down after a few iterations due to failure of Arima:
Error in stats::arima(x = x, order = order, seasonal = seasonal, include.mean = include.mean, : non-stationary seasonal AR part from CSS
Yep, the previous problem is gone, and now you have a new problem, regarding the failure of Arima. Strictly speaking you should raise a new question on this. But I will answer it here anyway.
The error message is quite illustrative. When you fit a model ARIMA(0,1,0)(1,0,1), sometimes the seasonal part is non-stationary, so a further seasonal differencing is needed.
Looking at ts.plot(realmatrix), I see that all 100 columns of realmatrix are pretty similar. I will thus take the first column for some analysis.
x <- realmatrix[,1]
Non-seasonal differencing is clearly a must, but do we need seasonal differencing as well? Check the ACF:
acf(diff(x))
We actually see strong evidence of a seasonal pattern. So yes, seasonal differencing is needed.
Now let's check the ACF after both differencing:
acf(diff(diff(x, lag = 12))) ## first do seasonal diff, then non-seasonal diff
There appears to be a negative spike at the seasonal lag, suggesting a seasonal MA process. So ARIMA(0,1,0)(0,1,1)[12] would be a good bet.
fit <- arima(x, order = c(0,1,0), seasonal = c(0,1,1))
Have a check at the residuals:
acf(fit$residuals)
I would be pretty happy with this result, as there is no lag 1 or even lag 2 autocorrelation at all, and there is also no seasonal autocorrelation. You can try adding a seasonal and/or non-seasonal AR(1) term, but there will be no improvement. So this is our final model.
So use the following loop:
farimamatrix <- matrix(NA, ncol = 100, nrow = 12)
m <- k <- vector("list", length = 100)
for (i in 1:100) {
m[[i]] <- Arima(realmatrix[,i], order = c(0,1,0), seasonal = c(0,1,1))
print(i)
k[[i]] <- forecast(m[[i]], h = 12) # call the generic; it dispatches on the Arima fit
farimamatrix[,i] <- k[[i]]$mean
}
Now all 100 model fits succeed.
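If some fits might still fail on other data, wrapping only the fit in try() leaves the forecast step to fail on a missing model. A minimal sketch of a more defensive loop follows, using base R's arima() and predict() instead of the forecast package (an assumption, to keep it self-contained) and a small simulated matrix of 5 series rather than the 100 above. A failed fit simply leaves its column as NA.

```r
# Simulated monthly data: common trend and seasonality plus independent noise
set.seed(2016)
n_obs <- 138
n_series <- 5
horizon <- 12
month_idx <- seq_len(n_obs)
base <- 30000 + 70 * month_idx + 800 * sin(2 * pi * month_idx / 12)
sim <- sapply(seq_len(n_series), function(i) base + rnorm(n_obs, sd = 100))
sim_ts <- ts(sim, start = c(2005, 1), frequency = 12)

# One column of forecasts per series
fc <- matrix(NA_real_, nrow = horizon, ncol = n_series)
for (i in seq_len(n_series)) {
  # tryCatch returns NULL instead of stopping the loop when a fit fails
  fit <- tryCatch(
    arima(sim_ts[, i], order = c(0, 1, 0),
          seasonal = list(order = c(0, 1, 1), period = 12)),
    error = function(e) NULL
  )
  if (!is.null(fit)) {
    fc[, i] <- as.numeric(predict(fit, n.ahead = horizon)$pred)
  }
}

print(dim(fc))
```

Columns that remain NA afterwards identify the series whose model needs a second look.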
---------
A retrospective reflection
Perhaps I should explain why the ARIMA(0,1,0)(1,0,1)[12] model worked for the simulated data in my initial answer. Note how I simulated that data:
seasonal <- rep_len(sin((1:12) * pi / 6), 138)
Yes, the underlying seasonal pattern is a true replication and of course stationary.

Time Series stl error

I have a code that goes like this:
rawdata=as.numeric(rawdata)
salesdata_bfr=rawdata[3:(maxcolnum-12)]
print(length(salesdata_bfr))
salesdata_ts=ts(salesdata_bfr, frequency = 12)
salesdata_stl=stl(salesdata_ts,s.window="periodic")
maxcolnum is equal to 38 and print(length(salesdata_bfr)) prints 24. But I get the error
Error in stl(salesdata_ts, s.window = "periodic") :
series is not periodic or has less than two periods
But I do have a vector of exactly two periods and I specified the frequency in ts(). Why won't it work?
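There is a problem with your code: stl() needs more than two full periods, so with frequency = 12 a series of exactly 24 observations is one too short. The following works smoothly: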
set.seed(42)
maxcolnum <- 38
rawdata <- rnorm(maxcolnum)
salesdata_bfr <- rawdata[3:(maxcolnum-11)]
length(salesdata_bfr)
# [1] 25 <-- it should be 25, not 24
salesdata_ts <- ts(salesdata_bfr, frequency = 12)
salesdata_stl <- stl(salesdata_ts,s.window="periodic")
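The length requirement can be checked directly. The sketch below uses random numbers in place of the sales data (an assumption) and captures the error message instead of stopping, so both cases can be compared side by side.

```r
set.seed(42)

# Exactly two periods (24 monthly values): stl() rejects this
two_periods <- ts(rnorm(24), frequency = 12)
result_24 <- tryCatch(
  stl(two_periods, s.window = "periodic"),
  error = function(e) conditionMessage(e)
)
print(result_24)
# [1] "series is not periodic or has less than two periods"

# One extra observation (25 values) is enough
result_25 <- stl(ts(rnorm(25), frequency = 12), s.window = "periodic")
print(class(result_25))
# [1] "stl"
```

In practice a decomposition fitted on barely more than two periods is fragile, so a longer history is preferable when it is available.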
