For the following time series data:
# 1. Dates at a 15-day interval:
dates = seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by = 15) # 96 observations
# 2. Water content in crops at those dates:
water <- c(0.5702722, 0.5631781, 0.5560839, 0.5555985, 0.5519783, 0.5463459,
0.5511598, 0.546652, 0.5361545, 0.530012, 0.5360571, 0.5396569,
0.5683526, 0.6031535, 0.6417821, 0.671358, 0.7015542, 0.7177007,
0.7103561, 0.7036985, 0.6958607, 0.6775161, 0.6545367, 0.6380155,
0.6113306, 0.5846186, 0.5561815, 0.5251135, 0.5085149, 0.495352,
0.485819, 0.4730029, 0.4686458, 0.4616468, 0.4613918, 0.4615532,
0.4827496, 0.5149105, 0.5447824, 0.5776764, 0.6090217, 0.6297454,
0.6399422, 0.6428941, 0.6586344, 0.6507473, 0.6290631, 0.6011123,
0.5744375, 0.5313527, 0.5008027, 0.4770338, 0.4564025, 0.4464508,
0.4309046, 0.4351668, 0.4490393, 0.4701232, 0.4911582, 0.5162941,
0.5490387, 0.5737573, 0.6031149, 0.6400073, 0.6770058, 0.7048311,
0.7255012, 0.739107, 0.7338938, 0.7265202, 0.6940718, 0.6757214,
0.6460862, 0.6163091, 0.5743775, 0.5450822, 0.5057753, 0.4715266,
0.4469859, 0.4303232, 0.4187793, 0.4119401, 0.4201316, 0.426369,
0.4419331, 0.4757525, 0.5070846, 0.5248457, 0.5607567, 0.5859825,
0.6107531, 0.6201754, 0.6356589, 0.6336177, 0.6275579, 0.6214981)
I want to fit a double-logistic function curve to the data.
I found some examples and packages that can be of help,
https://greenbrown.r-forge.r-project.org/man/FitDoubleLogElmore.html
and an example here: "Indexes overlap error when using dplyr to run a function".
However, the examples given only consider annual time series.
I have tried to fit the function as:
library(greenbrown)  # provides FitDoubleLogBeck
x <- ts(water, start = c(2016, 17), end = c(2020, 16), frequency = 24)
smooth.water <- FitDoubleLogBeck(x, weighting = TRUE, hessian = FALSE, plot = TRUE, ninit = 10)
plot(water)
plot(smooth.water$predicted)
plot(water - smooth.water$predicted)
However, this function does not seem to fit the entire time series. How can I run the function so that it fits the entire time series? Also, I noticed that the output differs from run to run, and I am not sure what causes that.
FitDoubleLogBeck can deal only with 1-year data, so you need to analyze the data year by year. To do so, take a one-year window and fit the data separately for each year.
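As a rough sketch of the windowing step, the complete years can be found and extracted with base R alone. The series below is a stand-in with the same layout as in the question (the values 1..96 are placeholders, not the water data), and the helper name split_complete_years is hypothetical.

```r
# Stand-in series with the same layout as in the question:
# 24 observations per year, starting at period 17 of 2016 (96 values).
x <- ts(seq_len(96), start = c(2016, 17), frequency = 24)

# Hypothetical helper: return one numeric vector per complete year.
split_complete_years <- function(x) {
  f   <- frequency(x)
  # zero-based position of each observation, counted from period 1 of the start year
  pos <- (start(x)[2] - 1) + seq_along(x) - 1
  yr  <- start(x)[1] + pos %/% f
  n_per_year <- table(yr)
  # keep only the years that have all f observations
  complete <- as.integer(names(n_per_year)[as.vector(n_per_year) == f])
  out <- lapply(complete, function(y) {
    as.vector(window(x, start = c(y, 1), end = c(y, f)))
  })
  names(out) <- complete
  out
}

years <- split_complete_years(x)
names(years)    # the years that have all 24 observations
lengths(years)  # number of values extracted per year
```

Each element of years could then be passed to the fitting function on its own; the partial years at both ends are dropped rather than padded.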
As for the different results between runs: the algorithm chooses its initial parameters randomly. The graph of a double-logistic curve is bell-shaped. However, you are applying the algorithm to "sine"-like data while it expects a "bell". It then treats the water data as a cloud of points, so the results are meaningless and very sensitive to the initial parameter settings.
Code:
library(greenbrown)
set.seed(123)  # make the random initial parameters reproducible
par(mfrow = c(1, 3))
# water vector taken from question above
x <- ts(water, start = c(2016, 17), end = c(2020, 16), frequency = 24)
# fit each complete year (2017-2019) separately
res <- sapply(2017:2019, function(year) {
  x2 <- as.vector(window(x, start = c(year, 1), end = c(year, 24)))
  smooth.water2 <- FitDoubleLogBeck(x2, weighting = TRUE, hessian = FALSE, plot = TRUE, ninit = 10)
  title(main = year)
  c(year = year, smooth.water2$params)
})
t(res)
Output:
     year         mn          mx       sos        rsp       eos        rau
[1,] 2017 -0.7709318  0.17234293 16.324163 -0.6133117  6.750885 -0.7618376
[2,] 2018 -0.8900971  0.09398673  7.529345  0.6701200 17.319465  0.8277409
[3,] 2019 -4.7669470 -0.34648434 15.930455 -0.2570877 10.690043 -0.2267284
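To make the "bell shape" remark concrete, here is a minimal sketch of a double-logistic curve in base R. It uses the same parameter names as the output above (mn, mx, sos, rsp, eos, rau) and the form commonly attributed to Beck et al. (2006). Whether this matches the package's internal parameterisation exactly is an assumption, and the parameter values are made up for illustration.

```r
# Double-logistic curve (assumed form):
#   f(t) = mn + (mx - mn) * ( 1 / (1 + exp(-rsp * (t - sos)))
#                           + 1 / (1 + exp( rau * (t - eos))) - 1 )
double_logistic <- function(t, mn, mx, sos, rsp, eos, rau) {
  rise <- 1 / (1 + exp(-rsp * (t - sos)))  # increase centred on sos
  fall <- 1 / (1 + exp( rau * (t - eos)))  # decrease centred on eos
  mn + (mx - mn) * (rise + fall - 1)
}

t <- 1:24  # 24 periods in one year, as in the question
y <- double_logistic(t, mn = 0.4, mx = 0.7, sos = 8, rsp = 1, eos = 18, rau = 1)
round(y, 3)
# plot(t, y, type = "b")  # uncomment to draw the curve
```

With these illustrative values the curve starts near mn, rises around sos, stays close to mx and falls again around eos, i.e. a single "bell" within one year. A multi-year series contains several such bells, which is why a single curve cannot describe it.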
I want to perform an expanding regression at monthly frequency using daily data. The model is:
ret = \beta_0 + \beta_1 X + \varepsilon
Sample data and my attempt:
library(zoo)
library(dplyr)  # for %>% and mutate
df = data.frame(date = seq(as.Date('2011-01-01'), as.Date('2011-03-31'), by = 1), ret = rnorm(90, 0, 1), X = rnorm(90, 0, 1))
roll = function(data, n = nrow(data)) {
  # expanding window: the i-th fit uses rows 1..i
  rollapplyr(1:n, 1:n, function(x) coef(lm(ret ~ X, data, subset = x))[[2]])
}
output = df %>%
  mutate(coefficient = roll(data.frame(ret, X)))
The code above runs an expanding regression row by row, and I could extract just the last value in each month to get the coefficients for that month (i.e., in this example, I only need the coefficients estimated on Jan-31, Feb-28 and Mar-31).
However, I need to apply this code to a large dataset, and to save time I want the regressions to run only on the last day of each month, in the expanding style (i.e., not every day). I'd appreciate it if someone could point out a way to improve the code.
Create a function coef2 that, given a yearmon object, computes the second regression coefficient (the slope) using all dates up to and including that year and month.
library(zoo)
coef2 <- function(ym, data) {
  coef(lm(ret ~ X, data, subset = as.yearmon(date) <= ym))[[2]]
}
yearmonth <- unique(as.yearmon(df$date))
data.frame(yearmonth, slope = sapply(yearmonth, coef2, data = df))
## yearmonth slope
## 1 Jan 2011 0.2208940
## 2 Feb 2011 0.1792896
## 3 Mar 2011 0.1180308
If performance is an issue, you could try this alternative version of coef2, which avoids the use of lm:
coef2 <- function(ym, data) {
  with(subset(data, as.yearmon(date) <= ym), cov(ret, X) / var(X))
}
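If even the repeated subsetting becomes a bottleneck, the same slope can be obtained from running sums, so that every expanding window is covered in a single pass. The sketch below is one possible variant, not part of the answer above. It assumes the rows are sorted by date, uses a base R month key (format(date, "%Y-%m")) in place of as.yearmon, and cross-checks the result against lm on the month-end windows.

```r
set.seed(1)
df <- data.frame(
  date = seq(as.Date("2011-01-01"), as.Date("2011-03-31"), by = 1),
  ret  = rnorm(90),
  X    = rnorm(90)
)

# last row of each calendar month (base R month key instead of zoo::as.yearmon)
month_key <- format(df$date, "%Y-%m")
last_row  <- tapply(seq_len(nrow(df)), month_key, max)
months    <- names(last_row)
last_row  <- as.integer(last_row)

# running sums give the OLS slope of every expanding window in one pass:
#   slope = (Sxy - Sx * Sy / n) / (Sxx - Sx^2 / n)
n   <- seq_len(nrow(df))
sx  <- cumsum(df$X)
sy  <- cumsum(df$ret)
sxy <- cumsum(df$X * df$ret)
sxx <- cumsum(df$X^2)
slope_all <- (sxy - sx * sy / n) / (sxx - sx^2 / n)  # first entry is NaN (one point)

# keep only the month-end values
slope_cum <- slope_all[last_row]

# cross-check against lm() on the same expanding windows
slope_lm <- vapply(last_row, function(k) {
  coef(lm(ret ~ X, data = df[seq_len(k), ]))[[2]]
}, numeric(1))

data.frame(month = months, slope_cum, slope_lm)
```

The two slope columns should agree up to floating-point error. For data with very large values, the running-sum formula can lose precision, in which case centring X and ret first is a reasonable precaution.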
The data is too large to dput here, but suppose realmatrix is an "mts" object with non-trivial values:
realmatrix <- matrix(NA, ncol = 100, nrow = 138)
In fact it stores 100 time series with length (rows) = 138 (from Jan 2005 to June 2016).
I want to store the Arima forecasts (12 months ahead: that is, from July 2016 to June 2017) in another matrix farimamatrix (which should have 12 rows and 100 columns), via the following loop:
farimamatrix <- matrix(NA, nrow = 12, ncol = 100)
m <- k <- list()
for (i in 1:100) {
try(m[[i]] <- Arima(realmatrix[,i], order = c(0,1,0), seasonal = c(1,0,1)))
k[[i]] <- forecast.Arima(m[[i]], h=12)
farimamatrix[,i] <- fitted(k[[i]])
}
But I am getting the following message:
Error in farimamatrix[, i] <- fitted(k[[i]]) :
incorrect number of subscripts on matrix
What's wrong? Thanks in advance.
Edit (24/10): updated and corrected following Zheyuan's answer; the previous problem is gone.
Original data:
tsdata <-
structure(c(28220L, 27699L, 28445L, 29207L, 28482L, 28326L, 28322L,
28611L, 29187L, 29145L, 29288L, 29352L, 28881L, 29383L, 29898L,
29888L, 28925L, 29069L, 29114L, 29886L, 29917L, 30144L, 30531L,
30494L, 30700L, 30325L, 31313L, 32031L, 31383L, 30767L, 30500L,
31181L, 31736L, 32136L, 32654L, 32305L, 31856L, 31731L, 32119L,
31953L, 32300L, 31743L, 32150L, 33014L, 32964L, 33674L, 33410L,
31559L, 30667L, 30495L, 31978L, 32043L, 30945L, 30715L, 31325L,
32262L, 32717L, 33420L, 33617L, 34123L, 33362L, 33731L, 35118L,
35027L, 34298L, 34171L, 33851L, 34715L, 35184L, 35190L, 35079L,
35958L, 35875L, 35446L, 36352L, 36050L, 35567L, 35161L, 35419L,
36337L, 36967L, 36745L, 36370L, 36744L, 36303L, 36899L, 38621L,
37994L, 36809L, 36527L, 35916L, 37178L, 37661L, 37794L, 38642L,
37763L, 38367L, 38006L, 38442L, 38654L, 38345L, 37628L, 37698L,
38613L, 38525L, 39389L, 39920L, 39556L, 40280L, 41653L, 40269L,
39592L, 39100L, 37726L, 37867L, 38551L, 38895L, 40100L, 40950L,
39838L, 40643L, 40611L, 39611L, 39445L, 38059L, 37131L, 36697L,
37746L, 37733L, 39188L, 39127L, 38554L, 38219L, 38497L, 39165L,
40077L, 38370L, 37174L), .Dim = c(138L, 1L), .Dimnames = list(
NULL, "Data"), .Tsp = c(2005, 2016.41666666667, 12), class = "ts")
Code
library("forecast")
z <- stl(tsdata[, "Data"], s.window="periodic")
t <- z$time.series[,"trend"]
s <- z$time.series[,"seasonal"]
e <- z$time.series[,"remainder"]
# error matrix
ematrix <- matrix(rnorm(138 * 100, sd = 100), nrow = 138)
# generating a ts class error matrix
ematrixts <- ts(ematrix, start=c(2005,1), freq=12)
# combining the trend + season + error matrix into a real matrix
realmatrix <- t + s + ematrixts
# creating a (forecast) arima matrix
farimamatrix <- matrix(NA, ncol = 100, nrow = 12)
m <- k <- vector("list", length = 100)
for (i in 1:100) {
try(m[[i]] <- Arima(realmatrix[,i], order = c(0,1,0), seasonal = c(1,0,1)))
print(i)
k[[i]] <- forecast.Arima(m[[i]], h = 12)
farimamatrix[,i] <- k[[i]]$mean
}
# ts.plot(farimamatrix[,1:100],col = c(rep("gray",100),rep("red",1)))
The loop seems to work, but breaks down after a few iterations due to failure of Arima:
Error in stats::arima(x = x, order = order, seasonal = seasonal, include.mean = include.mean, :
non-stationary seasonal AR part from CSS
Yep, the previous problem is gone, and now you have a new problem, regarding the failure of Arima. Strictly speaking you should raise a new question on this. But I will answer it here anyway.
The error message is quite informative. When you fit an ARIMA(0,1,0)(1,0,1)[12] model, the estimated seasonal AR part is sometimes non-stationary, which suggests that seasonal differencing is needed.
By looking at ts.plot(realmatrix), I see that all 100 columns of realmatrix are pretty similar, so I will take the first column for some analysis.
x <- realmatrix[,1]
Obviously non-seasonal differencing is a must, but do we need seasonal differencing as well? Check with the ACF:
acf(diff(x))
The ACF shows strong evidence of a seasonal pattern, so yes, seasonal differencing is needed.
Now let's check the ACF after both differencing:
acf(diff(diff(x, lag = 12))) ## first do seasonal diff, then non-seasonal diff
There appears to be a negative spike at the seasonal lag, suggesting a seasonal MA process. So ARIMA(0,1,0)(0,1,1)[12] would be a good bet.
fit <- arima(x, order = c(0,1,0), seasonal = c(0,1,1))
Check the residuals:
acf(fit$residuals)
I would actually be pretty happy with this result: there is no lag 1 or even lag 2 autocorrelation, and no seasonal autocorrelation either. You can try adding a seasonal and / or non-seasonal AR(1) term, but there will be no improvement. So this is our final model.
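The ACF checks above can also be read off numerically instead of from the plots. The following sketch does this on simulated data (a random walk plus an exactly periodic monthly pattern), which is an assumption made for illustration and not the data from the question. The helper acf_at is hypothetical.

```r
set.seed(1)
n <- 138
seasonal <- rep_len(10 * sin((1:12) * pi / 6), n)  # exactly periodic pattern
trend    <- cumsum(rnorm(n))                        # random-walk trend
x <- ts(trend + seasonal, start = c(2005, 1), frequency = 12)

# Hypothetical helper: sample autocorrelation at a given lag.
# acf() stores lag 0 first, so lag k sits at position k + 1.
acf_at <- function(y, lag) {
  acf(y, lag.max = 24, plot = FALSE)$acf[lag + 1]
}

# after non-seasonal differencing only: the seasonal pattern is still there,
# so the lag-12 autocorrelation is expected to be strongly positive
acf_at(diff(x), 12)

# after seasonal and non-seasonal differencing: the pattern is removed and a
# negative value at lag 12 points towards a seasonal MA term
acf_at(diff(diff(x, lag = 12)), 12)
```

In this simulation the doubly differenced series is just the difference of the random-walk innovations 12 months apart, which is why a negative lag-12 autocorrelation is expected.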
So use the following loop:
farimamatrix <- matrix(NA, ncol = 100, nrow = 12)
m <- k <- vector("list", length = 100)
for (i in 1:100) {
  m[[i]] <- Arima(realmatrix[,i], order = c(0,1,0), seasonal = c(0,1,1))
  print(i)
  # forecast() dispatches to the Arima method; recent versions of the
  # forecast package no longer export forecast.Arima directly
  k[[i]] <- forecast(m[[i]], h = 12)
  farimamatrix[,i] <- k[[i]]$mean
}
Now all 100 model fits are successful.
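If some fits could still fail on other data, one way to keep the loop running is to wrap each fit in tryCatch and store NA forecasts for the failed columns. The sketch below is only an illustration under stated assumptions: it uses stats::arima() and predict() instead of the forecast package's Arima() and forecast() so that it has no dependencies, the data are simulated, and the helper name safe_forecast is hypothetical.

```r
set.seed(2)
n <- 138; n_series <- 5; h <- 12
seasonal <- rep_len(10 * sin((1:12) * pi / 6), n)
sim <- sapply(seq_len(n_series), function(i) cumsum(rnorm(n)) + seasonal)
sim <- ts(sim, start = c(2005, 1), frequency = 12)

# Hypothetical helper: forecast one series, or return NAs if the fit fails.
safe_forecast <- function(y, h = 12) {
  tryCatch({
    fit <- arima(y, order = c(0, 1, 0),
                 seasonal = list(order = c(0, 1, 1), period = 12))
    as.numeric(predict(fit, n.ahead = h)$pred)
  }, error = function(e) {
    # report the problem and carry on with the next series
    message("fit failed: ", conditionMessage(e))
    rep(NA_real_, h)
  })
}

fc <- matrix(NA_real_, nrow = h, ncol = n_series)
for (i in seq_len(n_series)) {
  fc[, i] <- safe_forecast(sim[, i], h)
}

failed <- which(colSums(is.na(fc)) > 0)  # columns whose fit raised an error
failed
```

Unlike a bare try(), this never leaves a list slot empty, so the later steps do not trip over a missing model; the failed columns can be inspected or refitted afterwards.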
---------
A retrospective reflection
Perhaps I should explain why the ARIMA(0,1,0)(1,0,1)[12] model works for the simulated data in my initial answer. Note how I simulated the data:
seasonal <- rep_len(sin((1:12) * pi / 6), 138)
Yes, the underlying seasonal pattern is an exact replication and, of course, stationary.
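A quick way to see what "exact replication" means here is to check that the simulated pattern repeats every 12 observations. This is a small sketch using only the line of code quoted above.

```r
seasonal <- rep_len(sin((1:12) * pi / 6), 138)

# every value equals the one 12 steps earlier,
# so the seasonal difference is exactly zero
all(diff(seasonal, lag = 12) == 0)

# the 12 monthly levels sum to zero up to floating-point error,
# so the pattern does not drift over time
sum(seasonal[1:12])
```

A pattern estimated from real data, such as the seasonal component in the question, need not behave this regularly once noise is added, which is consistent with the seasonal AR fit failing there.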