Specify seasonal ARIMA in R

I am having some forecast::Arima syntax issues. If I know that a seasonal ARIMA specification is statistically OK because it is the result of auto.arima, how can I fix the following Arima call so that it has the same order as the auto.arima result?
library(forecast)
set.seed(1)
y <- sin((1:40)) * 10 + 20 + rnorm(40, 0, 2)
my_ts <- ts(y, start = c(2000, 1), freq = 12)
fit_auto <- auto.arima(my_ts, max.order = 2)
plot(forecast(fit_auto, h = 24))
# Arima(0,0,1)(1,0,0) with non-zero mean
fit_arima <- Arima(my_ts,
                   order = c(0, 0, 1),
                   seasonal = list(c(1, 0, 0)))
#Error in if ((order[2] + seasonal$order[2]) > 1 & include.drift) { :
# argument is of length zero
Thx & kind regards

The seasonal argument to Arima must be either a numeric vector giving the seasonal order, or a list with two named elements: order, a numeric vector giving the seasonal order, and period, an integer giving the seasonal periodicity.
You gave a list containing only the seasonal order, so Arima complains that it cannot find the period value. If you pass a plain numeric vector instead, period defaults to frequency(my_ts), as the function's documentation says. It would be reasonable for the bare numeric vector and the single-element list to behave the same way, but they don't; that is just a quirk of this function.
A rewrite of your call that works:
fit_arima <- Arima(my_ts,
                   order = c(0, 0, 1),
                   seasonal = c(1, 0, 0)) # vector, not a list
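For completeness, the documented list form works too once both named elements are supplied; a quick sketch (period = 12 matches frequency(my_ts) here), plus a check against the auto.arima fit:

fit_arima2 <- Arima(my_ts,
                    order = c(0, 0, 1),
                    seasonal = list(order = c(1, 0, 0), period = 12))
all.equal(coef(fit_auto), coef(fit_arima2)) # should agree up to numerical noise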

Related

R: forecast::accuracy() Vs Metrics::accuracy() Functions Results Not the Same

I am testing the RMSE of a forecast and observed that forecast::accuracy()[2] and Metrics::accuracy() do not give the same result. In fact, the latter is even 0:
set.seed(289805)
ts1 <- arima.sim(n = 10, model = list(ar = 0.8, order = c(1, 0, 0)), sd = 1) # the series I want to forecast
train_ts1 <- head(ts1, length(ts1) - 2) # the part of the series I want to project into the future
test_ts1 <- tail(ts1, length(ts1) - length(train_ts1)) # the part of the series I want to compare my forecast with
set.seed(837530)
ts2 <- arima.sim(n = 10, model = list(ma = 0.8, order = c(0, 0, 1)), sd = 1) # the second series, part of which I want to train on
train_ts2 <- head(ts2, length(ts2) - 2) # training part of the second series
test_ts2 <- tail(ts2, length(ts2) - length(train_ts2)) # do not seem to need this part of the series
# my forecast using the best model from the training set of the second series
fcast <- forecast::forecast(train_ts1, model = forecast::auto.arima(train_ts2), h = 2)$mean
forecast::accuracy(fcast, test_ts1)[2] # RMSE for the forecast
# [1] 0.6412488
Metrics::accuracy(test_ts1, fcast)
# [1] 0
Please what am I doing wrong?
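In case it helps: in the Metrics package, accuracy(actual, predicted) is classification accuracy, i.e. the proportion of exact matches mean(actual == predicted), so a value of 0 is expected for a continuous forecast. The RMSE counterpart there is Metrics::rmse(); a quick sketch using the objects above:

forecast::accuracy(fcast, test_ts1)[2] # RMSE from the forecast package
Metrics::rmse(test_ts1, fcast) # should match the RMSE above
Metrics::accuracy(test_ts1, fcast) # proportion of exact matches, hence 0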

Time series in R: How to set week period across two years?

I have a time series with a weekly period (7 days) spanning two years: 58 values, starting 2017-08-05 and ending 2018-09-08. I need to work with this time series in R, e.g. to make predictions with a SARIMA model. But I have a problem defining the period/frequency in R. When I use the decompose function I get the error "time series has no or less than 2 periods", and the Arima function does not work properly either. Detailed information is below. How can I import my data so that R uses the requested frequency?
My data (short example):
File: sessions2.csv
date count
11.11.2017 55053
18.11.2017 45256
25.11.2017 59091
2.12.2017 50030
9.12.2017 41769
16.12.2017 63042
23.12.2017 51838
30.12.2017 47652
6.1.2018 18731
13.1.2018 54470
20.1.2018 22514
27.1.2018 63818
3.2.2018 51605
10.2.2018 26312
17.2.2018 11111
data1.csv contains only values. For example:
53053
45256
59091
50045
41769
65042
51838
I tried in R:
library(lubridate) # provides ymd() and decimal_date()
sessions1 <- scan("data1.csv")
sessionsTS <- ts(sessions1, frequency=52, start=decimal_date(ymd("2017-11-11")))
The output of sessionsTS, and the errors:
> sessionsTS
Time Series:
Start = 2017.59178082192
End = 2018.68418328598
Frequency = 52
What time format do these numbers (Start, End) represent, please? And how can I convert to and from this decimal date format?
> sessionsComponents <- decompose(sessionsTS)
Error in decompose(sessionsTS) :
time series has no or less than 2 periods
> arima(sessionsTS, order = c(0, 1, 0), seasonal = list(order = c(2, 0, 0), period = 52), xreg = NULL, include.mean = TRUE)
Error in optim(init[mask], armaCSS, method = optim.method, hessian = FALSE, :
initial value in 'vmmin' is not finite
> fit <- Arima(sessionsTS, order = c(0, 1, 0), seasonal = list(order = c(2, 0, 0), period = 52))
Error in optim(init[mask], armaCSS, method = optim.method, hessian = FALSE, :
initial value in 'vmmin' is not finite
> sarima(sessionsTS,1,1,0,2,0,0,52)
Error in sarima(sessionsTS, 1, 1, 0, 2, 0, 0, 52) :
unused arguments (0, 0, 52)
Next I tried:
dataSeries <- read.table("sessions2.csv", header=TRUE, sep = ";", row.names=1)
dataTS <- as.xts(dataSeries , frequency=52, start=decimal_date(ymd("2017-11-11")))
> sessionsComponents2 <- decompose(dataTS)
Error in decompose(dataTS) : time series has no or less than 2 periods
> model = Arima(dataTS, order=c(0,1,0), seasonal = c(2,0,0))
> model
Series: dataTS
ARIMA(0,1,0)
In this case Arima is used without seasonality...
Many thanks for your help.
Your data is sampled weekly, so if the period is also one week you need to set frequency=1, but at that point there is no point in doing seasonal modelling. It makes sense to use a yearly period, as you have done by setting frequency=52, but then you don't have enough periods for any estimation: you would need at least 104 observations, i.e. at least two full periods, as the error message explains.
So, in short, you can't do what you want to do unless you get more data.
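To see the two-period threshold in action, here is a minimal sketch with simulated data: 104 weekly observations, i.e. exactly two full years, is enough for decompose() to run.

x <- ts(rnorm(104), frequency = 52) # 2 x 52 observations of random noise
comp <- decompose(x) # no "less than 2 periods" error with two full periods
plot(comp)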
A partial answer for your questions about ts() and the time format. If you do it like this:
tt <- read.table(text="
date count
11.11.2017 55053
18.11.2017 45256
25.11.2017 59091
2.12.2017 50030
9.12.2017 41769
16.12.2017 63042
23.12.2017 51838
30.12.2017 47652
6.1.2018 18731
13.1.2018 54470
20.1.2018 22514
27.1.2018 63818
3.2.2018 51605
10.2.2018 26312
17.2.2018 11111", header=TRUE)
tt$date <- as.Date(tt$date, format="%d.%m.%Y")
ts(tt$count, frequency=52, start=c(2017, 45))
# Time Series:
# Start = c(2017, 45)
# End = c(2018, 7)
# Frequency = 52
# [1] 55053 45256 59091 50030 41769 63042 51838 47652 18731
# 54470 22514 63818 51605 26312 11111
The start is at the 45th week of 2017, and the end is at the 7th week of 2018.
You can find the week numbers using format(tt$date, "%W"). Look at ?strptime for more details and to see what %W means.
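For instance, with the dates above this gives (note that %W numbers weeks starting on Monday):

format(tt$date, "%W")
# [1] "45" "46" "47" "48" "49" "50" "51" "52" "01" "02" "03" "04" "05" "06" "07"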

Applying 'clustering functions' to a series of linear models

I want to iterate over a list of linear models and apply "clustered" standard errors to each model using the vcovCL function. My goal is to do this as efficiently as possible (I am running a linear model across many columns of a dataframe). My problem is trying to specify additional arguments inside of the anonymous function. Below I simulate some fake data. Precincts represent my cross-sectional dimension; months represent my time dimension (5 units observed across 4 months). The variable int is a dummy for when an intervention takes place.
library(sandwich) # provides vcovCL()

df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
df[1:10, ]
outcome <- df[3]
est <- lapply(outcome, FUN = function(x) {
  lm(x ~ as.factor(precinct) + as.factor(month) + int, data = df)
})
se <- lapply(est, function(x) {
  sqrt(diag(vcovCL(x, cluster = ~ precinct + month)))
})
I receive the following error message when adding the cluster argument inside of the vcovCL function.
Error in eval(expr, envir, enclos) : object 'x' not found
The only way around it, as far as I can tell, would be to index the dataframe, i.e. df$, and then specify the 'clustering' variables. Could this be achieved by passing an additional argument for df inside the function call? And is this code efficient?
Maybe specifying the model equation via a formula is a better way to go, I suppose.
Any thoughts/comments are always helpful :)
Here is one approach that would retrieve clustered standard errors for multiple models:
library(sandwich)
# I am going to use the same model three times to get the "sequence" of linear models.
mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)
# define function to retrieve standard errors:
robust_se <- function(mod) {
  sqrt(diag(vcovCL(mod, cluster = list(df$precinct, df$month))))
}
# apply function to all models:
se <- lapply(list(mod, mod, mod), robust_se)
If you want to get the entire output adjusted, the following might be helpful:
library(lmtest)
adj_stats <- function(mod) {
  coeftest(mod, vcovCL(mod, cluster = list(df$precinct, df$month)))
}
adjusted_models <- lapply(list(mod, mod, mod), adj_stats)
To address the multiple column issue:
In case you are struggling with running linear models over several columns, the following might be helpful. All the above would stay the same, except that you are passing your list of models to lapply.
First, let's use this dataframe here:
df <- data.frame(
  precinct = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  crime2 = rnorm(20, 10, 5),
  crime3 = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)
Let's define the outcome columns:
outcome_columns <- c("crime", "crime2", "crime3")
Now, let's run a regression with each outcome:
models <- lapply(outcome_columns, function(outcome) {
  lm(eval(parse(text = paste0(outcome, " ~ as.factor(precinct) + as.factor(month) + int"))),
     data = df)
})
And then you would just call
adjusted_models <- lapply(models, adj_stats)
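As an aside, if you would rather avoid eval(parse(...)), the same list of models can be built with reformulate(), which constructs a formula from character vectors; a sketch that should be equivalent to the lapply call above:

models <- lapply(outcome_columns, function(y) {
  # build e.g. crime ~ as.factor(precinct) + as.factor(month) + int
  f <- reformulate(c("as.factor(precinct)", "as.factor(month)", "int"), response = y)
  lm(f, data = df)
})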
Regarding efficiency:
The above code is efficient in the sense that it is easily adjustable and quick to write, and for most use cases it will be perfectly fine. For computational efficiency, note that your design matrix is the same in all cases, so by precomputing the common elements (e.g. (X'X)^(-1) X') you could save some computation. You would, however, lose the convenience of the many built-in functions.
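If you do want to exploit the shared design matrix in base R, one option is a multivariate fit: lm() accepts a matrix response and factorises the design matrix only once. A sketch (vcovCL would still have to be applied per outcome, so this mainly speeds up coefficient estimation):

# one decomposition of X, three sets of coefficients
mod_all <- lm(cbind(crime, crime2, crime3) ~ as.factor(precinct) + as.factor(month) + int,
              data = df)
coef(mod_all) # one column of coefficients per outcome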

Package msm: segmentation-fault when introducing covariates

While using the package msm, I am currently getting the error:
* caught segfault * address 0x7f875be5ff48, cause 'memory not mapped'
when I introduce a covariate to my model. Previously, I had resolved a similar error by converting my response variable from a factor to a numeric variable. This however does not resolve my current issue.
The data <- https://www.dropbox.com/s/wx6s4liofaxur0v/data_msm.txt?dl=0
library(msm)
#number of transitions between states
#1: healthy; 2: ill; 3: dead; 4: censor
statetable.msm(state_2, id, data=dat.long)
#setting initial values
q <- rbind(c(0, 0.25, 0.25), c(0.25, 0, 0.25), c(0, 0, 0))
crudeinits <- crudeinits.msm(state_2 ~ time, subject=id, data=dat.long, qmatrix=q, censor = 4, censor.states = c(1,2))
#running model without covariates
(fm1.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, death = 3, censor = 4, censor.states = c(1,2)))
#running model with covariates
(fm2.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, covariates = ~ gender, death = 3, censor = 4, censor.states = c(1,2)))
Alternatively, I can run the models with covariates if I set the state values dead and censor (3 & 4) to missing.
#set death and censor to missing
dat.long$state_2[dat.long$state_2 %in% c(3,4)] <- NA
statetable.msm(state_2, id, data=dat.long)
#setting initial values
q <- rbind(c(0, 0.5), c(0.5, 0))
crudeinits <- crudeinits.msm(state_2 ~ time, subject=id, data=dat.long, qmatrix=q)
#running models with covariates
(fm3.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, covariates = ~ gender))
(fm4.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits, data = dat.long, covariates = ~ covar))
Thanks for your help
In version 1.5 of msm, there's an error in the R code that detects and drops NAs in the data. This is triggered when there are covariates, and the state or time variable contains NAs. Those NAs can then be passed through to the C code that computes the likelihood, causing a crash. I'll fix this for the next version. In the meantime you can work around it by dropping NAs from the data before calling msm.
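A minimal sketch of that workaround, assuming the NAs sit in the state or time columns of dat.long:

# keep only rows where both state and time are observed
dat.complete <- dat.long[complete.cases(dat.long[, c("state_2", "time")]), ]
fm2.msm <- msm(state_2 ~ time, subject = id, qmatrix = crudeinits,
               data = dat.complete, covariates = ~ gender,
               death = 3, censor = 4, censor.states = c(1, 2))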

Functional Regression without intercept in R

I am doing a functional regression in R (package fda) and am supposed to eliminate the intercept term. But the fda package does not seem to support such a formula.
Here is what I wish to do:
fit.fd <- fRegress(Acc.fd~Velo.fd - 1)
where Acc.fd and Velo.fd are two functional objects in the package fda. But it is no different from:
fit.fd <- fRegress(Acc.fd~Velo.fd)
Since the result is deeply nested, I am adding an example so the code can be run on a small scale and the details of the result can be inspected.
library(fda)
list3d <- array(0, c(10, 5, 2))
# The data is 5 functions each evaluated at 10 points
# Indep variable
list3d[, , 2] <- matrix(rnorm(50, 0, 1), 10, 5)
# Response variable
list3d[, , 1] <- matrix(rnorm(50, 0, 0.1) , 10, 5)+list3d[, , 2] ^ 2
dimnames(list3d)[[1]] <- seq(0,9)
time.range <- c(0, 9)
time.basis <- create.fourier.basis(time.range, nbasis = 3)
lfd <- vec2Lfd(c(0, (2*pi/20)^2, 0), rangeval = time.range)
time.lfd <- smooth.basisPar(seq(0, 9), list3d, time.basis, Lfdobj = lfd, lambda = 0.01)$fd
Acc.fd <- time.lfd[, 1]
Velo.fd <- time.lfd[, 2]
# expecting a fit without an intercept here
fit.fd <- fRegress(Acc.fd ~ Velo.fd - 1)
# evaluation grid for the plots below
plotpoints <- seq(0, 9, length.out = 100)
# plot of the coefficient function
plot(plotpoints, eval.fd(plotpoints, fit.fd$betaestlist$Velo.fd$fd))
# plot of the intercept function, which I wish to constrain to zero
plot(plotpoints, eval.fd(plotpoints, fit.fd$betaestlist$const$fd))
# compare with a regular functional regression with no restriction
fit.fd <- fRegress(Acc.fd ~ Velo.fd)
plot(plotpoints, eval.fd(plotpoints, fit.fd$betaestlist$Velo.fd$fd))
So the no-intercept option does not work the same way as in lm? Could anyone help me out here? Many thanks!
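One direction that might help, though I have not verified it on this data: fRegress appears to ignore the - 1 in the formula interface, but its list interface lets you specify the regressors explicitly, so you can simply leave out the constant term. A sketch, assuming the objects defined above:

# no-intercept fit via the list interface: only the functional covariate, no constant
xfdlist <- list(Velo.fd)
betalist <- list(fdPar(time.basis)) # one coefficient function for the one regressor
fit.noint <- fRegress(Acc.fd, xfdlist, betalist)
plot(fit.noint$betaestlist[[1]]$fd) # the slope function; no intercept is estimated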
