Daily timeseries forecasting, with weekly and annual cycle - r

My aim is to forecast the daily number of registrations in two different channels.
Weekly seasonality is quite strong, especially at weekends, and I also observe annual effects. Moreover, I have a few special event days which differ significantly from the other days.
First, I applied a TBATS model on these two channels.
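library(forecast)
# daily series with weekly (7) and annual (365.25) seasonal periods
x.msts <- msts(Channel1_reg, seasonal.periods=c(7,365.25))
# fit model
fit <- tbats(x.msts)
fit
plot(fit)
# forecast the next 30 days
forecast_channel1 <- forecast(fit, h=30)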
First channel:
TBATS(0, {2,3}, -, {<7,3>, <365.25,2>})
Call: tbats(y = x.msts)
Parameters
Lambda: 0
Alpha: 0.0001804516
Gamma-1 Values: -1.517954e-05 1.004701e-05
Gamma-2 Values: -3.059654e-06 -2.796211e-05
AR coefficients: 0.249944 0.544593
MA coefficients: 0.215696 -0.361379 -0.21082
Second channel:
BATS(0, {2,2}, 0.929, -)
Call: tbats(y = y.msts)
Parameters
Lambda: 0
Alpha: 0.1652762
Beta: -0.008057904
Damping Parameter: 0.928972
AR coefficients: -0.586163 -0.676921
MA coefficients: 0.924758 0.743675
If I forecast the second channel, I only get blank values instead of forecasts.
Could you please help me understand why that is?
Do you have any suggestions on how to build the specific event days into this model?
Thank you all!

tbats and bats are occasionally unstable, and your second model is showing infinite forecasts. There are already some bug reports about similar issues.
In any case, as you want to use event information, you would be better building a harmonic regression model with ARMA errors.
For example, suppose your event information is recorded as a dummy variable event1. Then the model can be fitted as follows:
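# Fourier terms for the weekly and annual periods (2 harmonics each)
harmonics <- fourier(x.msts, K=c(2,2))
fit1 <- auto.arima(x.msts, lambda=0,
  xreg=cbind(harmonics,event1), seasonal=FALSE)
# fourier(..., h=200) replaces the deprecated fourierf(); the event column
# is named so that it matches the regressor used when fitting
f1 <- forecast(fit1,
  xreg=cbind(fourier(x.msts, K=c(2,2), h=200), event1=rep(0,200)))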
This assumes that the event will not occur in the next 200 days (hence the 200 0s). I have used harmonics of order 2 for both weeks and years. Adjust these to minimize the AICc of the model.
This model is actually very similar to the TBATS model you are fitting except that the lambda value has been specified rather than estimated, and the seasonality is fixed over time rather than being allowed to evolve. The advantage is that the harmonic regression model tends to be more stable, and allows covariates to be included.
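To make "adjust these to minimize the AICc" concrete, here is a rough sketch of a small grid search over the number of harmonics. The data are simulated purely for illustration (the series, the event dates and the grid limits are my assumptions, not the asker's data), so treat it as a starting point rather than a recipe.

```r
library(forecast)

set.seed(42)
n <- 2 * 365
t <- seq_len(n)
# Simulated positive daily series with weekly and annual cycles (illustrative only)
y <- exp(5 + 0.3 * sin(2 * pi * t / 7) + 0.2 * cos(2 * pi * t / 365.25) +
           rnorm(n, sd = 0.1))
x.msts <- msts(y, seasonal.periods = c(7, 365.25))
# Hypothetical event dummy: 1 on three arbitrary event days, 0 otherwise
event1 <- as.numeric(t %in% c(100, 300, 465))

best <- list(aicc = Inf, K = NULL, fit = NULL)
for (k7 in 1:3) {        # K for period 7 cannot exceed 3
  for (k365 in 1:2) {    # small grid for the annual period; widen as needed
    harmonics <- fourier(x.msts, K = c(k7, k365))
    fit <- auto.arima(x.msts, lambda = 0,
                      xreg = cbind(harmonics, event1), seasonal = FALSE)
    if (fit$aicc < best$aicc) {
      best <- list(aicc = fit$aicc, K = c(k7, k365), fit = fit)
    }
  }
}
best$K     # harmonics (weekly, annual) with the lowest AICc
best$aicc
```

The same lambda is used for every candidate, so the AICc values are computed on the same transformed data and can be compared across the grid.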

Related

Is there a way to include an autocorrelation structure in the gam function of mgcv?

I am building a model using the mgcv package in r. The data has serial measures (data collected during scans 15 minutes apart in time, but discontinuously, e.g. there might be 5 consecutive scans on one day, and then none until the next day, etc.). The model has a binomial response, a random effect of day, a fixed effect, and three smooth effects. My understanding is that REML is the best fitting method for binomial models, but that this method cannot be specified using the gamm function for a binomial model. Thus, I am using the gam function, to allow for the use of REML fitting. When I fit the model, I am left with residual autocorrelation at a lag of 2 (i.e. at 30 minutes), assessed using ACF and PACF plots.
So, we wanted to include an autocorrelation structure in the model, but my understanding is that only the gamm function and not the gam function allows for the inclusion of such structures. I am wondering if there is anything I am missing and/or if there is a way to deal with autocorrelation with a binomial response variable in a GAMM built in mgcv.
My current model structure looks like:
gam(Response ~
s(Day, bs = "re") +
s(SmoothVar1, bs = "cs") +
s(SmoothVar2, bs = "cs") +
s(SmoothVar3, bs = "cs") +
as.factor(FixedVar),
family=binomial(link="logit"), method = "REML",
data = dat)
I tried thinning my data (using only every 3rd data point from consecutive scans), but found this overly restrictive to allow effects to be detected due to my relatively small sample size (only 42 data points left after thinning).
I also tried using the prior value of the binomial response variable as a factor in the model to account for the autocorrelation. This did appear to resolve the residual autocorrelation (based on the updated ACF/PACF plots), but it doesn't feel like the most elegant way to do so and I worry this added variable might be adjusting for more than just the autocorrelation (though it was not collinear with the other explanatory variables; VIF < 2).
I would use bam() for this. You don't need to have big data to fit a model with bam(); you just lose some of the guarantees about convergence that you get with gam(). bam() will fit a GEE-like model with an AR(1) working correlation matrix, but you need to specify the AR parameter via rho. This only works for non-Gaussian families if you also set discrete = TRUE when fitting the model.
You could use gamm() with family = binomial() but this uses PQL to estimate the GLMM version of the GAMM and if your binomial counts are low this method isn't very good.
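A minimal sketch of what the bam() call could look like, on simulated data with the same shape as the question (variable names follow the question; the data, the single smooth and rho = 0.3 are placeholders I made up). Because the scans are discontinuous, I am assuming you would also want AR.start, which marks the first observation of each independent run so the AR(1) structure does not carry over between days.

```r
library(mgcv)

set.seed(1)
n <- 200
# Simulated stand-in for the real data: 20 days with 10 consecutive scans each
dat <- data.frame(
  Day        = factor(rep(1:20, each = 10)),
  SmoothVar1 = runif(n),
  FixedVar   = factor(sample(c("a", "b"), n, replace = TRUE))
)
dat$Response <- rbinom(n, 1, plogis(-0.5 + sin(2 * pi * dat$SmoothVar1)))

# TRUE at the first scan of each day, i.e. the start of each AR(1) block.
# Rows must be in time order within each block.
dat$ar_start <- !duplicated(dat$Day)

fit <- bam(Response ~ s(Day, bs = "re") + s(SmoothVar1, bs = "cs") + FixedVar,
           family = binomial(link = "logit"), data = dat,
           method = "fREML",      # the fitting method used with discrete = TRUE
           discrete = TRUE,       # needed for rho with non-Gaussian families
           rho = 0.3,             # placeholder AR(1) parameter
           AR.start = dat$ar_start)
summary(fit)
```

One common heuristic for choosing rho is to fit the model without it first and use the lag-1 autocorrelation of the residuals as a starting value, then check the ACF again.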

How can i get predictions with CI from lmerTest models?

We are currently working with plant phenology.
We built a linear mixed model for each species present in the study area.
We set Days From Snowmelt (the number of days from snowmelt to the visit day over the summer) as the response variable and mean phenology as the explanatory variable. Mean phenology is the mean phenological state of each plot (there are 3 in each locality), calculated from the 12 subplots into which each plot is divided; it ranges from 1 to 6, and the higher the number, the more advanced the cycle. Year and plot nested within locality are set as random factors.
Once the model is built and revised, we want to predict the days from snowmelt for each species to achieve the phenological phases of interest, which happen to have a mean of 2, 3, 4, and 5. (corresponding to vegetative, flowering, fruit development and dispersion, respectively)
I have tried the function predict() but I get no heterogeneity between phases for each species, the progression seems to be linear (as shown in the image file).
Could this be just because is a linear model so will it only give linear responses? Are there any other ways to get predictions from these kinds of models and show their CI?
How can i get predictions with CI from lmerTest models?
I think you probably mean prediction intervals. You can use the predictInterval function in the merTools package. For example:
library(lmerTest); library(merTools)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
head(predictInterval(fm1, level = 0.95, seed = 123, n.sims = 100))
Could this be just because is a linear model so will it only give linear responses?
Yes! If you fit a linear model, then the predictions will be linear. Of course, you can model nonlinearity with a linear model in several ways, including transformations, nonlinear terms (the model is still linear in the parameters) and splines.
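As a quick illustration of the "nonlinear terms" option, here is a sketch on the same sleepstudy data with a quadratic term added. The quadratic is only an example of the idea, not a recommendation for the phenology data.

```r
library(lme4)

# Straight-line fixed effect: population-level predictions are equally spaced
fm_lin  <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
# Quadratic term: still linear in the parameters, but the fitted curve can bend
fm_quad <- lmer(Reaction ~ Days + I(Days^2) + (Days | Subject), data = sleepstudy)

newdat <- data.frame(Days = 0:9)
# re.form = NA gives population-level predictions (random effects set to zero)
p_lin  <- predict(fm_lin,  newdata = newdat, re.form = NA)
p_quad <- predict(fm_quad, newdata = newdat, re.form = NA)

# Second differences are zero for a straight line and nonzero for a curve
diff(p_lin,  differences = 2)
diff(p_quad, differences = 2)
```

With a straight-line model, the predicted gaps between phases 2, 3, 4 and 5 will always be identical, which is the pattern described in the question.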

R auto.arima() vs arima() giving different result with the same model

I have a question about this time series analysis, with mean monthly air temperature (Deg. F) Nottingham Castle 1920-1939:
https://datamarket.com/data/set/22li/mean-monthly-air-temperature-deg-f-nottingham-castle-1920-1939#!ds=22li&display=line
When I ran
auto.arima(x.t,trace=TRUE)
it gave me "ARIMA(5,0,1) with non-zero mean" and "AIC=1198.42" as the lowest AIC. However, when I manually input the arima model, I came across a model with even lower aic.
arima(x = x.t, order = c(3, 1, 3))
aic = 1136.95.
When I run auto.arima(x.t,trace = TRUE,d=1), it gives me ARIMA(2,1,2) with an AIC of 1221.413, while ARIMA(3,1,3) with drift gives 1209.947 and ARIMA(3,1,3) gives 1207.859.
I am really confused. I thought auto.arima should automatically suggest the order of differencing. Why is the auto.arima AIC different from the arima AIC when they are the same model?
You're fitting two different ARIMA models. Obviously an ARIMA(5,0,1) model is not the same as an ARIMA(3,1,3) model. In the former, you model p=5 time lags with no differencing, whereas in the latter you consider p=3 time lags with d=1 degree of differencing. Additionally, your model's MA components are also different: q=1 vs. q=3.
Different models will obviously give you different quality metrics (i.e. different AICs).
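A small sketch of the point, using R's built-in nottem series (documented as average monthly temperatures at Nottingham, 1920 to 1939, which I am assuming matches x.t). The orders here are picked only for illustration. The same specification gives the same AIC whether it goes through arima() or forecast's Arima(); changing d changes the data the likelihood is computed on, so AIC values should not be compared across different orders of differencing.

```r
library(forecast)

x.t <- nottem  # assumption: same data as in the question

# Same specification through both functions: the AIC agrees
fit_stats <- arima(x.t, order = c(1, 0, 1), method = "ML")
fit_fc    <- Arima(x.t, order = c(1, 0, 1), method = "ML")
c(stats = AIC(fit_stats), forecast = AIC(fit_fc))

# Different d: the likelihood is now based on the differenced series,
# so this AIC is not comparable with the d = 0 values above
fit_d1 <- Arima(x.t, order = c(1, 1, 1), method = "ML")
AIC(fit_d1)
```

Note also that auto.arima() by default runs a stepwise search with an approximate likelihood, so it does not visit every model; stepwise = FALSE and approximation = FALSE make the search more thorough at the cost of speed.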

ARIMA vs. ARMA of time series in first differences

I am working on a GDP time series forecast. I have log-transformed the time series, which has a significant stochastic trend. I have checked that the time series in first differences is stationary. Now (I believe) I have two options:
Fit an ARMA model on the differenced log transformed GDP time series
Fit an ARIMA model (p,1,q) on the log transformed GDP time series
QUESTION:
I have noticed that ARIMA does not have an intercept, while ARMA does. How is the intercept to be interpreted?
How should I decide which one to use?
I have noticed that ARIMA does not have an intercept, while ARMA does. How is the intercept to be interpreted?
The interpretation of the intercept depends on your model. If the series is stationary, it is related to the mean through the other parameters; see, e.g., the AR(1) example on Wikipedia. An intercept in an ARIMA model with first-order differencing implies a constant drift, which is likely not what you want.
How should I decide which one to use?
A common choice is to use an information criterion such as AIC or BIC. E.g., see this post.
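To spell out the intercept interpretation, here is the AR(1) case worked through. The symbols c (intercept), \phi (AR coefficient) and \mu (mean) are notation introduced here for illustration.

```latex
\begin{align}
y_t &= c + \phi\, y_{t-1} + \varepsilon_t, \quad |\phi| < 1
  &&\Longrightarrow\quad \mathbb{E}[y_t] = \mu = \frac{c}{1-\phi}, \\
\Delta y_t &= c + \phi\, \Delta y_{t-1} + \varepsilon_t, \quad |\phi| < 1
  &&\Longrightarrow\quad \mathbb{E}[\Delta y_t] = \frac{c}{1-\phi}.
\end{align}
```

In the first line the intercept fixes the level the series reverts to. In the second line it fixes the average change per period, so the level y_t follows a linear trend with slope c/(1-\phi). An ARMA model with an intercept fitted to the differenced series therefore corresponds to an ARIMA(p,1,q) model with drift on the original series.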

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data on income for individuals over several years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, many individuals have missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I first decided to use a random effects model, which I think will still predict income for missing years by using a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtset caseid
xtreg income year, fe
predict resid, e
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
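Here is a fuller sketch on the example data from the question. Two things in it are my own additions: the index is passed explicitly, and the regressor is a separate numeric trend column, because as far as I understand plm stores the index variables as factors (so using years itself would give year dummies rather than a slope).

```r
library(plm)

caseid <- rep(1:4, each = 6)
years  <- rep(seq(1998, 2008, by = 2), times = 4)
income <- c(1100, NA, NA, NA, NA, 1300,
            1500, 1900, 2000, NA, 2200, NA,
            NA, NA, NA, NA, NA, NA,
            2300, 2500, 2000, 1800, NA, 1900)
df <- data.frame(caseid, years, income)

# Numeric time trend kept separate from the time index (years since 1998)
df$trend <- df$years - 1998

inc.fe <- plm(income ~ trend, data = df, index = c("caseid", "years"),
              model = "within", effect = "individual")

coef(inc.fe)        # common slope on the trend
fixef(inc.fe)       # one intercept per individual with observed income
residuals(inc.fe)   # one residual per non-missing observation
```

Rows with missing income are dropped, so individual 3 (all NA) gets no fixed effect and there is no residual for a missing year. This gives individual intercepts only; the slope is shared across individuals.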
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for individual i (i = 1,..., N) in year t (t = 1,..., T). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.
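For what it is worth, here is a rough sketch of how M2 might look in lme4 on simulated data (everything about the data is made up: 60 individuals, 6 survey years, about 25% of incomes set to missing at random). lmer() simply drops rows with a missing response, so this only shows the model syntax for an unbalanced panel and is not a treatment of the missingness itself. It also estimates a correlation between the random intercepts and slopes, which M2 as written above does not include.

```r
library(lme4)

set.seed(123)
n_id  <- 60
years <- seq(1998, 2008, by = 2)
panel <- expand.grid(years = years, caseid = seq_len(n_id))
panel$t <- (panel$years - 1998) / 2   # survey wave: 0, 1, ..., 5

# Simulated individual intercepts (mu_i) and slopes (gamma_i)
mu_i    <- rnorm(n_id, mean = 1500, sd = 300)
gamma_i <- rnorm(n_id, mean = 60, sd = 30)
panel$income <- mu_i[panel$caseid] + gamma_i[panel$caseid] * panel$t +
  rnorm(nrow(panel), sd = 50)

# Knock out 90 of the 360 incomes at random
panel$income[sample(nrow(panel), 90)] <- NA

# Random intercepts and random slopes by individual
m2 <- lmer(income ~ t + (t | caseid), data = panel)

ind_coefs <- coef(m2)$caseid   # per-individual intercept and slope
res       <- residuals(m2)     # one residual per observed income
head(ind_coefs)
```

The fixed effects play the role of \phi_0 and \phi_1, and coef() returns each individual's intercept and slope as the fixed effect plus that individual's estimated deviation.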

Resources