Get marginal effect and predicted probability for glmer model in R

I'm trying to calculate both the predicted probability values and marginal effects values (with p-values) for a categorical variable over time in a logistic regression model in R. Basically, I want to know 1) the predicted probability of the response variable (an event occurring) in each year for sample sites in one of 2 categories and 2) the average marginal effect of a site being in 1 category vs. the other in each year. I can get predicted probability values using the ggeffects package and marginal effects values from the margins package, but I haven't figured out a way to get both sets of values from a single package.
So my questions are:
1) Is there a package/method to get both of these sets of values?
2) If I get the predicted probability values from ggeffects and the marginal effects values from margins, are these values compatible? Or are there differences in the ways that the packages treat the models that mean I can't assume the marginal effects from one correspond to the predicted probabilities of the other?
3) In the margins package, how can I get the average marginal effect of the interaction of two factor variables over time?
4) How can I get margins() to work with a large dataset?
Here is some sample data:
### Make dataset
df <- data.frame(year = rep(2001:2010, each = 100),
                 state = rep(c("montana", "idaho",
                               "colorado", "wyoming", "utah"),
                             times = 10, each = 20),
                 site_id = as.factor(rep(1:100, times = 10)),
                 cat_variable = as.factor(rep(0:1, times = 5, each = 10)),
                 ind_cont_variable = rnorm(100, mean = 20, sd = 5),
                 event_occurred = as.factor(sample(c(0, 1),
                                                   replace = TRUE,
                                                   size = 1000)))
### Add dummy columns for states
library(fastDummies)
df <- dummy_cols(df,
                 select_columns = "state",
                 remove_first_dummy = TRUE)
I'm interested in the effects of the state and the categorical variable on the probability that the event occurred, and in how the effect of the state and categorical variable changed over time. Here's the model:
library(lme4)
fit_state <- glmer(event_occurred ~ ind_cont_variable +
                     cat_variable*year*state +
                     (1|site_id),
                   data = df,
                   family = binomial(link = "logit"),
                   nAGQ = 0,
                   control = glmerControl(optimizer = "nloptwrap"))
I can use ggeffects to get the predicted probability values for each state and category combination over time:
library(ggeffects)
fit_pp_state <- data.frame(ggpredict(fit_state,
                                     terms = c("year [all]",
                                               "cat_variable",
                                               "state")))
head(fit_pp_state)
### x = year, predicted = predicted probability, group = categorical variable level, facet = state
#      x predicted std.error  conf.low conf.high group    facet
# 1 2001 0.2835665 0.3981910 0.1535170 0.4634655     0 colorado
# 2 2001 0.5911911 0.3762090 0.4089121 0.7514289     0    idaho
# 3 2001 0.5038673 0.3719418 0.3288209 0.6779708     0  montana
# 4 2001 0.7101610 0.3964843 0.5297327 0.8420101     0     utah
# 5 2001 0.5714579 0.3747205 0.3901606 0.7354088     0  wyoming
# 6 2001 0.6788503 0.3892568 0.4963910 0.8192719     1 colorado
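For what it's worth, ggeffects objects can also be plotted directly with the package's plot() method, e.g.:
plot(ggpredict(fit_state,
               terms = c("year [all]", "cat_variable", "state")))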
This is really great for visualizing the changes in predicted probability over time in the 5 states. But I can't figure out how to go from these values to estimates of marginal effects using ggeffects. Using the margins package, I can get the marginal effect of the categorical variable over time, but I'm not sure how to interpret the outputs of the two different packages together or if that's even appropriate (my first two questions). In addition, I'm not sure how to get margins to give me the marginal effect of a sample site being in each combination of categorical variable level/state over time (bringing me to my third question):
library(margins)
fit_state_me <- summary(margins(fit_state,
                                at = list(year = 2001:2010),
                                variables = "cat_variable"))
head(fit_state_me)
#        factor      year     AME     SE       z      p   lower
# cat_variable1 2001.0000  0.0224 0.0567  0.3953 0.6926 -0.0887
# cat_variable1 2002.0000  0.0146 0.0490  0.2978 0.7659 -0.0814
# cat_variable1 2003.0000  0.0062 0.0418  0.1478 0.8825 -0.0757
# cat_variable1 2004.0000 -0.0026 0.0359 -0.0737 0.9413 -0.0731
# cat_variable1 2005.0000 -0.0117 0.0325 -0.3604 0.7186 -0.0754
# cat_variable1 2006.0000 -0.0208 0.0325 -0.6400 0.5222 -0.0845
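For my third question, the kind of call I have in mind for the category/state combinations is sketched below (I'm not sure this is the right specification; I believe at can take factor levels as well as numeric values):
fit_state_me_combos <- summary(margins(fit_state,
                                       at = list(year = 2001:2010,
                                                 state = c("montana", "idaho",
                                                           "colorado", "wyoming",
                                                           "utah")),
                                       variables = "cat_variable"))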
The actual dataset I'm using is fairly large (the csv of raw data is 1.51 GB and the regression model object is 1.29 GB when I save it as a .rds file). When I try to use margins() on my data, I get an error message:
Error: cannot allocate vector of size 369.5 Gb
Any advice for getting around this issue so that I can use this function on my data?
I'd be grateful for any tips: packages I should check out, mistakes I'm making in my code or in my conceptual understanding, etc. Thank you!

Related

Troubles predicting fixed effects from a hierarchical GAM in mgcv

I have been fitting different hierarchical GAMs (hereafter: HGAM) using mgcv in R. I can extract and plot their predictions for their random effects without problems. Conversely, extracting and plotting their predictions for their fixed effects only works for some models, and I don't know why.
Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):
rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])
These are the mean color spectra at the population level for the two species (rolling means were used):
[Figure: each color refers to a different species; each line refers to a different locality.]
The following model is a HGAM of type G according to Pedersen et al.'s classification (2019) and it does not give any issues:
gam_G1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(Locality, bs="re"), # "re" is short for "random effect"
data = ss, method = 'REML',
family="quasipoisson"
)
# gam.check(gam_G1)
# k.check(gam_G1)
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)
# use gam_G1 to predict wl by Locality
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
# use gam_G1 to predict wl by Taxon
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_G1,
type="response",
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
R warns me that "factor levels 0 not in original fit", but it executes the task without issues:
[Figure. Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.]
Troublesome models
The following model is a HGAM of type "GI" sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of fixed effects:
# GI: models with a global smoother for all observations,
# plus group-level smoothers, the wiggliness of which is estimated individually
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="tp", m=1)
# "tp" is short for "thin plate [regression spline]"
+ s(Locality, bs="re"),
family="quasipoisson",
data = ss, method = 'REML'
)
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)
Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl",
"Taxon",
"Locality",
"clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127
# predict:
pred <- predict(object = gam_GI1,
type="response",
# exclude="c(Locality)",
# # this should turn random effect off
# # (doesn't work for me)
newdata=nn,
se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
# wl Taxon Locality clr fit se
# 1 298.34 SpeciesB 0 red NA NA
# 2 305.82 SpeciesB 0 red NA NA
# 3 313.27 SpeciesB 0 red NA NA
# 4 320.72 SpeciesB 0 red NA NA
# 5 328.15 SpeciesB 0 red NA NA
# 6 335.57 SpeciesB 0 red NA NA
[Figure. Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.]
The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:
gam_GS1 <- bam(density ~ Taxon # main effect
+ s(wl, by = Taxon, k = 20) # interaction
+ s(wl, by = Locality, bs="fs", m=1),
# "fs" is short for "factor-smoother [interaction]"
family="quasipoisson",
data = ss, method = 'REML'
)
Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?
The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.
I think you are conflating several things here. The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect isn't a random intercept), and setting it to 0 is creating a new level (although it could be creating an NA, as 0 isn't among the original levels).
If what you want to do is turn off anything to do with Locality, you should use exclude; however, you have the invocation wrong. The reason it's not working is that you are creating a character vector with the single element "c(Locality)". This fails for obvious reasons once you realize that c(Locality) doesn't relate to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), {mgcv} knows this as s(Locality), so you would use exclude = "s(Locality)".
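For your simpler model gam_G1, where s(Locality) is the only Locality term, a minimal sketch would be (reusing ss and nn from your question; Locality must still hold valid factor levels in newdata, its contribution is simply zeroed out):
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
pred <- predict(gam_G1,
                newdata = nn,
                type = "response",
                exclude = "s(Locality)", # zero out the random intercept
                se.fit = TRUE)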
In your case, it is tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complementary argument terms, where you list the smooth labels that you want included in the prediction. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC"), or whatever summary() displays for these smooths.
You also need to include the Taxon term in terms, which I think needs to be:
terms = c("TaxonSpeciesB", "TaxonSpeciesC",
          "s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.
The by trick works like this:
gam(y ~ x + s(z) + s(id, bs = "re", by = dummy))
where dummy is set to the numeric value 1 when fitting and to 0 when predicting. As this is a numeric by variable, the smooth is multiplied by dummy, which is why setting it to 0 excludes the term. The reason your code isn't working is that you really want separate smooths of wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable created just to exclude a term from the model.
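Here is a minimal self-contained sketch of that trick on simulated data (all names here are made up for illustration):
library(mgcv)
set.seed(1)
d <- data.frame(x = runif(200), id = factor(rep(1:10, each = 20)))
d$y <- sin(2 * pi * d$x) + rnorm(10)[d$id] + rnorm(200, sd = 0.2)
d$dummy <- 1 # numeric 1 while fitting
m <- gam(y ~ s(x) + s(id, bs = "re", by = dummy), data = d, method = "REML")
newd <- data.frame(x = seq(0, 1, length.out = 100),
                   id = d$id[1], # any valid level; zeroed out anyway
                   dummy = 0)    # 0 multiplies the random-effect smooth away
p <- predict(m, newdata = newd)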
Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.
FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It means thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.
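A quick sketch of the distinction on simulated data (illustrative names only):
library(mgcv)
set.seed(1)
d <- data.frame(x = runif(200), z = runif(200))
d$y <- sin(2 * pi * d$x) + d$z + rnorm(200, sd = 0.2)
m_tprs <- gam(y ~ s(x, bs = "tp") + s(z), data = d) # two thin plate regression splines
m_te <- gam(y ~ te(x, z), data = d)                 # an actual tensor product smooth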

Obtain predicted probabilities from rstanarm in ordinal regression

How can I generate the posterior probability distribution for each outcome for each predictor in an ordinal regression?
e.g.
what I am looking for is this:
library(rstanarm)
fit_f <- MASS::polr(tobgp ~ agegp, data = esoph)
predict(fit_f,newdata=data.frame(agegp=factor(levels(esoph$agegp))),type = "probs")
Now with rstanarm I do:
fit <- stan_polr(tobgp ~ agegp, data = esoph, method = "logit",
prior = R2(0.2, "mean"), init_r = 0.1, seed = 12345)
But how do I obtain the distribution for the individual outcomes/predictors?
I do get a distribution of probabilities using posterior_epred, but I don't understand which outcome/predictor each value corresponds to:
posterior_epred(fit, newdata=data.frame(agegp=factor(levels(esoph$agegp))))
The easiest way to do this in rstanarm is to use the posterior_predict function to obtain posterior predictions and then calculate the proportion of predictions that fall in each outcome category by observation. In code,
PPD <- posterior_predict(fit) # uses esoph
probs <- t(apply(PPD, MARGIN = 2, FUN = table) / nrow(PPD))
The matrix called probs has rows equal to the number of observations (in esoph) and columns equal to the number of categories in tobgp and each of its rows sums to 1.
head(probs)
0-9g/day 10-19 20-29 30+
1 0.26400 0.26250 0.22875 0.24475
2 0.25650 0.26750 0.23050 0.24550
3 0.25175 0.27975 0.22450 0.24400
4 0.25575 0.26000 0.24025 0.24400
5 0.26350 0.26625 0.23575 0.23450
6 0.28275 0.26025 0.21500 0.24200
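To get these at specific predictor values, as asked in the question, the same recipe should also work with newdata; here is a sketch, where tabulating via factor() guards against outcome categories that happen never to be drawn for some observation:
nd <- data.frame(agegp = factor(levels(esoph$agegp), levels = levels(esoph$agegp)))
PPD_new <- posterior_predict(fit, newdata = nd)
probs_new <- t(apply(PPD_new, MARGIN = 2,
                     FUN = function(x) table(factor(x, levels = levels(esoph$tobgp))))) /
  nrow(PPD_new)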

Generating similar estimates of interactions in afex, lsmeans, and lme4 packages

I would like to know if there is a way to get the same estimates of an interaction effect in the afex and lsmeans packages as in lmer. The toy data below is for two groups with different intercepts and slopes.
set.seed(1234)
A0 <- rnorm(4,2,1)
B0 <- rnorm(4,2+3,1)
A1 <- rnorm(4,6,1)
B1 <- rnorm(4,6+2,1)
A2 <- rnorm(4,10,1)
B2 <- rnorm(4,10+1,1)
A3 <- rnorm(4,14,1)
B3 <- rnorm(4,14+0,1)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- factor(rep(1:8,times = 4, length = 32))
time <- factor(rep(0:3, each = 8, length = 32))
timeNum <- as.numeric(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id, group, time, timeNum, score)
df
And here is the plot
(ggplot(df, aes(x = time, y = score, group = group)) +
stat_summary(fun.y = "mean", geom = "line", aes(linetype = group)) +
stat_summary(fun.y = "mean", geom = "point", aes(shape = group), size = 3) +
coord_cartesian(ylim = c(0,18)))
When I run a standard lmer on the data, looking for an estimate of the difference in change in score over time between groups:
library(lme4)
summary(modelLMER <- lmer(score ~ group * timeNum + (timeNum|id), df))
I get an estimate for the group*time interaction of -1.07, which means that the increase in score for a one-unit increase in time is ~1 point less in group B than group A. This estimate matches the preset differences I built into the dataset.
What I would like to know is how to do a similar thing in the afex and lsmeans packages.
library(afex)
library(lsmeans)
First I generated the afex model object
modelLM <- aov_ez(id="id", dv="score", data=df, between="group", within="time",
type=3, return="lm")
Then passed that into the lsmeans function
lsMeansLM <- lsmeans(modelLM, ~rep.meas:group)
My goal is to generate an accurate estimate of the group*time interaction in afex and lsmeans. To do so requires specifying custom contrast matrices based on the split specified in the lsmeans function above.
groupMain = list(c(-1,-1,-1,-1,1,1,1,1)) # group main effect
linTrend = list(c(-3,-1,1,3,-3,-1,1,3)) # linear trend
linXGroup = mapply("*", groupMain, linTrend) # group x linear trend interaction
Then I made a master list
contrasts <- list(groupMain=groupMain, linTrend=linTrend, linXGroup=linXGroup)
Which I passed into the contrast function in lsmeans.
contrast(lsMeansLM, contrasts)
The F and p values in the output match those for the automatic tests for linear trend and for the group difference in linear trend generated from a mixed ANCOVA in SPSS. However the mixed ANCOVA does not generate an estimate.
The estimate of the effect using the procedure above, instead of being approx. -1 as in the lmer (and matching the difference I built into the data), is approx. -10, which is wildly inaccurate.
I assume it has something to do with how I am coding the contrast coefficients. I know if I normalise the coefficients of the groupMain matrix by dividing all coefficients by four that yields an accurate estimate of the main effect of group averaged across all timepoints. But I have no idea how to get an accurate estimate either of linear trend averaged across groups (linTrend), or an accurate estimate of the difference in linear trend across groups (linXGroup).
I am not sure if this question is more suitable for here or Cross Validated. I figured here first because it seems to be software related, but I know there are probably deeper issues involved. Any help would be much appreciated.
The issue here is that timeNum is a numeric predictor. Therefore, the interaction is a comparison of slopes. Note this:
> lstrends(modelLMER, ~group, var = "timeNum")
group timeNum.trend SE df lower.CL upper.CL
A 4.047168 0.229166 6.2 3.490738 4.603598
B 2.977761 0.229166 6.2 2.421331 3.534191
Degrees-of-freedom method: satterthwaite
Confidence level used: 0.95
> pairs(.Last.value)
contrast estimate SE df t.ratio p.value
A - B 1.069407 0.3240897 6.2 3.3 0.0157
There's your 1.07, with the opposite sign because the comparison is in the other direction.
I will further explain that the lsmeans result you describe in the question is a comparison of the two group means, not an interaction contrast. lsmeans uses a reference grid:
> ref.grid(modelLMER)
'ref.grid' object with variables:
group = A, B
timeNum = 1.5
and as you can see, timeNum is being held fixed at its mean of 1.5. The LS means are predictions for each group at timeNum = 1.5 -- often called the adjusted means; and the difference is thus the difference between those two adjusted means.
Regarding the discrepancy claimed in obtaining your linear contrast of about 10.7: The linear contrast coefficients c(-3,-1,1,3) give you a multiple of the slope of the line. To get the slope, you need to divide by sum(c(-3,-1,1,3)^2) -- and also multiply by 2, because the contrast coefficients increment by 2.
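Numerically, using the raw contrast of about -10.69 reported above, that works out as:
cc <- c(-3, -1, 1, 3)
2 / sum(cc^2)           # 0.1, the factor converting the raw contrast to a slope
-10.69 * 2 / sum(cc^2)  # ~ -1.07, matching the lmer interaction estimate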
Thanks to the invaluable help of @rvl I was able to solve this. Here is the code.
In order to generate the correct contrast matrices we first need to normalise them
(mainMat <- c(-1,-1,-1,-1,1,1,1,1)) # main effects matrix
(trendMat <- c(-3,-1,1,3,-3,-1,1,3)) # linear trend contrast coefficients
(nTimePoints <- 4) # number of timePoints
(mainNorm <- 1/nTimePoints)
(nGroups <- 2) # number of between-Ss groups
(trendIncrem <- 2) # the incremental increase of each new trend contrast coefficient
(trendNorm <- trendIncrem/(sum(trendMat^2))) # normalising the trend coefficients
Now we create several contrast matrices in the form of lists. These are normalised using the objects we created above
(groupMain = list(mainMat*mainNorm)) # normalised group main effect
(linTrend = list(trendMat*trendNorm)) # normalised linear trend
(linXGroup = list((mainMat*trendMat)*(nGroups*trendNorm))) # group x linear trend interaction
Now pass those lists of matrices into a master list
contrasts <- list(groupMain=groupMain, linTrend=linTrend, linXGroup=linXGroup)
And pass that master list into the contrasts function in lsmeans
contrast(lsMeansLM, contrasts)
This is the output
contrast estimate SE df t.ratio p.value
c(-0.25, -0.25, -0.25, -0.25, 0.25, 0.25, 0.25, 0.25) 1.927788 0.2230903 6 8.641 0.0001
c(-0.15, -0.05, 0.05, 0.15, -0.15, -0.05, 0.05, 0.15) 3.512465 0.1609290 6 21.826 <.0001
c(0.3, 0.1, -0.1, -0.3, -0.3, -0.1, 0.1, 0.3) -1.069407 0.3218581 6 -3.323 0.0160
How do we check if these are accurate estimates?
Note first that the estimate of the group*time interaction is now approximately the same value as is returned by
summary(modelLMER)
The 'main effect' trend (for want of a better descriptor), which is the rate of change in score across the four time points averaged across both levels of group, is 3.51. If we change the coding of the group factor to simple coding via
contrasts(df$group) <- c(-.5,.5)
and refit the model, the timeNum estimate in summary() will now be 3.51.
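As a minimal check (note that the model must be refit after changing the contrasts; modelLMER2 is just a name for the refit):
contrasts(df$group) <- c(-.5, .5)
modelLMER2 <- lmer(score ~ group * timeNum + (timeNum|id), df)
summary(modelLMER2) # the timeNum estimate should now be ~3.51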
Finally for the main effect of group, that is, the difference in score between groups averaged across all time points. We can run
pairs(lsmeans(modelLM,"group"))
And this will be -1.92. Thank you @rvl. A great answer. Using afex and lsmeans we have now forced a mixed ANCOVA that treats the repeated-measures variable as categorical to give us estimates of group differences in trend and main effects that match those returned by a mixed-effects model in which the repeated-measures variable is continuous, with p- and F-values that match those of SPSS.

R: predict.averaging is not taking an offset into account when plotting

I'm currently trying to use the predict.averaging function in MuMIn to create some graphs from some model averaging I've done on some GLMMs. I'm interested in whether the number of insects caught per daylight hour in some traps changes when the traps are left out for different lengths of time; I included offset(log(Daylight)) in my GLMMs to account for this. But when I use the predict function it doesn't take the offset into account, and I get the same graph that I would get if I hadn't included the offset in the first place. But I know the offset is having an effect, judging by the output from my model-averaged GLMMs, and it's the kind of effect I would expect from observations of my data.
Does anyone know why this problem might be occurring and how I might make predict.averaging take the offset into account? I've included the code that I'm using below:
# global model for total insect abundance
glmm11 <- glmmadmb(Total_polls ~ Max_temp + Wind + Precipitation + Veg_height +
                     Season + Year + log(Mean.nectar+1) +
                     I(log(Nectar+1) - log(Mean.nectar+1)) +
                     Pan_colour*Assoc_col + Treatment*Area*Depth +
                     (1|Transect) + offset(log(Daylight)),
                   data = ab, zeroInflation = FALSE, family = "nbinom")
# make predictions based on model averaging output (subset delta < 2)
preds<-predict(ave21, full = F, type = "response", backtransform = FALSE) # on the response scale
Where ave21 is a model averaging object generated using pdredge and model.avg that was constrained to have the offset in every model: model11 <- pdredge(glmm11, cluster = clust, fixed = ~offset(log(Daylight))+(1|Transect)). The object itself looks like this:
Call:
model.avg(object = get.models(object = model11, subset = delta <
2))
Component model call:
glmmadmb(formula = Total_polls ~ <3 unique rhs>, data = ab, family = nbinom,
zeroInflation = FALSE)
Component models:
df logLik AICc delta weight
1/2/3/4/5/6/7/8/9/10/11/12 20 -864.14 1769.22 0.00 0.47
1/2/3/4/5/6/7/8/9/10/11/12/13 23 -861.39 1770.03 0.81 0.31
1/3/4/5/6/7/8/9/10/11/12 19 -865.97 1770.79 1.57 0.21
Term codes:
 1  Area
 2  Assoc_col
 3  Depth
 4  I(log(Nectar + 1) - log(Mean.nectar + 1))
 5  Max_temp
 6  Pan_colour
 7  Season
 8  Treatment
 9  Year
10  log(Mean.nectar + 1)
11  offset(log(Daylight))
12  Area:Depth
13  Assoc_col:Pan_colour
Which I then used to get predictions:
pred_results<-cbind(glmm21$frame, preds) # append original dataframe to predictions
plot(pred_results$preds~pred_results$Treatment) # Treatment = trap duration (hours)
This code might go around the houses a little, as I borrowed it from a fellow PhD student. The graph I get when I plot my predictions (model predictions vs. trap duration in hours: https://i.stack.imgur.com/Pn4dK.jpg) is very different from the view given by the summary results of my model averaging:
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -5.896725 0.948102 0.949386 6.211 < 2e-16 ***
Treatment24 -0.714283 0.130226 0.130403 5.478 < 2e-16 ***
Treatment48 -0.983881 0.122416 0.122582 8.026 < 2e-16 ***
Any help would be great, as I can't find any specific instances where this has been addressed on the site to date. Thank you in advance and please let me know if you need me to add anything to make this question better.
Tom

ARIMA modelling, prediction and plotting with CO2 dataset in R

I am working with arima0() and the co2 dataset. I would like to plot the arima0() model over my data. I have tried fitted() and curve() with no success.
Here is my code:
###### Time Series
# format: time series
data(co2)
# format: matrix
dmn <- list(month.abb, unique(floor(time(co2))))
co2.m <- matrix(co2, 12, dimnames = dmn)
co2.dt <- pracma::detrend(co2.m, tt = 'linear')
co2.dt <- ts(as.numeric(co2.dt), start = c(1959,1), frequency=12)
# first diff
co2.dt.dif <- diff(co2.dt,lag = 12)
# Second diff
co2.dt.dif2 <- diff(co2.dt.dif,lag = 1)
With the data prepared, I ran the following arima0:
results <- arima0(co2.dt.dif2, order = c(2,0,0), method = "ML")
resultspredict <- predict(results, n.ahead = 36)
I would like to plot the model and the predictions, and I am hoping there is a way to do this in base R.
Session 1: To begin with...
To be honest, I am worried about your way of modelling the co2 time series. Something already went wrong when you de-trended co2. Why use tt = "linear"? You fit a linear trend within each period (i.e., year) and take the residuals for further inspection. This is often not recommended, as it tends to introduce artificial effects into the residual series. I would be inclined to use tt = "constant", i.e., simply dropping the yearly average. This would at least preserve the within-season correlation of the original data.
Perhaps you want to see some evidence here. Consider using ACF to help you diagnose.
data(co2)
## de-trend by dropping yearly average (no need to use `pracma::detrend`)
yearlymean <- ave(co2, gl(39, 12), FUN = mean)
co2dt <- co2 - yearlymean
## de-trend by dropping within season linear trend
co2.m <- matrix(co2, 12)
co2.dt <- pracma::detrend(co2.m, tt = "linear")
co2.dt <- ts(as.numeric(co2.dt), start = c(1959, 1), frequency = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt); acf(co2dt)
ts.plot(co2.dt); acf(co2.dt)
Both de-trended series have strong seasonal effect, thus a further seasonal differencing is required.
## seasonal differencing
co2dt.dif <- diff(co2dt, lag = 12)
co2.dt.dif <- diff(co2.dt, lag = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt.dif); acf(co2dt.dif)
ts.plot(co2.dt.dif); acf(co2.dt.dif)
The ACF for co2.dt.dif has more significant negative correlations. This is a sign of over-de-trending, so we prefer co2dt. co2dt is already stationary, and no more differencing is needed (otherwise you would just over-difference it and introduce more negative autocorrelation).
The big negative spike at lag 1 in the ACF of co2dt.dif suggests that we want a seasonal MA. Also, the positive spikes within the season imply a mild AR process in general. So consider:
## we exclude mean because we found estimation of mean is 0 if we include it
fit <- arima0(co2dt.dif, order = c(1,0,0), seasonal = c(0,0,1), include.mean = FALSE)
Whether this model is doing good, we need to inspect ACF of residuals:
acf(fit$residuals)
Looks like this model is decent (actually pretty great).
For prediction purposes, it is actually a better idea to fold the seasonal differencing of co2dt into the model fitting itself, rather than fitting co2dt.dif separately. Let's do
fit <- arima0(co2dt, order = c(1,0,0), seasonal = c(0,1,1), include.mean = FALSE)
This will give exactly the same estimates for the AR and MA coefficients as the two-stage work above, but now prediction is easily handled by a single predict call.
## 3 years' ahead prediction (no prediction error; only mean)
predco2dt <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot co2dt, fitted model and prediction together:
fittedco2dt <- co2dt - fit$residuals
ts.plot(co2dt, fittedco2dt, predco2dt, col = 1:3)
The result looks very promising!
Now the final stage, is to actually map this back to the original co2 series. For fitted values, we just add back the yearly mean we have dropped off:
fittedco2 <- fittedco2dt + yearlymean
But for prediction it is more difficult, because we don't know what the yearly mean in the future would be. In this regard, our modelling, though it looks good, is not practically useful. I will talk about a better idea in the next session. To finish this session, we plot co2 with its fitted values only:
ts.plot(co2, fittedco2, col = 1:2)
Session 2: A better idea for time series modelling
In previous session, we have seen the difficulty in prediction if we separate de-trending and modelling of de-trended series. Now, we try to combine those two stages in one go.
The seasonal pattern of co2 is really strong, so we need a seasonal differencing anyway:
data(co2)
co2dt <- diff(co2, lag = 12)
par(mfrow = c(1,2)); ts.plot(co2dt); acf(co2dt)
After this seasonal differencing, co2dt does not look stationary, so we need a further non-seasonal differencing.
co2dt.dif <- diff(co2dt)
par(mfrow = c(1,2)); ts.plot(co2dt.dif); acf(co2dt.dif)
The negative spikes within season and between seasons suggest that an MA process is needed for both. I will not work with co2dt.dif; we can work with co2 directly:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
acf(fit$residuals)
Now the residuals are perfectly uncorrelated! So we have an ARIMA(0,1,1)(0,1,1)[12] model for co2 series.
As usual, fitted values are obtained by subtracting residuals from data:
co2fitted <- co2 - fit$residuals
Predictions are made by a single call to predict:
co2pred <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot them together:
ts.plot(co2, co2fitted, co2pred, col = 1:3)
Oh, this is just gorgeous!
Session 3: Model selection
The story should have finished by now, but I would like to make a comparison with auto.arima from the forecast package, which can automatically decide on the "best" model.
library(forecast)
autofit <- auto.arima(co2)
#Series: co2
#ARIMA(1,1,1)(1,1,2)[12]
#
#Coefficients:
#          ar1      ma1     sar1     sma1     sma2
#       0.2569  -0.5847  -0.5489  -0.2620  -0.5123
# s.e.  0.1406   0.1204   0.5880   0.5701   0.4819
#
#sigma^2 estimated as 0.08576: log likelihood=-84.39
#AIC=180.78 AICc=180.97 BIC=205.5
auto.arima has chosen ARIMA(1,1,1)(1,1,2)[12], which is much more complicated, with five coefficients rather than our two.
Our model based on step-by-step investigation suggests an ARIMA(0,1,1)(0,1,1)[12]:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
#Call:
#arima0(x = co2, order = c(0, 1, 1), seasonal = c(0, 1, 1))
#
#Coefficients:
#           ma1     sma1
#       -0.3495  -0.8515
# s.e.   0.0497   0.0254
#
#sigma^2 estimated as 0.08262: log likelihood = -85.98, aic = 177.96
AIC values suggest our model is better. So does BIC:
BIC = -2 * loglik + log(n) * p
We have n <- length(co2) observations and p <- length(fit$coef) + 1 parameters (the additional one for sigma^2), thus our model has BIC
n <- length(co2)
p <- length(fit$coef) + 1
-2 * fit$loglik + log(n) * p
# [1] 196.5503
So, auto.arima has over-fitted the data.
In fact, as soon as we see ARIMA(1,1,1)(1,1,2)[12], we have a strong suspicion of over-fitting, because different effects can "cancel out" each other. This happens with the additional seasonal MA and non-seasonal AR terms introduced by auto.arima, as AR introduces positive autocorrelation while MA introduces negative autocorrelation.
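As a quick standalone illustration of such cancellation (simulated data, not the co2 model): an ARMA(1,1) whose AR and MA coefficients nearly cancel is almost indistinguishable from white noise.
set.seed(1)
x <- arima.sim(n = 500, model = list(ar = 0.5, ma = -0.45))
acf(x) # nearly flat beyond lag 0: the AR and MA effects almost cancel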
