Model predictions: how to manipulate dummy variables across time? - r

I'm trying to predict insect populations across a year. I've built my model (a GAM, using the package mgcv). I then built a dummy dataset and passed it to the predict() function to generate predictions. This is where I'm struggling.
My question is: how can I build a new dummy dataset that will simulate, say, cold winters vs. warm winters? I have just a "temperature" parameter, and I'm not sure how to manipulate that through time (or seasons). Ideally, I'd like to create a cold winter with mean summer temperatures and a warm winter with mean summer temperatures. Any suggestions would be greatly appreciated!
Quickly, the smooth terms in my model are average temperature, humidity, and day of year (doy). The model also has three random-effect terms. My model, prediction code, and the code for the resulting graph are below.
m1 <- gam(total ~ s(temp.avg) + s(humid) + s(doy, bs="cc", k=5) +
s(trap, bs="re")+s(site, bs="re")+s(year, bs="re"),
family=nb(),gamma=1.4,method="REML",data=dfe)
N <- 200
M <- 365
pdat1 <- with(dfe, expand.grid(year = c("2013","2014","2015","2016","2017"),
humid = mean(humid, na.rm = TRUE),
temp.avg = mean(temp.avg, na.rm = TRUE),
doy = seq(min(doy), max(doy), length = M),
trap = c("a","b","c","d"),
site = c("A","B", "C", "D")))
pred1 <- predict(m1, newdata = pdat1, type = "response", se.fit=TRUE)
crit <- qt(0.975, df = df.residual(m1)) # ~95% interval critical t
pdat1 <- transform(pdat1, fitted = pred1$fit, se = pred1$se.fit)
pdat1 <- transform(pdat1,
upper = fitted + (crit * se),
lower = fitted - (crit * se))
ggplot(pdat1, aes(x = doy, y = fitted)) +
geom_line() + theme_classic()+
labs(y = "Predicted Population", x = "Day of Year") +
theme(legend.position = "top")
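One way to encode "cold winter" and "warm winter" scenarios is to let temp.avg vary with doy in the prediction data instead of fixing it at its overall mean. The sketch below is only an illustration in base R: dfe_demo is simulated stand-in data, and the winter definition (December to February) and the 3-degree shift are assumptions, not values from the question.

```r
# Simulated stand-in for the observed data (dfe in the question); values are made up
set.seed(1)
dfe_demo <- data.frame(doy = rep(1:365, times = 3))
dfe_demo$temp.avg <- 10 - 12 * cos(2 * pi * dfe_demo$doy / 365) +
  rnorm(nrow(dfe_demo), sd = 2)

# Seasonal baseline: mean temperature for each day of year
baseline <- aggregate(temp.avg ~ doy, data = dfe_demo, FUN = mean)

# Assumption: winter runs from December through February
is_winter <- baseline$doy >= 335 | baseline$doy <= 59
shift <- 3  # assumed size of the winter anomaly, in the units of temp.avg

# Shift only the winter days; all other days keep their baseline temperature
make_scenario <- function(label, offset) {
  out <- baseline
  out$temp.avg <- out$temp.avg + ifelse(is_winter, offset, 0)
  out$scenario <- label
  out
}

scen <- rbind(make_scenario("cold winter", -shift),
              make_scenario("warm winter", +shift))
```

To use this with the model above, scen would still need the remaining covariates (humid, trap, site, year) before being passed to predict(m1, newdata = scen, ...). The exclude argument of predict.gam() may be worth checking if the random-effect smooths should be left out of the predictions.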

Related

How to generate covariate-adjusted cox survival/hazard functions?

I'm using the survminer package to try to generate survival and hazard function graphs for a longitudinal student-level dataset that has 5 subgroups of interest.
I've had success creating a model that shows the survival functions without adjusting for student-level covariates using ggsurvplot.
ggsurvplot(survfit(Surv(expectedgr, sped) ~ langstatus_new, data=mydata), pvalue=TRUE)
Output example
However, I cannot manage to get these curves adjusted for covariates. My aim is to create graphs like these. As you can see, these are covariate-adjusted survival curves according to some factor variable. Does anyone know how such graphs can be obtained in R?
You want to obtain survival probabilities from a Cox model for certain values of some covariate of interest, while adjusting for other covariates. However, because we do not make any assumption on the distribution of the survival times in a Cox model, we cannot directly obtain survival probabilities from it. We first have to estimate the baseline hazard function, which is typically done with the non-parametric Breslow estimator. When the Cox model is fitted with coxph from the survival package, we can obtain such probabilities with a call to the survfit() function. You may consult ?survfit.coxph for more information.
Let's see how we can do this by using the lung data set.
library(survival)
# select covariates of interest
df <- subset(lung, select = c(time, status, age, sex, ph.karno))
# assess whether there are any missing observations
apply(df, 2, \(x) sum(is.na(x))) # 1 in ph.karno
# listwise delete missing observations
df <- df[complete.cases(df), ]
# Cox model
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.karno, data = df)
## Note that I ignore the fact that ph.karno does not satisfy the PH assumption.
# specify for which combinations of values of age, sex, and
# ph.karno we want to derive survival probabilities
ND1 <- with(df, expand.grid(
age = median(age),
sex = c(1,2),
ph.karno = median(ph.karno)
))
ND2 <- with(df, expand.grid(
age = median(age),
sex = 1, # males
ph.karno = round(create_intervals(n_groups = 3L))
))
# Obtain the expected survival curves
sfit1 <- survfit(fit, newdata = ND1)
sfit2 <- survfit(fit, newdata = ND2)
The code behind the function create_intervals() can be found in this post. I just simply replaced speed with ph.karno in the function.
The output sfit1 contains the expected median survival times and the corresponding 95% confidence intervals for the combinations of covariates as specified in ND1.
> sfit1
Call: survfit(formula = fit, newdata = ND)
n events median 0.95LCL 0.95UCL
1 227 164 283 223 329
2 227 164 371 320 524
Survival probabilities at specific follow-up times can be obtained with the times argument of the summary() method.
# survival probabilities at 200 days of follow-up
summary(sfit1, times = 200)
The output again contains the expected survival probabilities, but now after 200 days of follow-up. Here survival1 and survival2 correspond to the first and second rows of ND1, i.e. a male and a female patient of median age with median ph.karno.
> summary(sfit1, times = 200)
Call: survfit(formula = fit, newdata = ND1)
time n.risk n.event survival1 survival2
200 144 71 0.625 0.751
The 95% confidence limits associated with these two probabilities can be manually extracted from summary().
sum_sfit <- summary(sfit1, times = 200)
sum_sfit <- t(rbind(sum_sfit$surv, sum_sfit$lower, sum_sfit$upper))
colnames(sum_sfit) <- c("S_hat", "2.5 %", "97.5 %")
# ------------------------------------------------------
> sum_sfit
S_hat 2.5 % 97.5 %
1 0.6250586 0.5541646 0.7050220
2 0.7513961 0.6842830 0.8250914
If you would like to use ggplot to depict the expected survival probabilities (and the corresponding 95% confidence intervals) for the combinations of values as specified in ND1 and ND2, we first need to make data.frames that contain all the information in an appropriate format.
# function which returns the output from a survfit.object
# in an appropriate format, which can be used in a call
# to ggplot()
df_fun <- \(surv_obj, newdata, factor) {
len <- length(unique(newdata[[factor]]))
out <- data.frame(
time = rep(surv_obj[['time']], times = len),
n.risk = rep(surv_obj[['n.risk']], times = len),
n.event = rep(surv_obj[['n.event']], times = len),
surv = stack(data.frame(surv_obj[['surv']]))[, 'values'],
upper = stack(data.frame(surv_obj[['upper']]))[, 'values'],
lower = stack(data.frame(surv_obj[['lower']]))[, 'values']
)
out[, 7] <- gl(len, length(surv_obj[['time']]))
names(out)[7] <- 'factor'
return(out)
}
# data for the first panel (A)
df_leftPanel <- df_fun(surv_obj = sfit1, newdata = ND1, factor = 'sex')
# data for the second panel (B)
df_rightPanel <- df_fun(surv_obj = sfit2, newdata = ND2, factor = 'ph.karno')
Now that we have defined our data.frames, we need to define a new function which allows us to plot the 95% CIs. We assign it the generic name geom_stepribbon.
library(ggplot2)
# Function for geom_stepribbon
geom_stepribbon <- function(
mapping = NULL,
data = NULL,
stat = "identity",
position = "identity",
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE, ...) {
layer(
data = data,
mapping = mapping,
stat = stat,
geom = GeomStepribbon,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ... )
)
}
GeomStepribbon <- ggproto(
"GeomStepribbon", GeomRibbon,
extra_params = c("na.rm"),
draw_group = function(data, panel_scales, coord, na.rm = FALSE) {
if (na.rm) data <- data[complete.cases(data[c("x", "ymin", "ymax")]), ]
data <- rbind(data, data)
data <- data[order(data$x), ]
data$x <- c(data$x[2:nrow(data)], NA)
data <- data[complete.cases(data["x"]), ]
GeomRibbon$draw_group(data, panel_scales, coord, na.rm = FALSE)
}
)
Finally, we can plot the expected survival probabilities for ND1 and ND2.
yl <- 'Expected Survival probability\n'
xl <- '\nTime (days)'
# left panel
my_colours <- c('blue4', 'darkorange')
adj_colour <- \(x) adjustcolor(x, alpha.f = 0.2)
my_colours <- c(
my_colours, adj_colour(my_colours[1]), adj_colour(my_colours[2])
)
left_panel <- ggplot(df_leftPanel,
aes(x = time, colour = factor, fill = factor)) +
geom_step(aes(y = surv), size = 0.8) +
geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
scale_colour_manual(name = 'Sex',
values = c('1' = my_colours[1],
'2' = my_colours[2]),
labels = c('1' = 'Males',
'2' = 'Females')) +
scale_fill_manual(name = 'Sex',
values = c('1' = my_colours[3],
'2' = my_colours[4]),
labels = c('1' = 'Males',
'2' = 'Females')) +
ylab(yl) + xlab(xl) +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.position = 'top')
# right panel
my_colours <- c('blue4', 'darkorange', '#00b0a4')
my_colours <- c(
my_colours, adj_colour(my_colours[1]),
adj_colour(my_colours[2]), adj_colour(my_colours[3])
)
right_panel <- ggplot(df_rightPanel,
aes(x = time, colour = factor, fill = factor)) +
geom_step(aes(y = surv), size = 0.8) +
geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
scale_colour_manual(name = 'Ph.karno',
values = c('1' = my_colours[1],
'2' = my_colours[2],
'3' = my_colours[3]),
labels = c('1' = 'Low',
'2' = 'Middle',
'3' = 'High')) +
scale_fill_manual(name = 'Ph.karno',
values = c('1' = my_colours[4],
'2' = my_colours[5],
'3' = my_colours[6]),
labels = c('1' = 'Low',
'2' = 'Middle',
'3' = 'High')) +
ylab(yl) + xlab(xl) +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.position = 'top')
# composite plot
library(ggpubr)
ggarrange(left_panel, right_panel,
ncol = 2, nrow = 1,
labels = c('A', 'B'))
Output
Interpretation
Panel A shows the expected survival probabilities for a male and female patient of median age with a median ph.karno.
Panel B shows the expected survival probabilities for three male patients of median age with ph.karnos of 67 (low), 83 (middle), and 100 (high).
These survival curves will always satisfy the PH assumption, as they were derived from the Cox model.
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
Although correct, I believe that the method described in the answer of Dion Groothof is not what is usually of interest. Usually, researchers are interested in visualizing the causal effect of a variable adjusted for confounders. Simply showing the predicted survival curve for one single covariate combination does not really do the trick here. I would recommend reading up on confounder-adjusted survival curves. See https://arxiv.org/abs/2203.10002 for example.
These types of curves can be calculated in R using the adjustedCurves package: https://github.com/RobinDenz1/adjustedCurves
In your example, the following code could be used:
library(survival)
library(devtools)
# install adjustedCurves from github, load it
devtools::install_github("RobinDenz1/adjustedCurves")
library(adjustedCurves)
# "event" needs to be binary
lung$status <- lung$status - 1
# "variable" needs to be a factor
lung$ph.ecog <- factor(lung$ph.ecog)
fit <- coxph(Surv(time, status) ~ ph.ecog + age + sex, data=lung,
x=TRUE)
# calculate and plot curves
adj <- adjustedsurv(data=lung, variable="ph.ecog", ev_time="time",
event="status", method="direct",
outcome_model=fit, conf_int=TRUE)
plot(adj)
Producing the following output:
These survival curves are adjusted for the effect of age and sex. More information on how this adjustment works can be found in the documentation of the adjustedCurves package or the article I cited above.

Boxplot not showing range

I have predicted values, via:
glm0 <- glm(use ~ as.factor(decision), data = decision_use, family = binomial(link = "logit"))
predicted_glm <- predict(glm0, newdata = decision_use, type = "response", interval = "confidence", se = TRUE)
predict <- predicted_glm$fit
predict <- predict + 1
head(predict)
1 2 3 4 5 6
0.3715847 0.3095335 0.3095335 0.3095335 0.3095335 0.5000000
Now when I plot a box plot using ggplot2,
ggplot(decision_use, aes(x = decision, y = predict)) +
geom_boxplot(aes(fill = factor(decision)), alpha = .2)
I get a box plot with one horizontal line per category. If you look at the predict data, the value is the same within each category, so that makes sense.
But I want a box plot with the range. How can I get that? When I use "use" instead of predict, I get boxes stretching from end to end (1 to 0). So I suppose that's not it. Thank you in advance.
To clarify, predicted_glm includes se.fit values. I wonder how to incorporate those.
It doesn't really make sense to do a boxplot here. A boxplot shows the range and spread of a continuous variable within groups. Your dependent variable is binary, so the values are all 0 or 1. Since you are plotting predictions for each group, your plot would have just a single point representing the expected value (i.e. the probability) for each group.
The closest you can come is probably to plot the prediction with 95% confidence bars around it.
You haven't provided any sample data, so I'll make some up here:
set.seed(100)
df <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)), var1 = rep(c("A", "B"), 100))
Now we'll create our model and get the prediction for each level of my predictor variable using the newdata parameter of predict. I'm going to specify type = "link" because I want the log odds, and I'm also going to specify se.fit = TRUE so I can get the standard error of these predictions:
mod <- glm(outcome ~ var1, data = df, family = binomial)
prediction <- predict(mod, list(var1 = c("A", "B")), se.fit = TRUE, type = "link")
Now I can work out the 95% confidence intervals for my predictions:
prediction$lower <- prediction$fit - prediction$se.fit * 1.96
prediction$upper <- prediction$fit + prediction$se.fit * 1.96
Finally, I transform the fit and confidence intervals from log odds into probabilities:
prediction <- lapply(prediction, function(logodds) exp(logodds)/(1 + exp(logodds)))
plotdf <- data.frame(Group = c("A", "B"), fit = prediction$fit,
upper = prediction$upper, lower = prediction$lower)
plotdf
#> Group fit upper lower
#> 1 A 0.13 0.2111260 0.07700412
#> 2 B 0.92 0.9594884 0.84811360
Now I am ready to plot. I will use geom_point for the probability estimates and geom_errorbar for the confidence intervals:
library(ggplot2)
ggplot(plotdf, aes(x = Group, y = fit, colour = Group)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2, width = 0.5) +
geom_point(size = 3, colour = "black") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability estimate with 95% CI", y = "Probability")
Created on 2020-05-11 by the reprex package (v0.3.0)
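A more compact variant of the steps in the answer above, offered as a sketch rather than a replacement: plogis() is the inverse logit in base R, so the fit and the interval bounds can be back-transformed directly, without also transforming se.fit and residual.scale as the lapply() call does. The object names (df_demo, mod_demo, ci) are made up for this example; the simulated data mirror the answer.

```r
# Simulated data, as in the answer above
set.seed(100)
df_demo <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)),
                      var1 = rep(c("A", "B"), 100))

mod_demo <- glm(outcome ~ var1, data = df_demo, family = binomial)

# Predict on the link (log-odds) scale, where a Wald interval is symmetric
new_groups <- data.frame(var1 = c("A", "B"))
pred_link <- predict(mod_demo, newdata = new_groups, type = "link", se.fit = TRUE)

z <- qnorm(0.975)  # about 1.96

# Back-transform only the estimate and its bounds to the probability scale
ci <- data.frame(
  Group = new_groups$var1,
  fit   = plogis(pred_link$fit),
  lower = plogis(pred_link$fit - z * pred_link$se.fit),
  upper = plogis(pred_link$fit + z * pred_link$se.fit)
)
ci
```

Because the interval is built on the log-odds scale and then transformed, the bounds always stay between 0 and 1, which is not guaranteed when adding and subtracting standard errors on the probability scale.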

Hazard ratio plot with confidence "waist" in ggplot2

When fitting a cox model that includes spline terms for a continuous covariate, I would like to be able to produce a plot of the hazard ratio across range of that covariate (relative to a fixed reference value) using ggplot2.
I have adapted an example from Terry Therneau's splines vignette here (see page 3). The only issue with this approach is the lack of a "waist" in the confidence interval at the reference value, as in this plot:
The example below produces the following plot, without the narrowing of the CI at the reference value.
library(survival)
library(splines)
library(ggplot2)
# colon cancer death dataset
ccd <- na.omit(subset(colon, etype == 2))
# fit model with ns() term for age
cox <- coxph(Surv(time, status) ~ rx + sex + ns(age, knots = c(20, 50, 70)), data = ccd)
# get data for plot
tp <- termplot(cox, se = TRUE, plot = FALSE)
# hazard ratio plot for natural spline of age, with reference at 50 yrs
ref <- tp$age$y[tp$age$x == 50]
ggplot() +
geom_line(data = tp$age, aes(x = x, y = exp(y - ref))) +
geom_line(data = tp$age, aes(x = x, y = exp(y - 1.96 * se - ref)), linetype = 2) +
geom_line(data = tp$age, aes(x = x, y = exp(y + 1.96 * se - ref)), linetype = 2) +
geom_hline(aes(yintercept = 1), linetype = 3) +
geom_rug(data = ccd, aes(x = age), sides = "b") +
labs(x = "Age at baseline, years",
y = "Hazard Ratio (95% CI) vs. 50 years",
title = "Mortality hazard ratio as a function of age",
subtitle = "Natural spline: knots at 20, 50, and 70 years")
I am aware that there are features in the rms package and the smoothHR package that produce these types of plots, but I am looking for a solution that is amenable to ggplot2 graphics and the coxph() function in the survival package. My question therefore boils down to:
Is there a way to adapt the output of termplot() to produce a plot with a "waist" at the reference value?
If termplot() cannot be used, how can I obtain the relevant plotting data by other means?
Edit 1: As the first comment suggested, this can be accomplished using rms and ggplot2 together. For example:
library(rms)
dd <- datadist(ccd)
dd$limits$age[2] <- 50
options(datadist = "dd")
cph <- cph(Surv(time, status) ~ rx + sex + rcs(age, c(20, 50, 70)), data = ccd, x = TRUE, y = TRUE)
pdata <- Predict(cph, age, ref.zero = TRUE, fun = exp)
ggplot(data = pdata) +
geom_hline(aes(yintercept = 1), linetype = 3) +
labs(x = "Age at baseline, years",
y = "Hazard Ratio (95% CI) vs. 50 years",
title = "Mortality hazard ratio as a function of age",
subtitle = "Natural spline: knots at 20, 50, and 70 years")
Which produces a plot very close to what I am after:
However, I would still like to know if there is a way to do this using coxph() and ns(). Not that I have anything against the rms package, I just have a bunch of old code based on survival functionality.
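One possible route with coxph() and ns(), sketched here under assumptions rather than taken from the thread: the log hazard ratio at a given age versus the reference is a linear contrast of the spline coefficients, so its standard error can be computed from vcov(cox). At the reference age the contrast is zero, which is what produces the "waist". The names basis, D, and hr_df are made up for this example, and the intervals are plain Wald intervals.

```r
library(survival)
library(splines)

ccd <- na.omit(subset(colon, etype == 2))
cox <- coxph(Surv(time, status) ~ rx + sex + ns(age, knots = c(20, 50, 70)),
             data = ccd)

ref_age <- 50
ages <- seq(min(ccd$age), max(ccd$age), by = 1)

# Rebuild the spline basis with the same knots and the same data range
# (assumes the model was fitted on exactly the rows of ccd)
basis <- ns(ccd$age, knots = c(20, 50, 70))
B     <- predict(basis, ages)
B_ref <- predict(basis, ref_age)

# Contrast matrix: basis at each age minus basis at the reference age
D <- matrix(as.numeric(B), nrow = length(ages)) -
  matrix(as.numeric(B_ref), nrow = length(ages), ncol = ncol(B), byrow = TRUE)

# Coefficients and covariance of the age spline terms only
idx  <- grep("ns(age", names(coef(cox)), fixed = TRUE)
beta <- coef(cox)[idx]
V    <- vcov(cox)[idx, idx]

log_hr <- as.numeric(D %*% beta)
se     <- sqrt(pmax(rowSums((D %*% V) * D), 0))  # se of the contrast, 0 at ref_age

hr_df <- data.frame(age   = ages,
                    hr    = exp(log_hr),
                    lower = exp(log_hr - 1.96 * se),
                    upper = exp(log_hr + 1.96 * se))
```

hr_df can then be passed to ggplot() with geom_line(aes(x = age, y = hr)) and either geom_ribbon() or dashed geom_line() layers for lower and upper.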

How to plot confidence intervals for predicted probabilities with "facet_wrap" used in R

I am using the following packages:
library(aod)
library(MASS)
library(ggplot2)
I am following the example R code from the following link:
http://stats.idre.ucla.edu/r/dae/logit-regression/
Here is the code for my GLMM
str(data1)
flocation <- factor(data1$location)
fID <- factor(data1$ID)
GLMM1 <- glmmPQL(presence ~ water + location + turbidity + temperature +
sp.cond, random = ~ 1|fID, family = binomial, data = data1)
summary(GLMM1)
I made predictions based on varying location and water levels, while holding temp and turb constant
newdata1 <- with(data1,
data.frame(water = water,
temperature = mean(temperature),
turbidity = mean(turbidity),
sp.cond = mean(sp.cond),
flocation = flocation))
newdata1$water.levelPred <- predict(GLMM1, type = "response")
newdata1
To get confidence intervals I used the code below
newdata2 <- cbind(newdata1, predict(GLMM1, newdata = newdata1, type = "link",
se = TRUE))
newdata2 <- within(newdata2, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
I get the following errors after running the confidence interval code:
Error in predict.lme(object, newdata, level = level, na.action =
na.action) : cannot evaluate groups for desired levels on 'newdata'
Error in plogis(fit) : object 'fit' not found
Why would this occur?
Because I can't get past this step I am having problems moving forward to plot the CI with predicted probabilities with the code below:
# plot in ggplot2 (the prediction column created above is water.levelPred)
ggplot(newdata2, aes(x = water, y = water.levelPred)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = flocation), alpha = 0.2) +
geom_line(aes(colour = flocation), size = 1) +
facet_wrap(~flocation) +
xlab("Water Depth (m)") +
ylab("Predicted Probability") +
theme_bw()

Forecast with ggplot2 and funggcast function

On this website, Mr. Davenport published a function that plots an ARIMA forecast with ggplot2, using an arbitrary dataset (published here) as the example. I can follow his example without any error message.
When I use my own data, however, I end up with these warnings:
1: In window.default(x, ...) : 'end' value not changed
2: In window.default(x, ...) : 'end' value not changed
I know this happens when I call pd <- funggcast(yt, yfor), and that it is related to the end = c(2013) I specify for my data, but I do not know how to fix it.
This is the code I use:
library(ggplot2)
library(zoo)
library(forecast)
myts <- ts(rnorm(55), start = c(1960), end = c(2013), freq = 1)
funggcast <- function(dn, fcast){
en <- max(time(fcast$mean)) # Extract the max date used in the forecast
# Extract Source and Training Data
ds <- as.data.frame(window(dn, end = en))
names(ds) <- 'observed'
ds$date <- as.Date(time(window(dn, end = en)))
# Extract the Fitted Values (need to figure out how to grab confidence intervals)
dfit <- as.data.frame(fcast$fitted)
dfit$date <- as.Date(time(fcast$fitted))
names(dfit)[1] <- 'fitted'
ds <- merge(ds, dfit, all.x = T) # Merge fitted values with source and training data
# Extract the Forecast values and confidence intervals
dfcastn <- as.data.frame(fcast)
dfcastn$date <- as.Date(as.yearmon(row.names(dfcastn)))
names(dfcastn) <- c('forecast','lo80','hi80','lo95','hi95','date')
pd <- merge(ds, dfcastn,all.x = T) # final data.frame for use in ggplot
return(pd)
}
yt <- window(myts, end = c(2013)) # extract training data until last year
yfit <- auto.arima(myts) # fit arima model
yfor <- forecast(yfit) # forecast
pd <- funggcast(yt, yfor) # extract the data for ggplot using function funggcast()
ggplot(data = pd, aes(x = date,y = observed)) + geom_line(color = "red") + geom_line(aes(y = fitted), color = "blue") + geom_line(aes(y = forecast)) + geom_ribbon(aes(ymin = lo95, ymax = hi95), alpha = .25) + scale_x_date(name = "Time in Decades") + scale_y_continuous(name = "GDP per capita (current US$)") + theme(axis.text.x = element_text(size = 10), legend.justification=c(0,1), legend.position=c(0,1)) + ggtitle("Arima(0,1,1) Fit and Forecast of GDP per capita for Brazil (1960-2013)") + scale_color_manual(values = c("Blue", "Red"), breaks = c("Fitted", "Data", "Forecast"))
Edit: I found another blog here with a function to use with forecast and ggplot2 but I would like to use the approach above, if I were able to find my mistake. Anyone?
Edit2:
If I run your updated code with my data here, then I get the graph below. Note that I did not change the end = c(2023) for myts, otherwise it would not merge the forecasted with the fitted values.
myts <- ts(WDI_gdp_capita$Brazil, start = c(1960), end = c(2023), freq = 1)
funggcast <- function(dn, fcast){
en <- max(time(fcast$mean)) # Extract the max date used in the forecast
# Extract Source and Training Data
ds <- as.data.frame(window(dn, end = en))
names(ds) <- 'observed'
ds$date <- as.Date(time(window(dn, end = en)))
# Extract the Fitted Values (need to figure out how to grab confidence intervals)
dfit <- as.data.frame(fcast$fitted)
dfit$date <- as.Date(time(fcast$fitted))
names(dfit)[1] <- 'fitted'
ds <- merge(ds, dfit, all = T) # Merge fitted values with source and training data
# Extract the Forecast values and confidence intervals
dfcastn <- as.data.frame(fcast)
dfcastn$date <- as.Date(paste(row.names(dfcastn),"01","01",sep="-"))
names(dfcastn) <- c('forecast','lo80','hi80','lo95','hi95','date')
pd <- merge(ds, dfcastn,all.x = T) # final data.frame for use in ggplot
return(pd)
} # ggplot function by Frank Davenport
yt <- window(myts, end = c(2013)) # extract training data until last year
yfit <- auto.arima(yt) # fit arima model
yfor <- forecast(yfit) # forecast
pd <- funggcast(myts, yfor) # extract the data for ggplot using function funggcast()
ggplot(data = pd, aes(x = date, y = observed)) +
geom_line(color = "red") +
geom_line(aes(y = fitted), color = "blue") +
geom_line(aes(y = forecast)) +
geom_ribbon(aes(ymin = lo95, ymax = hi95), alpha = .25) +
scale_x_date(name = "Time in Decades") +
scale_y_continuous(name = "GDP per capita (current US$)") +
theme(axis.text.x = element_text(size = 10), legend.justification = c(0, 1), legend.position = c(0, 1)) +
ggtitle("Arima(0,1,1) Fit and Forecast of GDP per capita for Brazil (1960-2013)") +
scale_color_manual(values = c("Blue", "Red"), breaks = c("Fitted", "Data", "Forecast"))
# save the last plot; ggsave() is called on its own rather than added to the plot with +
ggsave(filename = "gdp_forecast_ggplot.pdf", width = 330, height = 180, units = "mm", dpi = 300, limitsize = TRUE)
The almost perfect graph I get:
One additional question: How can I get a legend in this graph?
If I set end = c(2013) for myts, I get the same graph as in the beginning:
There are several points that are different between Mr Davenport's analysis and the plot you are trying to make.
The first one is that he is comparing the ARIMA forecast to some observed data, which is why he trains the model on only a portion of the whole time series, the training set.
To do this, you should make your initial time series longer:
myts <- ts(rnorm(64), start = c(1960), end = c(2023), freq = 1) # 64 annual values, 1960-2023
Then at the end of your script, where you select the training up to 2013:
yt <- window(myts, end = c(2013)) # extract training data until last year
The model should be trained on the training set, not the whole time series, so you should change the yfit line to:
yfit <- auto.arima(yt) # fit arima model
And call the funggcast function using the whole time series, because it needs the observed and fitted data:
pd <- funggcast(myts, yfor)
Finally, he uses dates that have month and year, so in his funggcast function, change this line:
dfcastn$date <- as.Date(as.yearmon(row.names(dfcastn)))
To:
dfcastn$date <- as.Date(paste(row.names(dfcastn),"01","01",sep="-"))
This is because the values predicted by the model need to be changed to dates, like 2014 has to be changed to 2014-01-01, in order to be merged with the observed data.
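The date conversion and merge described above can be tried in isolation. The following is a small base R sketch with made-up numbers: myts_demo and fc are hypothetical stand-ins for the observed series and for the table that as.data.frame() returns for the forecast, which (as the row.names() call above implies) carries the years as row names.

```r
# Toy annual series standing in for the observed data (values are made up)
set.seed(42)
myts_demo <- ts(cumsum(rnorm(64)), start = 1960, end = 2023, frequency = 1)

# Observed values, keyed by a Date built from the year (January 1st of each year)
obs <- data.frame(
  date     = as.Date(paste(as.integer(time(myts_demo)), "01", "01", sep = "-")),
  observed = as.numeric(myts_demo)
)

# Hypothetical forecast table with the forecast years as row names
fc <- data.frame(forecast = c(1.0, 1.1, 1.2),
                 row.names = c("2014", "2015", "2016"))

# Same conversion as in the answer: "2014" becomes 2014-01-01
fc$date <- as.Date(paste(row.names(fc), "01", "01", sep = "-"))

# Both tables now share a Date column, so the rows line up on merge
pd_demo <- merge(obs, fc, by = "date", all = TRUE)
pd_demo[pd_demo$date >= as.Date("2013-01-01") &
          pd_demo$date <= as.Date("2017-01-01"), ]
```

With all = TRUE, years that appear in only one of the two tables are kept and filled with NA, which is why ggplot2 later reports rows with missing values.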
After all the changes, the code looks like this:
library(ggplot2)
library(zoo)
library(forecast)
myts <- ts(rnorm(55), start = c(1960), end = c(2013), freq = 1)
funggcast <- function(dn, fcast){
en <- max(time(fcast$mean)) # Extract the max date used in the forecast
# Extract Source and Training Data
ds <- as.data.frame(window(dn, end = en))
names(ds) <- 'observed'
ds$date <- as.Date(time(window(dn, end = en)))
# Extract the Fitted Values (need to figure out how to grab confidence intervals)
dfit <- as.data.frame(fcast$fitted)
dfit$date <- as.Date(time(fcast$fitted))
names(dfit)[1] <- 'fitted'
ds <- merge(ds, dfit, all.x = T) # Merge fitted values with source and training data
# Extract the Forecast values and confidence intervals
dfcastn <- as.data.frame(fcast)
dfcastn$date <- as.Date(paste(row.names(dfcastn),"01","01",sep="-"))
names(dfcastn) <- c('forecast','lo80','hi80','lo95','hi95','date')
pd <- merge(ds, dfcastn,all= T) # final data.frame for use in ggplot
return(pd)
}
yt <- window(myts, end = c(2013)) # extract training data until last year
yfit <- auto.arima(yt) # fit arima model
yfor <- forecast(yfit) # forecast
pd <- funggcast(myts, yfor) # extract the data for ggplot using function funggcast()
plotData<-ggplot(data = pd, aes(x = date, y = observed)) + geom_line(aes(color = "1")) +
geom_line(aes(y = fitted,color="2")) +
geom_line(aes(y = forecast,color="3")) +
scale_colour_manual(values=c("red", "blue","black"),labels = c("Observed", "Fitted", "Forecasted"),name="Data")+
geom_ribbon(aes(ymin = lo95, ymax = hi95), alpha = .25)+
scale_x_date(name = "Time in Decades") +
scale_y_continuous(name = "GDP per capita (current US$)")+
theme(axis.text.x = element_text(size = 10)) +
ggtitle("Arima(0,1,1) Fit and Forecast of GDP per capita for Brazil (1960-2013)")
plotData
And you get a plot that looks like this. The fit is pretty bad with a completely random time series. ggplot will also emit some warnings, because the forecast line has no data before 2013 and the fitted data do not continue after 2013. (I ran it several times; depending on the initial random time series, the model might just predict 0 everywhere.)
Edit: changed the pd assignment line as well, in case there are no observed data after 2013
Edit2: I changed the ggplot function at the end of the code to make sure the legend shows up
There is a package called ggfortify available via github which allows straight plotting of forecast objects with ggplot2. It can be found on http://rpubs.com/sinhrks/plot_ts
This is a bump on a rather old post, but there's a function on GitHub that produces some nice results.
Here's the code as it was on Aug 03, 2016:
function(forec.obj, data.color = 'blue', fit.color = 'red', forec.color = 'black',
lower.fill = 'darkgrey', upper.fill = 'grey', format.date = F)
{
serie.orig = forec.obj$x
serie.fit = forec.obj$fitted
pi.strings = paste(forec.obj$level, '%', sep = '')
if(format.date)
dates = as.Date(time(serie.orig))
else
dates = time(serie.orig)
serie.df = data.frame(date = dates, serie.orig = serie.orig, serie.fit = serie.fit)
forec.M = cbind(forec.obj$mean, forec.obj$lower[, 1:2], forec.obj$upper[, 1:2])
forec.df = as.data.frame(forec.M)
colnames(forec.df) = c('forec.val', 'l0', 'l1', 'u0', 'u1')
if(format.date)
forec.df$date = as.Date(time(forec.obj$mean))
else
forec.df$date = time(forec.obj$mean)
p = ggplot() +
geom_line(aes(date, serie.orig, colour = 'data'), data = serie.df) +
geom_line(aes(date, serie.fit, colour = 'fit'), data = serie.df) +
scale_y_continuous() +
geom_ribbon(aes(x = date, ymin = l0, ymax = u0, fill = 'lower'), data = forec.df, alpha = I(0.4)) +
geom_ribbon(aes(x = date, ymin = l1, ymax = u1, fill = 'upper'), data = forec.df, alpha = I(0.3)) +
geom_line(aes(date, forec.val, colour = 'forecast'), data = forec.df) +
scale_color_manual('Series', values=c('data' = data.color, 'fit' = fit.color, 'forecast' = forec.color)) +
scale_fill_manual('P.I.', values=c('lower' = lower.fill, 'upper' = upper.fill))
if (format.date)
p = p + scale_x_date()
p
}
