Forecast with ggplot2 and funggcast function - r

On this website, Mr. Davenport published a function to plot an ARIMA forecast with ggplot2, using an arbitrary dataset as the example. I can follow his example without any error message.
Now, when I use my own data, I end up with these warnings:
1: In window.default(x, ...) : 'end' value not changed
2: In window.default(x, ...) : 'end' value not changed
I know that it happens when I call pd <- funggcast(yt, yfor), because of the end = c(2013) I specify for my data. But I do not know how to fix it.
This is the code I use:
library(ggplot2)
library(zoo)
library(forecast)
myts <- ts(rnorm(55), start = c(1960), end = c(2013), freq = 1)
funggcast <- function(dn, fcast){
en <- max(time(fcast$mean)) # Extract the max date used in the forecast
# Extract Source and Training Data
ds <- as.data.frame(window(dn, end = en))
names(ds) <- 'observed'
ds$date <- as.Date(time(window(dn, end = en)))
# Extract the Fitted Values (need to figure out how to grab confidence intervals)
dfit <- as.data.frame(fcast$fitted)
dfit$date <- as.Date(time(fcast$fitted))
names(dfit)[1] <- 'fitted'
ds <- merge(ds, dfit, all.x = T) # Merge fitted values with source and training data
# Extract the Forecast values and confidence intervals
dfcastn <- as.data.frame(fcast)
dfcastn$date <- as.Date(as.yearmon(row.names(dfcastn)))
names(dfcastn) <- c('forecast','lo80','hi80','lo95','hi95','date')
pd <- merge(ds, dfcastn,all.x = T) # final data.frame for use in ggplot
return(pd)
}
yt <- window(myts, end = c(2013)) # extract training data until last year
yfit <- auto.arima(myts) # fit arima model
yfor <- forecast(yfit) # forecast
pd <- funggcast(yt, yfor) # extract the data for ggplot using function funggcast()
ggplot(data = pd, aes(x = date,y = observed)) + geom_line(color = "red") + geom_line(aes(y = fitted), color = "blue") + geom_line(aes(y = forecast)) + geom_ribbon(aes(ymin = lo95, ymax = hi95), alpha = .25) + scale_x_date(name = "Time in Decades") + scale_y_continuous(name = "GDP per capita (current US$)") + theme(axis.text.x = element_text(size = 10), legend.justification=c(0,1), legend.position=c(0,1)) + ggtitle("Arima(0,1,1) Fit and Forecast of GDP per capita for Brazil (1960-2013)") + scale_color_manual(values = c("Blue", "Red"), breaks = c("Fitted", "Data", "Forecast"))
Edit: I found another blog here with a function to use with forecast and ggplot2 but I would like to use the approach above, if I were able to find my mistake. Anyone?
Edit2:
If I run your updated code with my data, then I get the graph below. Note that I kept end = c(2023) for myts, otherwise the forecast values would not merge with the fitted values.
myts <- ts(WDI_gdp_capita$Brazil, start = c(1960), end = c(2023), freq = 1)
funggcast <- function(dn, fcast){
en <- max(time(fcast$mean)) # Extract the max date used in the forecast
# Extract Source and Training Data
ds <- as.data.frame(window(dn, end = en))
names(ds) <- 'observed'
ds$date <- as.Date(time(window(dn, end = en)))
# Extract the Fitted Values (need to figure out how to grab confidence intervals)
dfit <- as.data.frame(fcast$fitted)
dfit$date <- as.Date(time(fcast$fitted))
names(dfit)[1] <- 'fitted'
ds <- merge(ds, dfit, all = T) # Merge fitted values with source and training data
# Extract the Forecast values and confidence intervals
dfcastn <- as.data.frame(fcast)
dfcastn$date <- as.Date(paste(row.names(dfcastn),"01","01",sep="-"))
names(dfcastn) <- c('forecast','lo80','hi80','lo95','hi95','date')
pd <- merge(ds, dfcastn,all.x = T) # final data.frame for use in ggplot
return(pd)
} # ggplot function by Frank Davenport
yt <- window(myts, end = c(2013)) # extract training data until last year
yfit <- auto.arima(yt) # fit arima model
yfor <- forecast(yfit) # forecast
pd <- funggcast(myts, yfor) # extract the data for ggplot using function funggcast()
ggplot(data = pd, aes(x = date, y = observed)) + geom_line(color = "red") + geom_line(aes(y = fitted), color = "blue") + geom_line(aes(y = forecast)) + geom_ribbon(aes(ymin = lo95, ymax = hi95), alpha = .25) + scale_x_date(name = "Time in Decades") + scale_y_continuous(name = "GDP per capita (current US$)") + theme(axis.text.x = element_text(size = 10), legend.justification=c(0,1), legend.position=c(0,1)) + ggtitle("Arima(0,1,1) Fit and Forecast of GDP per capita for Brazil (1960-2013)") + scale_color_manual(values = c("Blue", "Red"), breaks = c("Fitted", "Data", "Forecast")) + ggsave((filename = "gdp_forecast_ggplot.pdf"), width=330, height=180, units=c("mm"), dpi = 300, limitsize = TRUE)
The almost perfect graph I get:
One additional question: How can I get a legend in this graph?
If I set end = c(2013) for myts, I get the same graph as in the beginning:

There are several points that are different between Mr Davenport's analysis and the plot you are trying to make.
The first one is that he is comparing the ARIMA forecast to some observed data, which is why he trains the model on only a portion of the whole time series, the training set.
To do this, you should make your initial time series longer:
myts <- ts(rnorm(64), start = c(1960), end = c(2023), freq = 1) # 64 yearly values, 1960-2023
Then at the end of your script, where you select the training up to 2013:
yt <- window(myts, end = c(2013)) # extract training data until last year
The model should be trained on the training set, not the whole time series, so you should change the yfit line to:
yfit <- auto.arima(yt) # fit arima model
And call the funggcast function using the whole time series, because it needs the observed and fitted data:
pd <- funggcast(myts, yfor)
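The warning in the question comes from window() being asked for an end date later than the last observation: inside funggcast(), en is the last forecast year, so a series that stops in 2013 cannot be extended to it. A minimal base R sketch of that behaviour, using a made-up series (short_ts is an illustrative name, not part of the original code):

```r
# Hypothetical yearly series that stops in 2013, like the one in the question
set.seed(1)
short_ts <- ts(rnorm(54), start = 1960, frequency = 1)

# funggcast() calls window(dn, end = en), where en is the last forecast year.
# Asking for an end beyond the data raises the warning seen in the question.
warning_text <- tryCatch(
  {
    window(short_ts, end = 2023)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(warning_text)

# The series is returned unchanged: nothing is added past 2013
clipped <- suppressWarnings(window(short_ts, end = 2023))
print(end(clipped))
```

Passing the longer series to funggcast(), as suggested above, avoids the warning because the requested end date then exists in the data.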
Finally, he uses dates that have month and year, so in his funggcast function, change this line:
dfcastn$date <- as.Date(as.yearmon(row.names(dfcastn)))
To:
dfcastn$date <- as.Date(paste(row.names(dfcastn),"01","01",sep="-"))
This is because the values predicted by the model need to be changed to dates, like 2014 has to be changed to 2014-01-01, in order to be merged with the observed data.
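As a small illustration of that conversion, the sketch below turns year labels into Date values and checks which of them can be matched against observed dates. The vectors are made up for the example and stand in for row.names(dfcastn) and ds$date.

```r
# Year labels as they would appear in the row names of a yearly forecast
years <- c("2014", "2015", "2016")

# Build "YYYY-01-01" strings and convert them to Date
forecast_dates <- as.Date(paste(years, "01", "01", sep = "-"))
print(forecast_dates)

# Hypothetical observed dates; merge() can only match rows on equal Date values
observed_dates <- as.Date(c("2012-01-01", "2013-01-01", "2014-01-01"))
print(forecast_dates %in% observed_dates)
```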
After all the changes, the code looks like this:
library(ggplot2)
library(zoo)
library(forecast)
myts <- ts(rnorm(55), start = c(1960), end = c(2013), freq = 1)
funggcast <- function(dn, fcast){
en <- max(time(fcast$mean)) # Extract the max date used in the forecast
# Extract Source and Training Data
ds <- as.data.frame(window(dn, end = en))
names(ds) <- 'observed'
ds$date <- as.Date(time(window(dn, end = en)))
# Extract the Fitted Values (need to figure out how to grab confidence intervals)
dfit <- as.data.frame(fcast$fitted)
dfit$date <- as.Date(time(fcast$fitted))
names(dfit)[1] <- 'fitted'
ds <- merge(ds, dfit, all.x = T) # Merge fitted values with source and training data
# Extract the Forecast values and confidence intervals
dfcastn <- as.data.frame(fcast)
dfcastn$date <- as.Date(paste(row.names(dfcastn),"01","01",sep="-"))
names(dfcastn) <- c('forecast','lo80','hi80','lo95','hi95','date')
pd <- merge(ds, dfcastn,all= T) # final data.frame for use in ggplot
return(pd)
}
yt <- window(myts, end = c(2013)) # extract training data until last year
yfit <- auto.arima(yt) # fit arima model
yfor <- forecast(yfit) # forecast
pd <- funggcast(myts, yfor) # extract the data for ggplot using function funggcast()
plotData<-ggplot(data = pd, aes(x = date, y = observed)) + geom_line(aes(color = "1")) +
geom_line(aes(y = fitted,color="2")) +
geom_line(aes(y = forecast,color="3")) +
scale_colour_manual(values=c("red", "blue","black"),labels = c("Observed", "Fitted", "Forecasted"),name="Data")+
geom_ribbon(aes(ymin = lo95, ymax = hi95), alpha = .25)+
scale_x_date(name = "Time in Decades") +
scale_y_continuous(name = "GDP per capita (current US$)")+
theme(axis.text.x = element_text(size = 10)) +
ggtitle("Arima(0,1,1) Fit and Forecast of GDP per capita for Brazil (1960-2013)")
plotData
And you get a plot that looks like this. The fit is pretty bad because the time series is completely random. ggplot will also emit some warnings about missing values, because the forecast line has no data before 2013 and the fitted data stops after 2013. (I ran it several times; depending on the initial random time series, the model might just predict 0 everywhere.)
Edit: changed the pd assignment line as well, in case there are no observed data after 2013
Edit2: I changed the ggplot function at the end of the code to make sure the legend shows up
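The first edit matters because merge(..., all.x = TRUE) keeps only the dates present in the left data frame, so forecast rows are dropped when no observed data exist for those dates. A toy sketch with made-up values (the data frame names are illustrative only):

```r
# Observed data stop in 2013; forecasts start in 2014 (values are invented)
observed <- data.frame(
  date = as.Date(c("2012-01-01", "2013-01-01")),
  observed = c(1.2, 1.5)
)
forecasted <- data.frame(
  date = as.Date(c("2014-01-01", "2015-01-01")),
  forecast = c(1.6, 1.7)
)

# all.x = TRUE keeps only the observed dates, so the forecasts disappear
left_merge <- merge(observed, forecasted, all.x = TRUE)
print(left_merge)

# all = TRUE keeps every date from both data frames
full_merge <- merge(observed, forecasted, all = TRUE)
print(full_merge)
```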

There is a package called ggfortify available via github which allows straight plotting of forecast objects with ggplot2. It can be found on http://rpubs.com/sinhrks/plot_ts

This is a bump on a rather old post, but there's a function on GitHub that produces some nice results.
Here's the code as it was on Aug 03, 2016:
function(forec.obj, data.color = 'blue', fit.color = 'red', forec.color = 'black',
lower.fill = 'darkgrey', upper.fill = 'grey', format.date = F)
{
serie.orig = forec.obj$x
serie.fit = forec.obj$fitted
pi.strings = paste(forec.obj$level, '%', sep = '')
if(format.date)
dates = as.Date(time(serie.orig))
else
dates = time(serie.orig)
serie.df = data.frame(date = dates, serie.orig = serie.orig, serie.fit = serie.fit)
forec.M = cbind(forec.obj$mean, forec.obj$lower[, 1:2], forec.obj$upper[, 1:2])
forec.df = as.data.frame(forec.M)
colnames(forec.df) = c('forec.val', 'l0', 'l1', 'u0', 'u1')
if(format.date)
forec.df$date = as.Date(time(forec.obj$mean))
else
forec.df$date = time(forec.obj$mean)
p = ggplot() +
geom_line(aes(date, serie.orig, colour = 'data'), data = serie.df) +
geom_line(aes(date, serie.fit, colour = 'fit'), data = serie.df) +
scale_y_continuous() +
geom_ribbon(aes(x = date, ymin = l0, ymax = u0, fill = 'lower'), data = forec.df, alpha = I(0.4)) +
geom_ribbon(aes(x = date, ymin = l1, ymax = u1, fill = 'upper'), data = forec.df, alpha = I(0.3)) +
geom_line(aes(date, forec.val, colour = 'forecast'), data = forec.df) +
scale_color_manual('Series', values=c('data' = data.color, 'fit' = fit.color, 'forecast' = forec.color)) +
scale_fill_manual('P.I.', values=c('lower' = lower.fill, 'upper' = upper.fill))
if (format.date)
p = p + scale_x_date()
p
}

Related

How to generate covariate-adjusted cox survival/hazard functions?

I'm using the survminer package to try to generate survival and hazard function graphs for a longitudinal student-level dataset that has 5 subgroups of interest.
I've had success creating a model that shows the survival functions without adjusting for student-level covariates using ggsurvplot.
ggsurvplot(survfit(Surv(expectedgr, sped) ~ langstatus_new, data=mydata), pvalue=TRUE)
Output example
However, I cannot manage to get these curves adjusted for covariates. My aim is to create graphs like these. As you can see, these are covariate-adjusted survival curves according to some factor variable. Does anyone know how such graphs can be obtained in R?
You want to obtain survival probabilities from a Cox model for certain values of some covariate of interest, while adjusting for other covariates. However, because we do not make any assumption on the distribution of the survival times in a Cox model, we cannot directly obtain survival probabilities from it. We first have to estimate the baseline hazard function, which is typically done with the non-parametric Breslow estimator. When the Cox model is fitted with coxph from the survival package, we can obtain such probabilities with a call to the survfit() function. You may consult ?survfit.coxph for more information.
Let's see how we can do this by using the lung data set.
library(survival)
# select covariates of interest
df <- subset(lung, select = c(time, status, age, sex, ph.karno))
# assess whether there are any missing observations
apply(df, 2, \(x) sum(is.na(x))) # 1 in ph.karno
# listwise delete missing observations
df <- df[complete.cases(df), ]
# Cox model
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.karno, data = df)
## Note that I ignore the fact that ph.karno does not satisfy the PH assumption.
# specify for which combinations of values of age, sex, and
# ph.karno we want to derive survival probabilities
ND1 <- with(df, expand.grid(
age = median(age),
sex = c(1,2),
ph.karno = median(ph.karno)
))
ND2 <- with(df, expand.grid(
age = median(age),
sex = 1, # males
ph.karno = round(create_intervals(n_groups = 3L))
))
# Obtain the expected survival times
sfit1 <- survfit(fit, newdata = ND1)
sfit2 <- survfit(fit, newdata = ND2)
The code behind the function create_intervals() can be found in this post. I just simply replaced speed with ph.karno in the function.
The output sfit1 contains the expected median survival times and the corresponding 95% confidence intervals for the combinations of covariates as specified in ND1.
> sfit1
Call: survfit(formula = fit, newdata = ND)
n events median 0.95LCL 0.95UCL
1 227 164 283 223 329
2 227 164 371 320 524
Survival probabilities at specific follow-up times can be obtained with the times argument of the summary() method.
# survival probabilities at 200 days of follow-up
summary(sfit1, times = 200)
The output again contains the expected survival probabilities, but now after 200 days of follow-up. Here survival1 and survival2 correspond to the first and second rows of ND1, i.e. a male and a female patient of median age with median ph.karno.
> summary(sfit1, times = 200)
Call: survfit(formula = fit, newdata = ND1)
time n.risk n.event survival1 survival2
200 144 71 0.625 0.751
The 95% confidence limits associated with these two probabilities can be manually extracted from summary().
sum_sfit <- summary(sfit1, times = 200)
sum_sfit <- t(rbind(sum_sfit$surv, sum_sfit$lower, sum_sfit$upper))
colnames(sum_sfit) <- c("S_hat", "2.5 %", "97.5 %")
# ------------------------------------------------------
> sum_sfit
S_hat 2.5 % 97.5 %
1 0.6250586 0.5541646 0.7050220
2 0.7513961 0.6842830 0.8250914
If you would like to use ggplot to depict the expected survival probabilities (and the corresponding 95% confidence intervals) for the combinations of values as specified in ND1 and ND2, we first need to make data.frames that contain all the information in an appropriate format.
# function which returns the output from a survfit.object
# in an appropriate format, which can be used in a call
# to ggplot()
df_fun <- \(surv_obj, newdata, factor) {
len <- length(unique(newdata[[factor]]))
out <- data.frame(
time = rep(surv_obj[['time']], times = len),
n.risk = rep(surv_obj[['n.risk']], times = len),
n.event = rep(surv_obj[['n.event']], times = len),
surv = stack(data.frame(surv_obj[['surv']]))[, 'values'],
upper = stack(data.frame(surv_obj[['upper']]))[, 'values'],
lower = stack(data.frame(surv_obj[['lower']]))[, 'values']
)
out[, 7] <- gl(len, length(surv_obj[['time']]))
names(out)[7] <- 'factor'
return(out)
}
# data for the first panel (A)
df_leftPanel <- df_fun(surv_obj = sfit1, newdata = ND1, factor = 'sex')
# data for the second panel (B)
df_rightPanel <- df_fun(surv_obj = sfit2, newdata = ND2, factor = 'ph.karno')
Now that we have defined our data.frames, we need to define a new function which allows us to plot the 95% CIs. We assign it the generic name geom_stepribbon.
library(ggplot2)
# Function for geom_stepribbon
geom_stepribbon <- function(
mapping = NULL,
data = NULL,
stat = "identity",
position = "identity",
na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE, ...) {
layer(
data = data,
mapping = mapping,
stat = stat,
geom = GeomStepribbon,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(na.rm = na.rm, ... )
)
}
GeomStepribbon <- ggproto(
"GeomStepribbon", GeomRibbon,
extra_params = c("na.rm"),
draw_group = function(data, panel_scales, coord, na.rm = FALSE) {
if (na.rm) data <- data[complete.cases(data[c("x", "ymin", "ymax")]), ]
data <- rbind(data, data)
data <- data[order(data$x), ]
data$x <- c(data$x[2:nrow(data)], NA)
data <- data[complete.cases(data["x"]), ]
GeomRibbon$draw_group(data, panel_scales, coord, na.rm = FALSE)
}
)
Finally, we can plot the expected survival probabilities for ND1 and ND2.
yl <- 'Expected Survival probability\n'
xl <- '\nTime (days)'
# left panel
my_colours <- c('blue4', 'darkorange')
adj_colour <- \(x) adjustcolor(x, alpha.f = 0.2)
my_colours <- c(
my_colours, adj_colour(my_colours[1]), adj_colour(my_colours[2])
)
left_panel <- ggplot(df_leftPanel,
aes(x = time, colour = factor, fill = factor)) +
geom_step(aes(y = surv), size = 0.8) +
geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
scale_colour_manual(name = 'Sex',
values = c('1' = my_colours[1],
'2' = my_colours[2]),
labels = c('1' = 'Males',
'2' = 'Females')) +
scale_fill_manual(name = 'Sex',
values = c('1' = my_colours[3],
'2' = my_colours[4]),
labels = c('1' = 'Males',
'2' = 'Females')) +
ylab(yl) + xlab(xl) +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.position = 'top')
# right panel
my_colours <- c('blue4', 'darkorange', '#00b0a4')
my_colours <- c(
my_colours, adj_colour(my_colours[1]),
adj_colour(my_colours[2]), adj_colour(my_colours[3])
)
right_panel <- ggplot(df_rightPanel,
aes(x = time, colour = factor, fill = factor)) +
geom_step(aes(y = surv), size = 0.8) +
geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
scale_colour_manual(name = 'Ph.karno',
values = c('1' = my_colours[1],
'2' = my_colours[2],
'3' = my_colours[3]),
labels = c('1' = 'Low',
'2' = 'Middle',
'3' = 'High')) +
scale_fill_manual(name = 'Ph.karno',
values = c('1' = my_colours[4],
'2' = my_colours[5],
'3' = my_colours[6]),
labels = c('1' = 'Low',
'2' = 'Middle',
'3' = 'High')) +
ylab(yl) + xlab(xl) +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.position = 'top')
# composite plot
library(ggpubr)
ggarrange(left_panel, right_panel,
ncol = 2, nrow = 1,
labels = c('A', 'B'))
Output
Interpretation
Panel A shows the expected survival probabilities for a male and female patient of median age with a median ph.karno.
Panel B shows the expected survival probabilities for three male patients of median age with ph.karnos of 67 (low), 83 (middle), and 100 (high).
These survival curves will always satisfy the PH assumption, as they were derived from the Cox model.
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
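To make the link between the baseline hazard and the adjusted curves concrete, here is a rough sketch that rebuilds one curve by hand. It assumes a simpler model than the one above (age and sex only), and the names dat, base, new_patient and surv_manual are illustrative. The idea is that the survival probability for a covariate pattern is exp(-H0(t) * exp(linear predictor)), where H0 is the uncentered cumulative baseline hazard returned by basehaz().

```r
library(survival)

# Simplified model for illustration: age and sex only
dat <- lung[complete.cases(lung[, c("time", "status", "age", "sex")]), ]
fit <- coxph(Surv(time, status == 2) ~ age + sex, data = dat)

# Cumulative baseline hazard for a patient with all covariates equal to zero
base <- basehaz(fit, centered = FALSE)

# Covariate pattern of interest: a male patient of median age
new_patient <- data.frame(age = median(dat$age), sex = 1)
lp <- sum(coef(fit) * c(new_patient$age, new_patient$sex))

# Survival curve built by hand from the baseline hazard
surv_manual <- exp(-base$hazard * exp(lp))

# The same curve as returned by survfit(); the difference should be negligible
sfit <- survfit(fit, newdata = new_patient)
print(max(abs(surv_manual - sfit$surv)))
```

In practice survfit() is the more convenient route, since it also returns the confidence limits used in the plots above.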
Although correct, I believe that the method described in the answer of Dion Groothof is not what is usually of interest. Usually, researchers are interested in visualizing the causal effect of a variable adjusted for confounders. Simply showing the predicted survival curve for one single covariate combination does not really do the trick here. I would recommend reading up on confounder-adjusted survival curves. See https://arxiv.org/abs/2203.10002 for example.
These types of curves can be calculated in R using the adjustedCurves package: https://github.com/RobinDenz1/adjustedCurves
In your example, the following code could be used:
library(survival)
library(devtools)
# install adjustedCurves from github, load it
devtools::install_github("RobinDenz1/adjustedCurves")
library(adjustedCurves)
# "event" needs to be binary
lung$status <- lung$status - 1
# "variable" needs to be a factor
lung$ph.ecog <- factor(lung$ph.ecog)
fit <- coxph(Surv(time, status) ~ ph.ecog + age + sex, data=lung,
x=TRUE)
# calculate and plot curves
adj <- adjustedsurv(data=lung, variable="ph.ecog", ev_time="time",
event="status", method="direct",
outcome_model=fit, conf_int=TRUE)
plot(adj)
Producing the following output:
These survival curves are adjusted for the effect of age and sex. More information on how this adjustment works can be found in the documentation of the adjustedCurves package or the article I cited above.
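For intuition, the idea behind direct adjustment (also called g-computation) can be sketched with the survival package alone: set the grouping variable to one level for every patient, predict each patient's survival curve from the Cox model, and average the curves. The code below is a hand-rolled approximation under that assumption and is not the adjustedCurves implementation; dat and adjusted are illustrative names.

```r
library(survival)

# Keep complete cases for the variables used in the model
dat <- lung[complete.cases(lung[, c("time", "status", "age", "sex", "ph.ecog")]), ]
dat$ph.ecog <- factor(dat$ph.ecog)

fit <- coxph(Surv(time, status == 2) ~ ph.ecog + age + sex, data = dat)

# For each level: pretend every patient has that level, predict one survival
# curve per patient, then average the curves over the whole sample
adjusted <- sapply(levels(dat$ph.ecog), function(lev) {
  nd <- dat
  nd$ph.ecog <- factor(lev, levels = levels(dat$ph.ecog))
  sf <- survfit(fit, newdata = nd, se.fit = FALSE)
  rowMeans(sf$surv)
})

# One column of averaged survival probabilities per ph.ecog level
print(head(adjusted))
```

Because every curve is averaged over the same age and sex distribution, differences between the columns are no longer driven by differences in those covariates between groups.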

How to get f(x)=kx+d form of linear regression line?

I have a data frame df.Data with time data (the values of this data frame change each day) and I'm plotting a correlation scatter plot and the regression line with ggplot(). The R code looks like this:
set.seed(123)
## Create data frame: ##
df.Data <- data.frame(date = seq(as.Date('2018-01-01'), by = '1 day', length.out = 1100),
DE = rnorm(1100, 2, 1), AT = rnorm(1100, 5, 2))
corPearson <- cor.test(x = df.Data$DE, y = df.Data$AT, method = "pearson")
df.Data$year <- format(as.Date(df.Data$date), '%Y')
p <- ggplot(data = df.Data, aes(x = DE, y = AT, group = 1)
) +
geom_point(aes(color = year)) +
geom_smooth(method = "lm", se = FALSE, color = "#007d3c") +
theme_classic() +
theme(legend.position = "none") +
theme(panel.background = element_blank()) +
scale_colour_brewer(palette = 'Greens') +
xlab(product1) +
ylab(product2) +
ggtitle("Correlation Scatter Plot (Pearson)") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Correlation plot converting from ggplot to plotly: #
CorrelationPlot <- plotly::ggplotly(p, tooltip = "text")
The regression line is plotted with: geom_smooth(method = "lm", se = FALSE, color = "#007d3c").
The plot looks like this:
My question now is:
How do I get the function of the regression line in the form f(x) = kx + d? I have already seen this question a few times in stackoverflow, but no answer there was complete or useful. Can someone help me?
EDIT:
If I use this
reg <- lm(df.Data$AT ~ df.Data$DE)
summary(reg)
the output of the summary is:
Is d = 5.07667 (red) and k = -0.03306 (blue)? Is this correct?
How can I extract both values and construct a function like this: f(x) = kx + d = -0.03306x + 5.07667?
I need this f(x) as an output of a valueBox() in a RShiny app.
You can use the lm() function:
reg <- lm (df.Data$AT ~ df.Data$DE)
summary (reg)
When you summarize the model you can see the intercept, which is your d, and the slope, which is your k.
Let me know if this helps :)
For new predictions you can use the predict function.
Tip:
For predictions on new data you have to provide for the "newdata" argument in the predict function a data.frame with the same column names as in your formula. But if you fit your model in the following way:
reg <- lm (df.Data$AT ~ df.Data$DE)
You would have to create a new data.frame with the column name 'df.Data$DE' which is irritating.
Using the "data" argument from the lm function is more convenient:
reg <- lm(AT ~ DE, data = df.Data)
predict(reg, newdata = data.frame(DE = 2.0))
Tip:
If you are only interested in the coefficients of the model, you can also use the 'coef' method:
reg <- lm(AT ~ DE, data = df.Data)
coef(reg)
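Building on coef(), the sketch below pulls out k and d, wraps them in a function, and formats a text label that could be passed to something like valueBox(). The data are regenerated from the question's seed, and the names k, d, f and label are illustrative.

```r
# Recreate the example data from the question
set.seed(123)
df.Data <- data.frame(DE = rnorm(1100, 2, 1), AT = rnorm(1100, 5, 2))

reg <- lm(AT ~ DE, data = df.Data)

# Intercept (d) and slope (k), stripped of their names
d <- unname(coef(reg)["(Intercept)"])
k <- unname(coef(reg)["DE"])

# The regression line as an R function
f <- function(x) k * x + d

# A printable version of f(x) = kx + d, with the sign of d handled explicitly
label <- sprintf("f(x) = %.5fx %s %.5f", k, ifelse(d < 0, "-", "+"), abs(d))
print(label)
```

Evaluating f(2) should agree with predict(reg, newdata = data.frame(DE = 2)).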

Plot forecast and actual values

Hi all. I need some help from a statistics expert. I have made a simple ARIMA forecast for a few values in as, but I have fitted it on a subset of the values, train_as.
Now is there a way to plot both actual values and forecasted values here. Like the actual values in 2019 is 4,8,12,16 and the forecast is 9,10,11,12. Can we plot this?
as <- data.frame(a=c(1,2,3,4,2,4,6,8,4,8,12,16))
train_as <- as[c(1:8),]
a1 <- ts(train_as,start = c(2017,1),end = c(2017,8),frequency = 4)
fit_arima <-auto.arima(a1, trace= TRUE, ic ="aic")
print(summary(fit_arima))
checkresiduals(fit_arima)
fcst <- forecast(fit_arima,h=4)
autoplot(fcst,include = 8)
This is easy to do using the forecast package with the autolayer() function.
library(forecast)
library(ggplot2)
as <- data.frame(a = c(1, 2, 3, 4, 2, 4, 6, 8, 4, 8, 12, 16))
# Convert to a time series
y <- ts(as$a, start = 2017, frequency = 4)
# Split in two
a1 <- subset(y, end = 8)
a2 <- subset(y, start = 9)
# Fit model
fit_arima <- auto.arima(a1, ic = "aic")
# Compute forecasts
fcst <- forecast(fit_arima, h = 4)
# Plot forecasts and test set
autoplot(fcst) + autolayer(a2)
Created on 2019-12-02 by the reprex package (v0.3.0)
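Before plotting, it can help to line up the held-out values against the forecasts in a table. The sketch below does this with base R only, and it swaps in a fixed ARIMA(0,1,0) model from stats::arima() instead of auto.arima(), purely as an assumption to keep the example free of extra packages; the forecasts will therefore differ from the ones above.

```r
# Same quarterly data as in the question
y <- ts(c(1, 2, 3, 4, 2, 4, 6, 8, 4, 8, 12, 16), start = 2017, frequency = 4)

# First eight quarters for training, last four held out
train <- window(y, end = c(2018, 4))
test <- window(y, start = c(2019, 1))

# Assumed model for illustration: a random walk, ARIMA(0,1,0)
fit <- arima(train, order = c(0, 1, 0))
fc <- predict(fit, n.ahead = length(test))

# Actual and forecast values side by side, with the forecast error
comparison <- data.frame(
  time = as.numeric(time(test)),
  actual = as.numeric(test),
  forecast = as.numeric(fc$pred)
)
comparison$error <- comparison$actual - comparison$forecast
print(comparison)
```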
You can try something like this, first you create your test dataset:
test_as <- as[c(9:12),]
Now build a data.frame to plot. It holds the real data, the time, and the predicted values with their prediction intervals. Each column must have the same length as the real data, so the predicted values (and likewise the interval bounds) are padded with a leading vector of NAs whose length is the difference between the number of real and predicted values. Note that the time is generated by quarter with the zoo package:
library(zoo)
df <-
data.frame(real = as$a,
pred = c(rep(NA,length(as$a)-length(data.frame(fcst)[,1])),data.frame(fcst)[,1]),
time = zoo::as.yearqtr(seq(as.Date("2017/1/1"), as.Date("2019/12/1"), by = "quarter"), format = "%Y-%m-%d"),
Lo80 =c(rep(NA,length(as$a)-length(data.frame(fcst)[,2])),data.frame(fcst)[,2]),
Hi80 =c(rep(NA,length(as$a)-length(data.frame(fcst)[,3])),data.frame(fcst)[,3]),
Lo95 =c(rep(NA,length(as$a)-length(data.frame(fcst)[,4])),data.frame(fcst)[,4]),
Hi95 =c(rep(NA,length(as$a)-length(data.frame(fcst)[,5])),data.frame(fcst)[,5]))
Now you can plot it:
library(ggplot2)
ggplot(df, aes(time, pred, group = 1)) +
geom_line() +
geom_line(aes(time, real, group = 1), color = "red")+
geom_ribbon(aes(time, ymin = Lo95, ymax = Hi95), fill = "red", alpha = 0.25) +
geom_ribbon(aes(time, ymin = Lo80, ymax = Hi80), fill = "red", alpha = 0.25) +
theme_light()

Model predictions: how to manipulate dummy variables across time?

I'm trying to predict insect populations across a year. I've built my model (a GAM, using the package mgcv). I then built a dummy dataset and used the predict() function on it to generate the prediction. This is where I'm struggling.
My question is: how can I build a new dummy dataset that will simulate, say cold winters vs. warm winters? I have just a "temperature" parameter, and I'm not sure how to manipulate that through time (or seasons). Ideally, I'd like to create a cold winter with mean summer temperatures and a warm winter with mean summer temperatures. Any suggestions would be greatly appreciated!
Quickly, my smoothing parameters in the model are: Average temperature, humidity, and day of year (doy). I have three random effect parameters in the model too. My model, prediction, and graph generated are below.
m1 <- gam(total ~ s(temp.avg) + s(humid) + s(doy, bs="cc", k=5) +
s(trap, bs="re")+s(site, bs="re")+s(year, bs="re"),
family=nb(),gamma=1.4,method="REML",data=dfe)
N <- 200
M <- 365
pdat1 <- with(dfe, expand.grid(year = c("2013","2014","2015","2016","2017"),
humid = mean(humid, na.rm = TRUE),
temp.avg = mean(temp.avg, na.rm = TRUE),
doy = seq(min(doy), max(doy), length = M),
trap = c("a","b","c","d"),
site = c("A","B", "C", "D")))
pred1 <- predict(m1, newdata = pdat1, type = "response", se.fit=TRUE)
crit <- qt(0.975, df = df.residual(m1)) # ~95% interval critical t
pdat1 <- transform(pdat1, fitted = pred1$fit, se = pred1$se.fit)
pdat1 <- transform(pdat1,
upper = fitted + (crit * se),
lower = fitted - (crit * se))
ggplot(pdat1, aes(x = doy, y = fitted)) +
geom_line() + theme_classic()+
labs(y = "Predicted Population", x = "Day of Year") +
theme(legend.position = "top")

Add datapoints to existing scatterplot

I have an existing ggplot2 scatterplot which shows the results of a parameter from a normal database. I then want to add two additional points to this graph, which I would pass as command-line arguments to my script: script age value1 value2. I would like to show these points in red, with an "r" and an "l" geom_text label above each point. I have the following code so far but do not know how to add the finishing touches:
pkgLoad <- function(x)
{
if (!require(x,character.only = TRUE))
{
install.packages(x,dep=TRUE, repos='http://star-www.st-andrews.ac.uk/cran/')
if(!require(x,character.only = TRUE)) stop("Package not found")
}
}
pkgLoad("ggplot2")
#load current normals database
df<-data.frame(read.csv("dat_normals.txt", sep='\t', header=T))
args<-commandArgs(TRUE)
#specify what each argument is
age <- args[1]
rSBR <- args[2]
lSBR <- args[3]
# RUN REGRESSION AND APPEND PREDICTION INTERVALS
lm_fit = lm(SBR ~ Age, data = df)
sbr_with_pred = data.frame(df, predict(lm_fit, interval='prediction'))
p <- ggplot(sbr_with_pred, aes(x=Age, y=SBR)) +
geom_point(shape=19, alpha=1/4) +
geom_smooth(method = 'lm', aes(fill = 'confidence'), alpha = 0.5) +
geom_ribbon(aes(y = fit, ymin = lwr, ymax = upr,
fill = 'prediction'), alpha = 0.2) +
scale_fill_manual('Interval', values = c('green', 'blue')) +
theme_bw() +
theme(legend.position = "none")
ggsave(filename=paste("/home/data/wolf/FV_DAT/dat_results.png",sep=""))
browseURL(paste("/home/data/wolf/FV_DAT/dat_results.png",sep=""))
Essentially, I want to see if the 2 new points fall within the 95% confidence intervals from the normal database (blue ribbon)
Your example is not reproducible. Creating data and a reproducible example is really constructive; it is not a waste of time. As for the solution, I am writing up what was said in the comments: add a new layer with the new data.
# commandArgs() returns character strings, so convert them to numbers first
newdata <- data.frame(Age = as.numeric(args[1]),
SBR = as.numeric(c(args[2],args[3])))
p + geom_point(data=newdata,colour="red",size=10)
For example:
sbr_with_pred <- data.frame(Age = sample(15:36,50,rep=T),
SBR = rnorm(50))
p <- ggplot(sbr_with_pred, aes(x=Age, y=SBR)) +
geom_point(shape=19, alpha=1/4) +
geom_smooth(method = 'lm', aes(fill = 'confidence'), alpha = 0.5)
args <- c(20,rnorm(1),rnorm(1)) # one age and two SBR values
newdata <- data.frame(Age = args[1],
SBR = c(args[2],args[3]))
p + geom_point(data=newdata,colour="red",size=10)
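To check numerically whether the two new points fall inside the 95% prediction band, one option is to ask predict() for the interval at the new age and compare. This is only a sketch: the normals data are simulated, and the args vector is a hard-coded stand-in for commandArgs(TRUE).

```r
# Simulated stand-in for the normals database
set.seed(42)
normals <- data.frame(Age = sample(15:36, 50, replace = TRUE))
normals$SBR <- 3 - 0.02 * normals$Age + rnorm(50, sd = 0.3)

lm_fit <- lm(SBR ~ Age, data = normals)

# Stand-in for commandArgs(TRUE): age, right value, left value (as strings)
args <- c("20", "2.4", "5.0")
newdata <- data.frame(
  Age = as.numeric(args[1]),
  SBR = as.numeric(args[2:3]),
  label = c("r", "l")
)

# 95% prediction limits at the new age, one row per new point
limits <- predict(lm_fit, newdata = newdata, interval = "prediction", level = 0.95)

# TRUE when the point lies between the lower and upper prediction limits
newdata$inside <- newdata$SBR >= limits[, "lwr"] & newdata$SBR <= limits[, "upr"]
print(newdata)
```

The label column could then be mapped in a geom_text(data = newdata, aes(label = label)) layer to mark the two points.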
