Using predictNLS to create confidence intervals around fitted values in R?

I want to build confidence intervals around a large set of fitted values using predictNLS from the propagate package in R. As an example, I will use the data set referenced in the function documentation (https://rdrr.io/github/anspiess/propagate/man/predictNLS.html), DNase, and build a model relating density to conc:
library(propagate)
library(dplyr)
library(modelr)
DNase <- DNase
modeldna <- DNase %>%
  group_by(Run) %>%
  do(run_model = nls(density ~ a * exp(b * conc),
                     start = list(a = 1, b = 0.5),
                     data = .)) %>%
  ungroup()
I then want to attach to each row the model it belongs to, so that predictions can be added:
DNApredict <- full_join(as_tibble(DNase), modeldna, by = "Run")
Add in the predictions:
DNApredict <- DNApredict %>%
  group_by(Run) %>%
  do(add_predictions(., var = "predicted_density", first(.$run_model)))
And then I want to add the confidence interval data that predictNLS seems to provide, by giving it that same data and asking for a confidence interval around each fitted point in the predicted_density column:
confidence_interval <- predictNLS(model = modeldna, newdata = DNApredict$predicted_density, interval = "confidence")
However, the following error arises:
Error in as.list(object$call$formula) :
argument "object" is missing, with no default
Does anyone know what might be causing this? It is probably obvious to some of you what object the function is asking for, so I apologize if this is a ridiculous question. I am really hoping to use this function to create confidence intervals around a series of fitted values. Thank you very much in advance.

Since you are running an nls on each Run in the sample data set, it is easy to get a list of nls models: split each run into its own data frame and run nls on each data frame using lapply:
library(propagate)
DNase <- DNase
# split each run into its own data frame
modeldna <- split(DNase, DNase$Run)
models <- lapply(modeldna, function(d)
  nls(density ~ a * exp(b * conc),
      start = list(a = 1, b = 0.5),
      data = d))
Now we can get predictions for each point in each model just as easily by running predictNLS on each model (again inside lapply):
results <- lapply(seq_along(modeldna), function(i) {
  predictNLS(models[[i]], newdata = data.frame(conc = modeldna[[i]]$conc))
})
Because of the output structure of predictNLS, we need to extract the predictions for each row and coerce them into a data frame:
predictions <- lapply(results, function(x) {
  as.data.frame(do.call(rbind, lapply(x$prop, function(y) y$prop)))
})
Finally, we can stick our predictions (including confidence intervals) back onto the original data frame:
all_results <- do.call(rbind, lapply(seq_along(modeldna), function(i) {
  cbind(modeldna[[i]], predictions[[i]])
}))
This now gives us a complete data frame of original data points, and the relevant predictions with confidence intervals.
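As a quick sanity check, you can inspect the combined frame; the interval columns come from propagate's propagation output, with the names the plot below relies on (Mean.1 for the prediction, `2.5%` and `97.5%` for the limits):
head(all_results)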
To show this, we can plot the results in ggplot. Here we show one plot for each run, including its original data, the predicted value as a dotted line, and the 95% confidence limit as a pale blue ribbon:
library(ggplot2)
ggplot(all_results, aes(x = conc, y = density)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`),
              fill = "deepskyblue4", alpha = 0.2) +
  geom_point() +
  geom_line(aes(y = Mean.1), linetype = 2) +
  facet_wrap(~ factor(Run, levels = 1:11)) +
  theme_bw()

Related

Getting an interaction plot from a pooled lme model with mids object

Preface - I really hope this makes sense!
I ran a linear mixed-effects model using an imputed dataset (FYI, the data is a mids object imputed using mice). The model has a three-way interaction between 3 continuous variables. I am now trying to plot the interaction using the interactions::interact_plot function. However, I'm receiving an error when I run the plot code, which I believe is because the model came from a mids object and not a data frame. Does anyone know how to address this error, or whether there's a better way to get the plot I'm trying to get?
Thanks very much in advance!
MIDmod1 <- with(data = df.mids, expr = lmer(GC ~ Age + Sex + Edu + Stress*Time*HLI + (1|ID)))
summary(pool(MIDmod1))
interact_plot(
  model = MIDmod1,
  pred = Time,
  modx = Stress,
  mod2 = HLI,
  data = df.mids,
  interval = TRUE,
  y.label = 'Global cognition composite score',
  modx.labels = c('Low Baseline Stress (-1SD)', 'Moderate Baseline Stress (Mean)', 'High Baseline Stress (+1SD)'),
  mod2.labels = c('Low HLI (-1SD)', 'Moderate HLI (Mean)', 'High HLI (+1SD)'),
  legend.main = '') + ylim(-2, 2)
Error:
Error in rep(1, times = nrow(data)) : invalid 'times' argument
Note - I also get an error if I don't include the data argument (optional argument for this function).
Error in formula.default(object, env = baseenv()) : invalid formula
BTW - I am able to generate the plot when the model comes from a data frame.
Sorry, but it won't be that easy. A multiple imputation object will definitely require special treatment, and none of the many R packages that can plot interactions are likely to work out of the box.
Here’s a minimal example, adapted from the multiple imputation vignette of the marginaleffects package. (Disclaimer: I am the author.)
library(mice)
library(lme4)
library(ggplot2)
library(marginaleffects)
# insert missing data in an existing dataset and impute
iris_miss <- iris
iris_miss$Sepal.Width[sample(1:nrow(iris), 20)] <- NA
iris_mice <- mice(iris_miss, m = 20, printFlag = FALSE, seed = 1024)
iris_mice <- complete(iris_mice, "all")
# fit a model on 1 imputed dataset and use the `plot_predictions()` function
# with the `draw=FALSE` argument to extract the data that we want to plot
fit <- function(dat) {
  mod <- lmer(Sepal.Width ~ Petal.Width * Petal.Length + (1 | Species), data = dat)
  out <- plot_predictions(mod,
                          condition = list("Petal.Width", "Petal.Length" = "threenum"),
                          draw = FALSE)
  # `mice` requires a unique row identifier called "term"
  out$term <- out$rowid
  class(out) <- c("custom", class(out))
  return(out)
}
# `tidy.custom()` is needed by `mice` to combine datasets; the output of fit() already
# has the right structure and column names, so the method can simply return its input
tidy.custom <- function(x, ...) return(x)
# Fit on each imputation
mod_mice <- lapply(iris_mice, fit)
# Pool
mod_pool <- pool(mod_mice)$pooled
# Merge back some of the covariates
datplot <- data.frame(mod_pool, mod_mice[[1]][, c("Petal.Width", "Petal.Length")])
# Plot
ggplot(datplot, aes(Petal.Width, estimate, color = Petal.Length)) +
  geom_line() +
  theme_minimal()
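To double-check the merge before plotting, it can help to glance at the pooled data; mice's pool() stores the combined estimates in a column named estimate:
head(datplot[, c("Petal.Width", "Petal.Length", "estimate")])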

(R) Adding Confidence Intervals To Plots

I am using R. I am following this tutorial (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/) and I am trying to adapt the code to a similar problem.
In this tutorial, a statistical model is developed on a dataset and then used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time, status) ~ age + sex + ph.ecog + ph.karno +
                  pat.karno + meal.cal + wt.loss,
                data = a, mtry = 4, importance = "permutation",
                splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, unlist(pred[1, ]), type = "l", col = "red")
lines(r_fit$unique.death.times, unlist(pred[2, ]), col = "green")
lines(r_fit$unique.death.times, unlist(pred[3, ]), col = "blue")
From here, I would like to add confidence intervals (confidence regions) to each of these 3 curves, so that each curve is surrounded by a shaded band.
I found a previous stackoverflow post (survfit() Shade 95% confidence interval survival plot ) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...)
You can put + after geom_line and repeat the same steps for each observation.
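As a concrete (if rough) illustration with the ranger predictions from the question: ranger does not itself return pointwise confidence bounds, so the lower/upper columns below are placeholders that you would replace with real limits, e.g. obtained by bootstrap resampling:
library(ggplot2)
plot_df <- data.frame(
  time = rep(death_times, 3),
  surv = c(unlist(pred[1, ]), unlist(pred[2, ]), unlist(pred[3, ])),
  patient = factor(rep(1:3, each = length(death_times)))
)
# placeholder bounds -- substitute real bootstrap-based limits here
plot_df$lower <- pmax(plot_df$surv - 0.1, 0)
plot_df$upper <- pmin(plot_df$surv + 0.1, 1)
ggplot(plot_df, aes(x = time, y = surv, colour = patient, fill = patient)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, colour = NA) +
  geom_line()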
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html

Fit a GEE-model of type "exchangeable" with gamm

I would like to estimate a smooth effect of some covariate N in a marginal model of type "exchangeable" in R, where the clustering variable is S. From what I could find, this should be possible with:
geeglm(..., id = S, corstr = "exchangeable")
as well as:
gamm(..., correlation = corCompSymm(form = ~1|S))
Below you can find an example where the results look good, in the sense that the two estimates are quite close. However, if I use the real data our project is about, the estimated smooth effects tend to be very different. I cannot publish that here, but maybe someone can still spot some problem in the code. For instance (see below), the gamm-object says Number of Groups: 1, which worries me as there clearly is more than one cluster...
(Yes, this is the realisation of a random-effects-model by construction, but this should lead to the desired model given the answer here.)
########
## Packages
########
library(ggplot2)
library(mgcv)
library(dplyr)
library(geepack)
library(splines)
########
## Data Simulation
########
f <- function(N) {return((-200 + (N - 25)^2) / 100)}
N <- sort(sample(1:50, 10, replace = TRUE))
S <- as.character(1:10)
S_Effect <- rnorm(length(S), 0, 1)
S_Effect <- rep(S_Effect, N)
S <- rep(S, N)
N <- rep(N, N)
E <- runif(length(N))
data <- data.frame(O = rep(0, length(N)),
                   E = E,
                   N = N,
                   S = as.factor(S),
                   S_Effect = S_Effect)
for (i in 1:length(N)) {
  data$O[i] <- rbinom(1, 1, plogis(f(N[i]) + qlogis(E[i]) + S_Effect[i]))
}
data <- data %>% mutate(E = qlogis(E))
########
## Fitting
########
formula_gamm <- as.formula("O ~ 1 + offset(E) + s(N, bs = 'bs')")
model_gamm <- gamm(formula_gamm, family = binomial(), correlation = corCompSymm(form=~1|S), data = data)
model_gamm
formula_geeglm <- as.formula("O ~ 1 + offset(E) + bs(N)")
model_geeglm <- geeglm(formula_geeglm, family = binomial(), corstr = "exchangeable", id = S, data = data)
########
## Plot
########
pred_gamm <- plot.gam(model_gamm$gam, select = 1)
x <- pred_gamm[[1]]$x
pred_geeglm <- predict(model_geeglm, type = "terms", newdata = data.frame(E = rep(0,length(x)), N = x))
z <- qnorm(0.9)
tmp <- data.frame(x = x,
                  y = pred_gamm[[1]]$fit,
                  group = rep("estimate gamm", length(x)))
tmp2 <- data.frame(x = x,
                   y = as.numeric(pred_geeglm),
                   group = rep("estimate geeglm", length(x)))
tmp3 <- data.frame(x = x,
                   y = f(x),
                   group = rep("actual function", length(x)))
data_pred = bind_rows(tmp, tmp2, tmp3) %>% mutate(group = as.factor(group))
p <- ggplot(data = data_pred, aes(x = x, y = y, color = group)) +
  geom_line(size = 2) +
  xlab("N") +
  ylab("f(N)")
p
An additional question: The gamm-object contains enough information to plot a confidence-band around the estimated function, but how can I do this for the geeglm-estimate? You get something that looks reasonable if you simulate(model_geeglm, ...) and take the pointwise mean and so on, but that doesn't really satisfy me as (1) the documentation on simulate doesn't mention marginal models and (2) it is very primitive...
The GAMM is using penalised splines, so the effective degrees of freedom used by the resulting spline (smoother) are likely to be somewhat less than the requested basis dimension, which is 10 here. The GEE is fitting an unpenalised model. All else equal, the unpenalised model will be more wiggly than the penalised one.
To compare these approaches on a common footing, you need to make sure that bs() and s(N, bs = 'bs') both produce the same number of basis functions (the s() version can produce one fewer, as it removes the lack of identifiability with the intercept term, whereas you are omitting the intercept in the bs() version).
Having assured yourself that you get the same basis dimension, you can make gamm() fit an unpenalised spline by adding fx = TRUE to the s(...) term in the formula.
Having done that, both models should be estimating similar smooth effects.
However, I would suggest that you use penalisation: for the GAMM model, use fx = FALSE, and then after estimating the model run gam.check(model$gam) (replacing model with your fitted model object) and see whether the basis size check passes for the smoother.
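As a sketch of that comparison using the question's own formulas (assuming bs(N) keeps its default of three basis functions, so k = 4 in s() leaves three after the identifiability constraint):
# unpenalised B-spline with the same basis dimension as bs(N)
formula_gamm_fx <- as.formula("O ~ 1 + offset(E) + s(N, bs = 'bs', k = 4, fx = TRUE)")
model_gamm_fx <- gamm(formula_gamm_fx, family = binomial(),
                      correlation = corCompSymm(form = ~ 1 | S), data = data)
# and for the penalised fit, check whether the basis size is adequate
gam.check(model_gamm$gam)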

Segmented linear regression with discontinuous data

I have a dataset that looks to be piecewise linear. I would like to perform a segmented linear regression in R. The issue is that there is a discontinuity at the breakpoint. By using some pieces of code from this question I managed to get something, but I am not satisfied.
Dataset
Here is a dummy dataset.
NB = 100
A1 = 2 # coeff for first part
A2 = 1 # coeff for second part
B1 = 0 # intercept for first part
B2 = 300 # intercept for second part
df = data.frame(n=1:NB)
df$n = sample(500, size=NB, replace=TRUE)
df$noise = sample(20, size=NB, replace=TRUE)-10
my_func <- function(n, noise) {
  if (n < 100) {
    return(A1 * n + B1 + noise)
  } else {
    return(A2 * n + B2 + noise)
  }
}
df$fn = mapply(my_func, df$n, df$noise)
Using segmented package
This is quite straightforward, we simply perform a classical linear regression and give it to segmented.
library(segmented)
library(ggplot2)
model_segmented = segmented(lm(fn ~ n, data = df), seg.Z = ~n)
predict_segmented = data.frame(n = df$n, fn = broken.line(model_segmented)$fit)
ggplot(df, aes(x = n, y = fn)) +
  geom_point() +
  geom_line(data = predict_segmented, color = 'blue')
The resulting plot is not satisfactory: segmented expects the data to be continuous, which is not the case here, so the regression is not correct.
“Manual” method
This method is more tedious. First, we compute the breakpoint by trying all possible breakpoints and keeping the one that yields the lowest residual standard error. Then, we add a new factor to the linear regression that indicates whether the predictor variable is above or below this breakpoint.
# Computation of the break-point
Break <- sort(unique(df$n))
Break <- Break[2:(length(Break) - 1)]
d <- numeric(length(Break))
for (i in 1:length(Break)) {
  model_manual <- lm(fn ~ (n < Break[i]) * n + (n >= Break[i]) * n, data = df)
  d[i] <- summary(model_manual)[[6]]  # element 6 of summary.lm is sigma, the residual standard error
}
breakpoint <- Break[which.min(d)]
# Linear regression using this break-point
df$group <- df$n >= breakpoint
model_manual <- lm(fn ~ n * group, data = df)
dat_pred <- data.frame(n = df$n, fn = predict(model_manual, df))
ggplot(df, aes(x = n, y = fn)) +
  geom_point() +
  geom_line(data = dat_pred[dat_pred$n < breakpoint, ], color = 'blue') +
  geom_line(data = dat_pred[dat_pred$n >= breakpoint, ], color = 'blue')
This time, the regression is great: the fit respects the discontinuity.
Question
Is there a better way to achieve this goal? Can the segmented package take discontinuous data, or is there a package that can do this?
My concern is that the second method is a bit long and not very readable.
After spending a tremendous amount of time digging, I believe the chngpt package is the way to go. It can do both continuous and discontinuous segmented regressions. Link here: https://cran.r-project.org/web/packages/chngpt/vignettes/chngpt-vignette.pdf
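A rough, untested sketch based on the vignette's chngptm() interface (the "stegmented" type is the one that allows both a slope change and a jump at the threshold; check the argument names against your installed version):
library(chngpt)
fit_chngpt <- chngptm(formula.1 = fn ~ 1, formula.2 = ~ n,
                      family = "gaussian", data = df,
                      type = "stegmented")
summary(fit_chngpt)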
strucchange will detect the breakpoint using statistically valid methods. Then, you can fit each piece with whatever model you want. For example, with a seasonal time series you can apply separate ARIMA models to each segment.
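For example (a sketch; strucchange's breakpoints() assumes the observations are ordered along the running variable, so sort by n first):
library(strucchange)
df_sorted <- df[order(df$n), ]
bp <- breakpoints(fn ~ n, data = df_sorted)
# map the estimated breakpoint index back to a value of n
df_sorted$n[bp$breakpoints]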

Plotting predicted survival curves for continuous covariates in ggplot

How can I plot survival curves for representative values of a continuous covariate in a Cox proportional hazards model? Specifically, I would like to do this in ggplot using a "survfit.cox" "survfit" object.
This may seem like a question that has already been answered, but I have searched through everything in SO with the terms 'survfit' and 'newdata' (plus many other search terms). This is the thread that comes closest to answering my question so far: Plot Kaplan-Meier for Cox regression
In keeping with the reproducible example offered in one of the answers to that post:
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
df <- read.table(url, header = TRUE)
library(dplyr)
library(ggplot2)
library(survival)
library(magrittr)
library(broom)
# Identifying the 25th and 75th percentiles for prio (continuous covariate)
summary(df$prio)
# Cox proportional hazards model with other covariates
# 'prio' is our explanatory variable of interest
m1 <- coxph(Surv(week, arrest) ~ fin + age + race + prio,
            data = df)
# Creating new df to get survival predictions
# Want separate curves for the different 'fin' and 'race'
# groups as well as the 25th and 75th percentile of prio
newdf <- df %$%
  expand.grid(fin = levels(fin),
              age = 30,
              race = levels(race),
              prio = c(1, 4))
# Obtain the fitted survival curve, then tidy
# into a dataframe that can be used in ggplot
survcurv <- survfit(m1, newdata = newdf) %>%
tidy()
The problem is, that once I have this dataframe called survcurv, I cannot tell which of the 'estimate' variables belongs to which pattern because none of the original variables are retained. For example, which of the 'estimate' variables represents the fitted curve for 30 year old, race = 'other', prio = '4', fin = 'no'?
In all other examples I've seen, one usually puts the survfit object into the generic plot() function and does not add a legend. I want to use ggplot and add a legend for each of the predicted curves.
In my own dataset, the model is a lot more complex and there are a lot more curves than I show here, so as you can imagine seeing 40 different 'estimate.1'..'estimate.40' variables makes it hard to understand what is what.
Thanks for providing a well-phrased question and a good example. I'm a little surprised that tidy does a relatively poor job here of creating sensible output. Please see below for my attempt at creating some plottable data:
library(tidyr)
newdf$group <- as.character(1:nrow(newdf))
survcurv <- survfit(m1, newdata = newdf) %>%
  tidy() %>%
  gather('key', 'value', -time, -n.risk, -n.event, -n.censor) %>%
  mutate(group = substr(key, nchar(key), nchar(key)),
         key = substr(key, 1, nchar(key) - 2)) %>%
  left_join(newdf, 'group') %>%
  spread(key, value)
And then create the plot (perhaps you'd like to use geom_step instead, but there is no step-shaped ribbon, unfortunately):
ggplot(survcurv, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high,
                     col = race, fill = race)) +
  geom_line(size = 1) +
  geom_ribbon(alpha = 0.2, col = NA) +
  facet_grid(prio ~ fin)
Try defining your survcurv like this:
survcurv <-
  lapply(1:nrow(newdf),
         function(x, m1, newdata) {
           cbind(newdata[x, ], survfit(m1, newdata[x, ]) %>% tidy())
         },
         m1,
         newdf) %>%
  bind_rows()
This will include all of the predictor values as columns with the predicted estimates.
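Since each row now carries its covariate pattern, labelling the curves in ggplot is straightforward, for example:
ggplot(survcurv, aes(x = time, y = estimate,
                     colour = interaction(fin, race, prio))) +
  geom_line() +
  labs(colour = "Covariate pattern")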
