Plot_cap response curve for counterfactual data - r

The following code is a made-up example that reproduces my problem from a bigger, more complex data set.
library(marginaleffects)
library(truncnorm)
yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
summary(i)
I have used the predictions() function from the marginaleffects package to see the predicted yields (yield_kgha) for a range of nitrogen rates (n_kgha) from my regression model (i). My original data only has n_kgha rates ranging from approximately 40-250, so the code below allows me to see predicted yields at n_kgha rates not actually in my data set.
p <- predictions(
  i,
  newdata = datagrid(model = i, n_kgha = seq(0, 300, by = 25),
                     grid_type = "counterfactual"))
summary(p, by = "n_kgha")
I would like to plot the response of yield conditioned on n_kgha ranging from 0-300, which includes n_kgha values not in my original data set. I have tried to do this using the plot_cap function.
plot_cap(i, condition = "n_kgha")
However, since my original data only has n_kgha rates ranging from roughly 40-250, I am not getting my desired result of seeing the response curve over the full 0-300 range. When I plot using plot_cap I get a response curve with n_kgha ranging only from 40-250 (the min and max of the original data set).
Is there a way to run the plot_cap function based on the counterfactual range of n_kgha as used in the predictions function? Or should I use another method to plot the predicted values based on counterfactual values?

The plot_cap() function only plots over the observed range of values. However, since predictions() returns a “tidy” data frame, it is trivial to use the output of this function with ggplot2 to get the plot you need.
Note that we do not need to specify grid_type = "counterfactual". That option does something very specific (and rarely what you want): it duplicates the whole dataset once for each of the user-supplied values. It is only useful in very specific corner cases, not when you simply want predictions over unobserved values of the predictors. See the documentation with ?datagrid.
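To see the size difference, here is a quick sketch using the model i from above (assuming I recall datagrid()'s behavior correctly, the counterfactual grid should contain one copy of the full 100-row data set per supplied value):
## Regular grid: one row per requested n_kgha value, other columns held at representative values
nrow(datagrid(model = i, n_kgha = seq(0, 300, by = 25)))
## Counterfactual grid: the entire original data set duplicated once per requested value,
## i.e. about 100 * 13 = 1300 rows
nrow(datagrid(model = i, n_kgha = seq(0, 300, by = 25),
              grid_type = "counterfactual"))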
Here’s a simple example of predictions() with ggplot2 to achieve what you want:
library(marginaleffects)
library(truncnorm)
library(ggplot2)
yield_kgha <- rtruncnorm(n = 100, mean = 2000, sd = 150)
n_kgha <- rtruncnorm(n = 100, a = 40, b = 298, mean = 150, sd = 40)
i <- lm(yield_kgha ~ n_kgha + I(n_kgha^2))
p <- predictions(i, newdata = datagrid(n_kgha = seq(0, 300, 10)))
ggplot(p, aes(n_kgha, predicted)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1) +
  geom_line() +
  theme_minimal()

Related

Simulating rate data (negative-binomial distribution)?

I am attempting to simulate data that approximates rate data - that is, count data that generally fit a negative binomial distribution, combined with an offset term for survey effort.
I think I can simulate counts well using the negative-binomial function (rnbinom()), but I don't have a way to account for the offset term, which will be random for each survey. Put another way:
What is the best way to simulate non-integer rate data?
Is there a realistic way to simulate the offset term values?
Do I need to use a different distribution beyond negative-binomial to generate a realistic range of non-integer values?
Background: Our surveys measure counts of individuals per unit of survey effort (time), and the resulting rate is a non-negative, generally non-integer value (>= 0). The survey count data seem well modeled by a negative-binomial distribution, and in a GLM framework I would account for effort by using survey time as an offset term. In the simulated data below, I generate a negative-binomial distribution to represent data within my actual survey. The offset term is simulated as a random uniform variable between 2-10 (the range of search times in minutes). Rate is then calculated as counts/time.
I plot histograms of both counts and rate to help demonstrate that rates here take many fractional values between integers. Because survey counts are often correlated with survey effort, it is critical that I ultimately use the rate data for analysis (i.e. figure 'B' below).
library(tidyverse)
theme_set(theme_classic())
d = data.frame(counts = rnbinom(n = 500, mu = 5, size = 1), ## dispersion parameter 'theta' set to 1
               time = runif(500, 2, 10)) %>%
  mutate(rate = counts/time)
## Count data histogram
ggplot(d, aes(x = counts))+
geom_histogram(fill = 'peachpuff',color = 'black')+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('A: Histogram of Counts')
## Rate data histogram
ggplot(d, aes(x = rate))+
geom_histogram(binwidth = .1, fill = 'dodgerblue1',color = 'black')+
scale_x_continuous(breaks = seq(0,10,1))+
scale_y_continuous(expand = c(0,0))+
ylab('frequency')+
ggtitle('B: Histogram of Rate')
Below I can readily simulate counts from my original survey data, but I don't know how to properly simulate back data as a rate. For example, if I fit an intercept-only nbinom GLM, I can use the coefficient to simulate new negative-binomial distributions of counts that are very similar to the original data (i.e. using a similar value for 'mu').
[I realize this seems circular in this example, but this is my approach with real data. First describe the mean value and dispersion 'theta' with a GLM, then simulate back datasets that mimic my original dataset]
I use this approach below both to simulate back count data and, by fitting a model with the offset term, to simulate back a distribution that has the mean 'rate' from figure 'B'.
### Simulate back count data from the original survey data:
##describe mean value 'mu' by finding intercept
## 'theta' could also be calculated
m1 = MASS::glm.nb(counts ~ 1, data = d)
# summary(m1)
# mean(d$counts)
# exp(m1$coefficients[1])
## simulated negative-binomial distribution using calculated 'mu'
d.sim = data.frame(new.counts = rnbinom(500,
                                        mu = as.numeric(exp(m1$coefficients[1])), ## coef on log-scale, exponentiate to use
                                        size = 1)) ## holding dispersion parameter 'theta' constant at 1
## Plot and compare with plot 'A' above
ggplot(d.sim, aes(x = new.counts))+
geom_histogram(fill = 'peachpuff3',color = 'black')+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('C: Simulated Counts')
###########################################
###########################################
### Simulate back 'rate' data by including an offset term for effort in the GLM model
## the exponentiated coefficient should equal the mean of the raw rate data
m2 = MASS::glm.nb(counts ~ 1 + offset(log(time)), data = d)
# summary(m2)
# mean(d$rate)
# exp(m2$coefficients[1])
d.sim.2 = data.frame(new.counts = rnbinom(500,
                                          mu = as.numeric(exp(m2$coefficients[1])), ## coef on log-scale, exponentiate to use
                                          size = 1)) ## holding dispersion parameter 'theta' constant at 1
## compare these simulated 'rate' data (figure 'D' below) with the non-integer 'true rate' data in figure 'B'
ggplot(d.sim.2, aes(x = new.counts))+
geom_histogram(binwidth = .1, fill = 'dodgerblue3',color = 'black')+
scale_x_continuous(breaks = seq(0,10,1))+
ylab('frequency')+
scale_y_continuous(expand = c(0,0))+
ggtitle('D: Simulated Rate')
At this point I've generated figure 'C', a simulated dataset representing counts I might observe in real life, which closely matches the original data in figure 'A'. The 'rate' data in figure 'D' are (necessarily) all integer values drawn from rnbinom(), and while the mean of figure 'D' is approximately equal to the mean of figure 'B', my sense is that these two distributions are not really equivalent.
So my questions again:
Is there a way that I could instead simulate data to match figure 'B' (non-integer rate data)?
Do you think that the data in figure 'D' will work as an approximation to 'B', since the mean values (and dispersion) are similar?
For additional context, I'll be using the simulated datasets (many of them) to run other Monte-Carlo type simulation analysis (e.g. power analysis). I'm worried that if I use data generated in Figure 'D', it won't really represent what my actual survey data will be (figure 'B').
The way you generate your sample data (standing in for your empirical data) does not align with the data-generating process you describe. The count data from rnbinom(n = 500, mu = 5, size = 1) do not depend on time: mu needs to be a function of the time variable, or else the counts are independent of time.
Also, be careful with the dispersion parameter size: for rnbinom() the variance is mu + mu^2/size, so size = 1 already implies strong overdispersion, and the Poisson distribution (no overdispersion) only arises as a limiting special case when size is very large. Given your description of the DGP, there probably is overdispersion in the empirical data, so choose size to reflect the dispersion you actually observe.
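As a quick sanity check of that mean-variance relationship:
x <- rnbinom(1e5, mu = 5, size = 1)
mean(x) ## close to 5
var(x)  ## close to 5 + 5^2/1 = 30, far above the Poisson variance of 5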
To answer your first question, see the code example below. Regarding your second question: no, I don't think that would be a good idea.
library(tidyverse)
library(rstanarm)
options(mc.cores = parallel::detectCores())
n <- 1000
empirical <-
  tibble(
    time = runif(n, 2, 10),
    count = rnbinom(n = n, mu = time, size = 1) # Generate count data that actually depends on time
  ) |>
  mutate(rate = count/time)
m_stan <- stan_glm.nb(count ~ time, data = empirical)
simulated <-
  tibble(
    time = runif(n, 2, 10)
  ) %>%
  mutate(
    count = posterior_predict(m_stan, ., draws = 1) |>
      as.vector(),
    rate = count/time
  )
d <- lst(simulated, empirical) |>
  bind_rows(.id = "data")
d |>
  select(data, count, rate) |>
  pivot_longer(c(count, rate)) |>
  ggplot() +
  geom_histogram(aes(value), binwidth = .2) +
  facet_grid(data ~ name, scales = "free")
Created on 2022-02-03 by the reprex package (v2.0.1)

(R) Adding Confidence Intervals To Plots

I am using R. I am following this tutorial (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/) and I am trying to adapt the code to a similar problem.
In this tutorial, a statistical model is developed on a dataset, and this model is then used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time, status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss,
                data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <-data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, as.numeric(pred[1, ]), type = "l", col = "red")
lines(r_fit$unique.death.times, as.numeric(pred[2, ]), type = "l", col = "green")
lines(r_fit$unique.death.times, as.numeric(pred[3, ]), type = "l", col = "blue")
From here, I would like to add confidence intervals (confidence regions) to each of these 3 curves, so that each curve is surrounded by a shaded confidence band.
I found a previous stackoverflow post (survfit() Shade 95% confidence interval survival plot ) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...)
You can put + after geom_line and repeat the same steps for each observation.
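For a concrete sketch, here is a made-up example: the survival curves below are invented, and the lower/upper columns are placeholders for whatever interval estimates you can obtain for each patient (as far as I know, ranger does not return confidence intervals for survival curves directly):
library(ggplot2)
## Made-up survival curves for 3 patients; replace with your own predictions
times <- seq(0, 1000, by = 50)
surv_df <- data.frame(
  time    = rep(times, 3),
  patient = factor(rep(1:3, each = length(times))),
  surv    = c(exp(-times / 800), exp(-times / 600), exp(-times / 400)))
## Placeholder interval bounds -- substitute real lower/upper estimates here
surv_df$lower <- pmax(surv_df$surv - 0.1, 0)
surv_df$upper <- pmin(surv_df$surv + 0.1, 1)
ggplot(surv_df, aes(x = time, y = surv, color = patient)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = patient), alpha = 0.2, color = NA) +
  geom_line()
Putting all three patients in one long data frame and mapping color/fill to patient is equivalent to repeating the geom_line()/geom_ribbon() pair once per observation.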
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html

Using predictNLS to create confidence intervals around fitted values in R?

I want to build confidence intervals around a large set of fitted values using predictNLS from the propagate package in R. As an example, I will use the data set referenced in the function description (https://rdrr.io/github/anspiess/propagate/man/predictNLS.html), DNase, and build a model that takes the values conc and density as features:
library(propagate)
library(dplyr)
library(modelr)
DNase <- DNase
modeldna <- DNase %>% group_by(Run) %>%
do(run_model = nls(density ~ a * exp(b * conc),
                   start = list(a = 1, b = 0.5),
                   data = .)) %>% ungroup()
I then want to give each row the model that it is assigned to so that predictions can be added:
DNApredict <- full_join(as_tibble(DNase), modeldna, by = "Run")
Add in the predictions:
DNApredict <- DNApredict %>%
group_by(Run) %>%
do(add_predictions(., var = "predicted_density", first(.$run_model)))
And then, I want to add the confidence interval data that predictNLS seems to provide, by giving it that same data and asking it to give a confidence interval for each fitted point in the predicted_density column:
confidence_interval <- predictNLS(model = modeldna, newdata = DNApredict$predicted_density, interval = "confidence")
However, the following error arises:
Error in as.list(object$call$formula) :
argument "object" is missing, with no default
Does anyone know what might be causing this? I know that it will likely seem obvious to some of you what the object it is calling is, so I apologize if this is a ridiculous question. I am really hoping to be able to use this function to create confidence intervals around a series of fitted values. Thank you very much in advance.
Since you are running an nls() on each Run in the sample data set, it is easy to get a list of nls models by splitting each run into its own data frame and running nls() on each data frame with lapply():
library(propagate)
library(dplyr) # for the %>% pipe
DNase <- DNase
modeldna <- DNase %>% split(DNase$Run)
models <- lapply(modeldna, function(d) nls(density ~ a * exp(b * conc),
                                           start = list(a = 1, b = 0.5),
                                           data = d))
Now we can get predictions for each point in each model just as easily by running predictNLS on each model (again inside lapply)
results <- lapply(seq_along(modeldna), function(i) {
  predictNLS(models[[i]], newdata = data.frame(conc = modeldna[[i]]$conc))
})
Because of the output structure of predictNLS, we need to extract the predictions for each row and coerce them into a data frame:
predictions <- lapply(results, function(x) {
  as.data.frame(do.call(rbind, lapply(x$prop, function(y) y$prop)))
})
Finally, we can stick our predictions (including confidence intervals) back onto the original data frame:
all_results <- do.call(rbind, lapply(seq_along(modeldna),
                                     function(i) cbind(modeldna[[i]], predictions[[i]])))
This now gives us a complete data frame of original data points, and the relevant predictions with confidence intervals.
To show this, we can plot the results in ggplot. Here we show one plot for each run, including its original data, the predicted value as a dotted line, and the 95% confidence limit as a pale blue ribbon:
library(ggplot2)
ggplot(all_results, aes(x = conc, y = density)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`),
              fill = "deepskyblue4", alpha = 0.2) +
  geom_point() +
  geom_line(aes(y = Mean.1), linetype = 2) +
  facet_wrap(. ~ factor(Run, levels = 1:11)) +
  theme_bw()

Drawing 95% credible intervals for my Bayesian predictions along with points for the actual observed values of the response variable

The response variable for my dataset is comprised of observations Y[1], Y[2], ...., Y[49]. I came up with a Bayesian Hierarchical Model to make Bayesian predictions for Y[50]. I also have MCMC samples for Y[1],...,Y[49], which I can use to assess the overall fit of my Bayesian model by comparing them with the actual values of Y[1], Y[2], ...., Y[49].
Is there any way in R to draw caterpillar plots of my Bayesian predictions from the MCMC object of the hierarchical model, along with points that stand for the actual observed Y's from my original dataset?
Thank you,
First you need to extract your credible intervals for each $Y_i$ (usually this is done with the quantile() function if you're not working with a standard S3 object).
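As a minimal sketch, assuming mcmc_mat is a draws-by-49 matrix of posterior samples for Y[1], ..., Y[49] extracted from your MCMC object, and Y_obs holds the 49 observed values (both names are hypothetical):
q1 <- apply(mcmc_mat, 2, quantile, probs = 0.025) ## lower bound of the 95% interval
q2 <- apply(mcmc_mat, 2, quantile, probs = 0.975) ## upper bound of the 95% interval
estimate <- apply(mcmc_mat, 2, median)            ## posterior point estimate
actual <- Y_obs                                   ## observed responses (hypothetical name)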
Then you create the following df:
library(tidyverse)
df <- tibble(
  obs = seq(from = 1, to = 49, by = 1),
  lower = q1,
  upper = q2,
  estimate = estimate,
  actual = actual)
Then you go:
df %>% ggplot(aes(x = obs)) +
  geom_line(aes(y = actual)) +
  geom_pointrange(aes(ymin = lower, ymax = upper, y = estimate)) +
  coord_flip()
If you're doing hierarchical models, I really recommend the rstanarm package, which is compatible with the tidybayes library (which produces caterpillar plots almost automatically).
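As a rough sketch of that route, assuming post is a data frame of posterior draws with one column per Y[i] (a hypothetical name; adapt it to however you extract draws from your model):
library(tidybayes)
library(tidyr)
library(dplyr)
library(ggplot2)
post |>
  pivot_longer(everything(), names_to = "obs", values_to = "value") |>
  group_by(obs) |>
  median_qi(value) |>  ## posterior median and 95% interval per observation
  ggplot(aes(y = obs, x = value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval()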

Draw fitted Exgaussian density curve in ggplot2

I have a set of estimated parameters for an Ex-gaussian curve (i.e. mu, sigma, tau).
Currently I'm creating a visualization of that distribution by simulating data based on those parameters and plotting them in ggplot.
I would rather create a visualization that is effectively a smooth fitted ex-Gaussian curve - i.e. the density curve implied by the parameters I've estimated. The goal is that curves with the same parameters should not look different from one another, as they do when each is based on a fresh simulation.
Here is the current simulation approach I'm utilizing:
library(retimes)
library(ggplot2)
g <- rexgauss(1000,mu=1,sigma = 1,tau =1)
g <- as.data.frame(g); colnames(g) <- "obs"
ggplot(g) + geom_density(aes(x = obs), size=1, alpha=.4)
You can use stat_function from ggplot2. It takes a function in fun and parameters to pass to that function in args. It works well for situations like this, where you want to compare a simulation to a calculated distribution, because the x values you supply to aes() are automatically used as the range over which the function is drawn, so you don't have to match them up or compute the range of x values in your simulation yourself.
Here's an example with retimes::rexgauss. I also simplified your data frame creation, and put the parameters in a vector so you can use them in both the simulation and the calculated function.
My laptop is too slow to do all 1000 observations, so yours is probably smoother and closer to the calculated distribution than mine.
library(ggplot2)
exgauss_params <- c(mu = 1, sigma = 1, tau = 1)
exgauss_sim <- data.frame(obs = retimes::rexgauss(n = 100, exgauss_params))
ggplot(exgauss_sim, aes(x = obs)) +
  geom_density(aes(color = "simulated")) +
  stat_function(aes(color = "calculated"),
                fun = retimes::dexgauss, args = exgauss_params)
Created on 2018-05-18 by the reprex package (v0.2.0).
