Plotting a Dose Response Curve from Survival Data in R

I would like to make a dose-response curve using the drc package, and am stuck on how to prepare my dataset properly in order to make the plot. In particular, I'm struggling with how to get my y-axis ready.
I made up a dataframe (df) to help clarify what I would like to do.
df <- read.table("https://pastebin.com/raw/TZdjp2JX", header=T)
Load the necessary libraries for today's exercise:
library(drc)
library(ggplot2)
Let's pretend I like hummingbirds and do an experiment with different concentrations of sugar, with the goal of seeing which concentration is ideal for hummingbirds. I therefore run an experiment in a closed setting (here, column "room") with 4 different sugar concentrations (column "concentration") and 10 individual birds per concentration. I also run each experiment with 4 replicates in parallel, which is why there are 4 "rooms". After 36 hours (column "time"), I go into the room and check how many birds survived, creating a "yes/no" variable coded as 1 & 0 (here, this is my column "status"), where 1 == survived and 0 == died.
With this dataset, I specifically made it so that most survived at concentration 0, 50% survived at concentration 1, 25% survived at concentration 2, and only 10% survived at concentration 3.
My first issue is: how can I turn my y-axis, generated from my "status" column, into a percentage? I have done this when making Kaplan-Meier survival curves, but that unfortunately does not work here. Obviously, this column should go from 0% to 100% (we could call it "mortality"). After I succeed at doing this, I would like to make a dose-response curve that looks like the following (I found this example online and will copy it directly here as an example; it is from the ryegrass dataset included in R):
ryegrass.LL.4 <- drm(rootl ~ conc, data = ryegrass, fct = LL.3())
I must admit, the next steps of code are a little confusing for me.
# new dose levels as support for the line
newdata <- expand.grid(conc=exp(seq(log(0.5), log(100), length=100)))
# predictions and confidence intervals
pm <- predict(ryegrass.LL.4, newdata=newdata, interval="confidence")
# new data with predictions
newdata$p <- pm[,1]
newdata$pmin <- pm[,2]
newdata$pmax <- pm[,3]
# plot curve
# need to shift conc == 0 a bit up, otherwise there are problems with coord_trans
ryegrass$conc0 <- ryegrass$conc
ryegrass$conc0[ryegrass$conc0 == 0] <- 0.5
# plotting the curve
ggplot(ryegrass, aes(x = conc0, y = rootl)) +
geom_point() +
geom_ribbon(data=newdata, aes(x=conc, y=p, ymin=pmin, ymax=pmax), alpha=0.2) +
geom_line(data=newdata, aes(x=conc, y=p)) +
coord_trans(x="log") +
xlab("Ferulic acid (mM)") + ylab("Root length (cm)")
In the end, I would like to generate a similar curve, but with mortality on the y-axis, from 0-100 (starting low, going high), and also display the confidence intervals in a shaded grey area around the regression line. Meaning, my first step of code should look something like the following:
model <- drm(mortality ~ Concentration, data=df, fct = LL.3())
But I'm lost on the "mortality" creation part, and a little bit on the next step with ggplot.
Could anyone help me achieve this? Coming from the ryegrass example, I'm perplexed as to how to translate it to my pretend dataset. I hope someone here is able to help me solve this issue! Many thanks, and I appreciate any feedback if there are other ways I should structure my dataset, etc.
-Andy

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(drc)
#> Loading required package: MASS
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
#>
#> 'drc' has been loaded.
#> Please cite R and 'drc' if used for a publication,
#> for references type 'citation()' and 'citation('drc')'.
#>
#> Attaching package: 'drc'
#> The following objects are masked from 'package:stats':
#>
#> gaussian, getInitial
df <- read.table("https://pastebin.com/raw/sH5hCr2J", header=T)
Making the mortality, or as I do here, survival, can be done easily with the dplyr package, which helps with performing many such calculations. It seems that you are interested in calculating the percent survival for each Concentration across your four rooms (or replicates). So the first step is to group the data by these columns and then calculate the statistic we want.
df_calc <- df %>%
group_by(Concentration, room) %>%
summarise(surv = sum(Status)/n())
#> `summarise()` has grouped output by 'Concentration'. You can override using the `.groups` argument.
I don't know if Concentration represents arbitrary concentration levels, so I'm moving forward with the following assumption: 1 == higher levels of sugar, 2 == lower levels of sugar. Concentrations were coded in log space, so I convert to linear space:
df_calc <- mutate(df_calc, conc = exp(-Concentration))
Just to be clear, the conc variable is just my attempt at having something close to the true known concentrations of the experiment. If your data has the true concentrations, then don't mind this calculation.
df_calc
#> # A tibble: 12 x 4
#> # Groups: Concentration [3]
#> Concentration room surv conc
#> <int> <int> <dbl> <dbl>
#> 1 1 1 0.5 0.368
#> 2 1 2 0.4 0.368
#> 3 1 3 0.5 0.368
#> 4 1 4 0.6 0.368
#> 5 2 1 0 0.135
#> 6 2 2 0.4 0.135
#> 7 2 3 0.2 0.135
#> 8 2 4 0.4 0.135
#> 9 3 1 0.2 0.0498
#> 10 3 2 0 0.0498
#> 11 3 3 0 0.0498
#> 12 3 4 0.2 0.0498
mod <- drm(surv ~ conc, data = df_calc, fct = LL.3())
Make new conc data points
newdata <- data.frame(conc = exp(seq(log(0.01), log(10), length = 100)))
EDIT
To respond to your comment, I'll explain the above code chunk. Again, the conc variable is expected to be the unit concentration. In this hypothetical case, we have three concentration levels, c(0.049, 0.135, 0.368). For brevity, let's assume the units are mg of sugar/ml of water. Our model was fit on these three dose levels with 4 data points per dose level. If we wanted, we could have just plotted the curve between these levels, c(0.049, 0.368), but in this example I chose c(0.01, 10) mg/ml as the domain to plot on. This was just so that we could visualize where the curve would end up based on the model fit. In short, you choose the range that you are interested in most. As I show later, even though we can choose data points outside the range of the experimental data, the confidence intervals are extremely large, indicating the model will be unhelpful for those points.
The reason behind casting these values with the log() function is to ensure that we are sampling points that look evenly distributed on a log10 scale (most dose-response curves are plotted with this transformation). Once we get the sequence of 100 points, we use exp() to return to linear space (which our model was fit on). These values are then used in the predict function as the new dose levels, in conjunction with the fitted model.
All of this is saved into the newdata variable, which allows for plotting the line and the confidence intervals.
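To see the effect of the log/exp trick concretely, compare linear and log-spaced sampling over the same range (five points each, just for illustration):
# linear spacing: points bunch up on the right-hand side of a log10 axis
seq(0.01, 10, length = 5)
#> [1]  0.0100  2.5075  5.0050  7.5025 10.0000
# log spacing: each point is a constant multiple of the previous one
exp(seq(log(0.01), log(10), length = 5))
#> [1]  0.01000000  0.05623413  0.31622777  1.77827941 10.00000000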
Use the model and the generated data points to predict a new surv value as well as the upper and lower bounds:
newdata <- cbind(newdata,
suppressWarnings(predict(mod, newdata = newdata, interval="confidence")))
plot with ggplot2
ggplot(df_calc, aes(conc)) +
geom_point(aes(y = surv)) +
geom_ribbon(aes(ymin = Lower, ymax = Upper), data = newdata, alpha = 0.2) +
geom_line(aes(y = Prediction), data = newdata) +
scale_x_log10() +
coord_cartesian(ylim = c(0, 1))
As you may notice, the confidence intervals increase greatly when we try to predict over ranges that have no data.
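Since the original question asked for mortality on a 0-100 scale rather than survival, one option (a sketch only; mod_mort is an illustrative name, and whether LL.3() remains a sensible choice for your real data is for you to judge) is to flip the proportion before fitting:
df_calc <- mutate(df_calc, mortality = (1 - surv) * 100)
mod_mort <- drm(mortality ~ conc, data = df_calc, fct = LL.3())
The same newdata/predict/ggplot steps then apply, with the y-axis running from 0 to 100.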
Created on 2021-10-27 by the reprex package (v1.0.0)

Related

Understanding the Output Coefficients from a Linear Model Regression in R

I'm reading a fairly simple hypothesis-testing textbook at the moment. It explains that, in a linear model where the independent variables are two categorical variables (with 2 and 3 levels, respectively) and the dependent variable is continuous, a coefficient should be interpreted as the difference between the overall mean of the dependent variable (the mean across all categorical variables and levels) and the mean of the dependent variable for a given level of one of the categorical variables. I hope it's understandable.
However, when I try to reproduce the example in the book, I do not get the same coefficients, std. err., T- or P-values.
I created a reproducible example using the ToothGrowth dataset, where the same is the case:
library(tidyverse)
# Transforming Data to a Tibble and Change Variable 'dose' to a Factor:
tooth_growth_reprex <- ToothGrowth %>%
as_tibble() %>%
mutate(dose = as.factor(dose))
# Creating Linear Model of Variables in ToothGrowth (tg):
tg_lm <- lm(formula = len ~ supp * dose, data = tooth_growth_reprex)
# Extracting suppVC coefficient:
(coef_supp_vc <- tg_lm$coefficients["suppVC"])
#> suppVC
#> -5.25
# Calculating Mean Difference between Overall Mean and Supplement VC Mean:
## Overall Mean:
(overall_summary <- tooth_growth_reprex %>%
summarise(Mean = mean(len)))
#> # A tibble: 1 x 1
#> Mean
#> <dbl>
#> 1 18.8
## Supp VC Mean:
(supp_vc_summary <- tooth_growth_reprex %>%
group_by(supp) %>%
summarise(Mean = mean(len))) %>%
filter(supp == "VC")
#> # A tibble: 1 x 2
#> supp Mean
#> <fct> <dbl>
#> 1 VC 17.0
## Difference between Overall Mean and Supp VC Mean:
(mean_dif_overall_vc <- overall_summary$Mean - supp_vc_summary$Mean[2])
#> [1] 1.85
# Testing if supp_VC coefficient and difference between Overall Mean and Supp VC Mean is near identical:
near(coef_supp_vc, mean_dif_overall_vc)
#> suppVC
#> FALSE
Created on 2021-02-23 by the reprex package (v1.0.0)
My questions:
Am I understanding the interpretation of the coefficient values completely wrong?
What is the lm actually calculating regarding the coefficients?
Are there any functions in R that can calculate what I'm interested in, without me having to do it manually?
I hope this is enough information. If not, please don't hesitate to ask me!
The lm() function uses dummy coding, so all the coefficients in your model are compared to the reference group's mean. The reference group here is the first level of each of your factors, so supp=OJ and dose=0.5.
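To see the dummy coding R applies, you can inspect the contrasts directly; for instance, for the supp factor from the reprex above:
contrasts(tooth_growth_reprex$supp)
#>    VC
#> OJ  0
#> VC  1
The OJ row is all zeros, which is what makes it the reference level.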
You can then do this verification like so:
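Note that mean_table, used in the check below, isn't defined in this snippet; a minimal reconstruction (assuming it holds the mean of len per supp-by-dose cell in a column M) would be:
mean_table <- tooth_growth_reprex %>%
group_by(supp, dose) %>%
summarise(M = mean(len), .groups = "drop")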
coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] ==
mean_table %>% filter(supp=='VC' & dose==0.5) %>% pull(M)
(coef(tg_lm)["(Intercept)"] + coef(tg_lm)["suppVC"] + coef(tg_lm)["dose1"] + coef(tg_lm)["suppVC:dose1"]) ==
mean_table %>% filter(supp=='VC' & dose==1) %>% pull(M)
You can read up on the differences between coding schemes here

Comparing point patterns, each with differing windows

I am working on patterning statistics in R, using the spatstat package. I have a bunch of ppp objects, and would like to compare them all to find slight differences in the patterns of them that I might miss by just looking at the heatmaps, etc. I would also like to quantify the differences between the patterns somehow.
One problem is that the windows are differently shaped (slightly) for each pattern.
I am applying spatstat to plant leaves; here are the resulting tessellations to give you an idea of window shape, etc.:
[Images: tessellations for Leaf 1, Leaf 2, and Leaf 3]
How would I go about comparing the patterns, and seeing where they differ?
I would also like to see, for example, if after analyzing 10 patterns, there's commonly a band of increased density across the midsection of the leaf, that is hard to detect by simply looking at individual density images. Is there a way to go about this?
Some ideas with two artificial datasets:
library(spatstat)
set.seed(42)
W1 <- ellipse(1,2)
W2 <- rotate(W1, angle = pi/4)
P1 <- rpoispp(20, win = W1)
P2 <- rpoispp(20, win = W2)
plot(solist(P1=P1, P2=P2), main = "", equal.scales = TRUE)
Manually make a relevant center line for each window. (You can use ends <- clickppp(2, add = TRUE) on top of a plot of each individual pattern to interactively choose the end points by clicking the plot, and then use the coordinates to create the line with psp()):
L1 <- psp(0, 1, 0, -1, window = W1)
L2 <- rotate(L1, angle = pi/4)
plot(solist(L1=L1, L2=L2), main = "", equal.scales = TRUE)
Define the distance from the center line:
D1 <- distfun(L1)
D2 <- distfun(L2)
plot(solist(D1=D1, D2=D2), main = "", equal.scales = TRUE)
Then you can fit point process models with this covariate.
E.g. the simple log-linear model for one of the datasets:
ppm(P1 ~ D1)
#> Nonstationary Poisson process
#>
#> Log intensity: ~D1
#>
#> Fitted trend coefficients:
#> (Intercept) D1
#> 3.1368706 -0.1768286
#>
#> Estimate S.E. CI95.lo CI95.hi Ztest Zval
#> (Intercept) 3.1368706 0.1912389 2.7620491 3.5116920 *** 16.4028872
#> D1 -0.1768286 0.3328005 -0.8291056 0.4754484 -0.5313351
Where the log-linear effect of distance to the center is insignificant as
expected (since the data was generated with homogeneous intensity).
From here you can explore different types of models (e.g. proportional to
distance via offset(), Gibbs models, models for replicated experiments via
mppm(), which may be very relevant here, etc.). E.g. a joint log-linear
model for the two datasets:
dat <- hyperframe(points = list(P1, P2), linedist = list(D1, D2))
mppm(points ~ linedist, data = dat)
#> Point process model fitted to 2 point patterns
#> Call: mppm(points ~ linedist, data = dat)
#> Log trend formula: ~linedist
#> Fitted trend coefficients:
#> (Intercept) linedist
#> 2.98511651 0.06387864
#>
#> Interaction for all patterns: Poisson process
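As one sketch of the Gibbs direction mentioned above, you could add a Strauss inhibition term on top of the distance trend (the interaction radius of 0.1 here is an arbitrary choice for illustration, not a recommendation):
# log-linear trend in distance to the center line, plus Strauss inhibition
ppm(P1 ~ D1, interaction = Strauss(r = 0.1))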
Chapters 9 (freely available online), 13, and 16 in the spatstat book may be useful (disclaimer: I'm a co-author).

How to estimate residuals of subgroup with lme4 in R

I'd like to reproduce the results reported in Hoffman & Rovine's work (Multilevel models for the experimental psychologist: Foundations and illustrative examples) with the lme4 package in R.
In their first example they compared reaction times between elders and young people. Each of their participants has many task trials, so, at the individual level, participants' reaction times were affected by various variables related to their manipulation of trials. At the second level, participants' age and age group would affect participants' reaction times.
In Hoffman's model 2B, they estimate first-level residuals for elders and young people separately, with two dummy variables for young and old people.
Hoffman's equation is: [image: level-1 equation]
I'd like to know how to estimate these two group-specific residual variances with the lme4 package.
Hoffman's article and example data can be found on Hoffman's website.
I've successfully replicated their result for model 2A, where the residual variances of young and old people were assumed to be the same, with the following code.
lmer(lg_rt ~ c_mean + c_sal + (1|Item) + oldage + yrs65 + (1|id), Ex1, REML = F)
You can handle heteroscedasticity in lme4 using the modular fitting functions. Here is an example with two groups, which should be extendable to other types of heteroscedasticity. Note that although the weights are estimated, the uncertainty about the weights is not taken into account in the standard errors of the parameters in the final fit. It should be possible to address this using the delta method; see e.g. the first equation in Section 2.3.3 of https://doi.org/10.3102/1076998611417628.
set.seed(1234)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(lme4)
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
n <- 100 # number of level-2 units
m <- 3 # number of repeated observations per unit
sd_b <- .3 # random intercept standard deviation
sd_eps1 <- .1 # residual standard deviation in group 1
sd_eps2 <- .3 # residual standard deviation in group 2
# Simulate data
dat <- tibble(
# unique ID
id = seq_len(n),
# explanatory variable, constant over repetitions
x = runif(n),
# random intercept
b = rnorm(n, sd = sd_b),
# group membership
grp = sample(1:2, n, replace = TRUE)
) %>%
uncount(3) %>% # repeat each row 3 times (m observations per unit)
mutate(
# residual
eps = rnorm(nrow(.), sd = c(sd_eps1, sd_eps2)[grp]),
# response, fixed effect is beta=1
y = x + b + eps
)
# now optimize over residual weights, fixing the group 1 weight to 1.
# optimize() would be sufficient, but I show it with optim() because it
# then can be directly extended to a larger number of groups
opt <- optim(
# initial value for group 2 residual relative to group 1
par = 1,
fn = function(weight){
# Compute weights from group variable
df <- dat %>%
mutate(weight = c(1, weight)[grp])
## 1. Parse the data and formula:
lmod <- lFormula(y ~ x + (1|id), data = df, weights = df$weight)
## 2. Create the deviance function to be optimized:
devfun <- do.call(mkLmerDevfun, lmod)
## 3. Optimize the deviance function:
opt <- optimizeLmer(devfun)
# return the deviance
opt$fval
},
# Use a method that allows box constraints
method = "L-BFGS-B",
# Weight cannot be negative
lower = 0.01
)
# The weight estimates the following ratio, and it is pretty close
sd_eps1^2/sd_eps2^2
#> [1] 0.1111111
opt$par
#> [1] 0.1035914
# We can now fit the final model at the chosen weights
df <- dat %>%
mutate(weight = c(1, opt$par)[grp])
mod <- lmer(y ~ x + (1|id), data = df, weights = df$weight)
# Our estimate of sd_eps1
sigma(mod)
#> [1] 0.09899687
# True value
sd_eps1
#> [1] 0.1
# Our estimate of sd_eps2
sigma(mod) * sqrt(1/opt$par)
#> [1] 0.307581
# True value
sd_eps2
#> [1] 0.3
Created on 2021-02-10 by the reprex package (v1.0.0)

Plot Kaplan-Meier for Cox regression

I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B, and C (i.e. age, sex, race) are added simply to avoid confounding, but we are really interested in the predictor X, which is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having trouble plotting a Kaplan-Meier curve for this. I've been searching for how to create this figure but I haven't had much luck. I'm not sure if plotting a Kaplan-Meier curve for a Cox model is possible? Does the Kaplan-Meier adjust for my covariates or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document, the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
The request for an estimated survival curve at the 25th and 75th percentiles of X first requires determining those percentiles and specifying values for all the other covariates in a dataframe to be used as the newdata argument to survfit():
You can use the data suggested by the other respondent from Fox's website, although on my machine it required building a url object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this question, but it does have a numeric variable for which we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
Setting the values of the confounders to a fixed value and plotting the predicted survival probabilities at multiple points in time for given values of X (as @IRTFM suggested in his answer) results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for, and I don't think that is what the original poster wanted. Usually we are interested in average causal effects. In other words: what would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the Cox model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the Cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface, which we can then visualize using color scales.
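For the mechanics of that averaging step, here is a minimal sketch with the survival package, reusing cox.model and df from the question (the helper gcomp_surv and the choice t = 5 are purely illustrative):
library(survival)
# average survival at time t if everyone's X were set to x (g-computation)
gcomp_surv <- function(x, t) {
df_x <- df
df_x$X <- x                               # intervene: set X = x for all rows
sf <- survfit(cox.model, newdata = df_x)  # one predicted curve per row of df_x
mean(summary(sf, times = t)$surv)         # average the predictions over the sample
}
# e.g. compare the 25th and 75th percentiles of X at t = 5
gcomp_surv(quantile(df$X, 0.25), t = 5)
gcomp_surv(quantile(df$X, 0.75), t = 5)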
This can be done using the contsurvplot R package I developed, as discussed in this previous answer (Converting survival analysis by a continuous variable to categorical) or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on this topic: https://arxiv.org/pdf/2208.04644.pdf

Having issues using the lme4 predict function on my mixed models

I'm having a bit of a struggle trying to use the lme4 predict function on my mixed models. When making predictions I want to be able to set some of my explanatory variables to a specified level but average across others.
Here’s some made up data that is a simplified, nonsense version of my original dataset:
a <- data.frame(
TLR4=factor(rep(1:3, each=4, times=4)),
repro.state=factor(rep(c("a","j"),each=6,times=8)),
month=factor(rep(1:2,each=8,times=6)),
sex=factor(rep(1:2, each=4, times=12)),
year=factor(rep(1:3, each =32)),
mwalkeri=(sample(0:15, 96, replace=TRUE)),
AvM=(seq(1:96))
)
The AvM number is the water vole identification number. The response variable (mwalkeri) is a count of the number of fleas on each vole. The main explanatory variable I am interested in is Tlr4 which is a gene with 3 different genotypes (coded 1, 2 and 3). The other explanatory variables included are reproductive state (adult or juvenile), month (1 or 2), sex (1 or 2) and year (1, 2 or 3). My model looks like this (of course this model is now inappropriate for the made up data but that shouldn't matter):
install.packages("lme4")
library(lme4)
mm <- glmer(mwalkeri~TLR4+repro.state+month+sex+year+(1|AvM), data=a,
family=poisson, control=glmerControl(optimizer="bobyqa"))
summary(mm)
I want to make predictions about parasite burden for each different Tlr4 genotype while accounting for all the other covariates. To do this I created a new dataset to specify the level I wanted to set each of the explanatory variables to and used the predict function:
b <- data.frame(
TLR4=factor(1:3),
repro.state=factor(c("a","a","a")),
month=factor(rep(1, times=3)),
sex=factor(rep(1, times=3)),
year=factor(rep(1, times=3))
)
predict(mm, newdata=b, re.form=NA, type="response")
This did work but I would really prefer to average across years instead of setting year to one particular level. However, whenever I attempt to average year I get this error message:
Error in model.frame.default(delete.response(Terms), newdata, na.action = na.action, : factor year has new level
Is it possible for me to average across years instead of selecting a specified level? Also, I've not worked out how to get the standard error associated with these predictions. The only way I've been able to get standard error for predictions was using the lsmeans() function (from the lsmeans package):
c <- lsmeans(mm, "TLR4", type="response")
summary(c, type="response")
Which automatically generates the standard error. However, this is generated by averaging across all the other explanatory variables. I'm sure it's probably possible to change that, but I would rather use the predict() function if I can. My goal is to create a graph with Tlr4 genotype on the x-axis and predicted parasite burden on the y-axis, to demonstrate the predicted differences in parasite burden for each genotype while all other significant covariates are accounted for.
You might be interested in the merTools package which includes a couple of functions for creating datasets of counterfactuals and then making predictions on that new data to explore the substantive impact of variables on the outcome. A good example of this comes from the README and the package vignette:
Let's take the case where we want to explore the impact of a model with an interaction term between a category and a continuous predictor. First, we fit a model with interactions:
data(VerbAgg)
fmVA <- glmer(r2 ~ (Anger + Gender + btype + situ)^2 +
(1|id) + (1|item), family = binomial,
data = VerbAgg)
Now we prep the data using the draw function in merTools. Here we draw the average observation from the model frame. We then wiggle the data by expanding the dataframe to include the same observation repeated but with different values of the variable specified by the var parameter. Here, we expand the dataset to all values of btype, situ, and Anger.
# Select the average case
newData <- draw(fmVA, type = "average")
newData <- wiggle(newData, var = "btype", values = unique(VerbAgg$btype))
newData <- wiggle(newData, var = "situ", values = unique(VerbAgg$situ))
newData <- wiggle(newData, var = "Anger", values = unique(VerbAgg$Anger))
head(newData, 10)
#> r2 Anger Gender btype situ id item
#> 1 N 20 F curse other 5 S3WantCurse
#> 2 N 20 F scold other 5 S3WantCurse
#> 3 N 20 F shout other 5 S3WantCurse
#> 4 N 20 F curse self 5 S3WantCurse
#> 5 N 20 F scold self 5 S3WantCurse
#> 6 N 20 F shout self 5 S3WantCurse
#> 7 N 11 F curse other 5 S3WantCurse
#> 8 N 11 F scold other 5 S3WantCurse
#> 9 N 11 F shout other 5 S3WantCurse
#> 10 N 11 F curse self 5 S3WantCurse
Now we simply pass this new dataset to predictInterval in order to generate predictions for these counterfactuals. Then we plot the predicted values against the continuous variable, Anger, and facet and group on the two categorical variables situ and btype respectively.
plotdf <- predictInterval(fmVA, newdata = newData, type = "probability",
stat = "median", n.sims = 1000)
plotdf <- cbind(plotdf, newData)
ggplot(plotdf, aes(y = fit, x = Anger, color = btype, group = btype)) +
geom_point() + geom_smooth(aes(color = btype), method = "lm") +
facet_wrap(~situ) + theme_bw() +
labs(y = "Predicted Probability")
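If you would still rather stay with predict(), one workaround for averaging across years is to predict at every year level and average the results by genotype; a sketch reusing mm and b from the question (b_all and p are just illustrative names):
# expand b to all combinations of TLR4 and year, predict, then average over year
b_all <- b[rep(1:3, times = 3), ]
b_all$year <- factor(rep(1:3, each = 3))
p <- predict(mm, newdata = b_all, re.form = NA, type = "response")
tapply(p, b_all$TLR4, mean)  # mean predicted burden per TLR4 genotype
This still gives no standard errors, which is where predictInterval() above comes in.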
