Interpreting interactions in a regression model - r

A simple question I hope.
I have an experimental design where I measure some response (say, blood pressure) in two groups: a control group and an affected group, where each group is given three treatments: t1, t2, t3. The data are not paired in any sense.
Here is some example data:
set.seed(1)
df <- data.frame(
  response = c(rnorm(5, 10, 1), rnorm(5, 10, 1), rnorm(5, 10, 1),
               rnorm(5, 7, 1), rnorm(5, 5, 1), rnorm(5, 10, 1)),
  group = as.factor(c(rep("control", 15), rep("affected", 15))),
  treatment = as.factor(rep(c(rep("t1", 5), rep("t2", 5), rep("t3", 5)), 2))
)
What I am interested in is quantifying the effect that each treatment has on the affected group relative to the control group. How would I model this, say using a linear model (for example, lm in R)?
Am I wrong in thinking that this model:
lm(response ~ 0 + treatment * group, data = df)
which is equivalent to:
lm(response ~ 0 + treatment + group + treatment:group, data = df)
is not what I need? I think that in this model the treatment:group interaction terms are measured relative to the baseline group and baseline treatment means.
I therefore thought that this model:
lm(response ~ 0 + treatment:group, data = df)
is what I need, but it quantifies every combination of treatment and group separately: treatmentt1:groupcontrol, treatmentt1:groupaffected, treatmentt2:groupcontrol, treatmentt2:groupaffected, treatmentt3:groupcontrol, treatmentt3:groupaffected.
So perhaps this model:
lm(response ~ 0 + treatment + treatment:group, data = df)
is the correct one?
Although, in addition to quantifying each treatment:groupaffected interaction term, it also quantifies the effect of each treatment on its own, and I'm not sure what baseline each of the treatment:groupaffected interaction terms is compared against in this model.
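(A quick way to see which baseline each term refers to is to inspect the dummy coding that lm() builds for the formula, using the df above:)
head(model.matrix(~ 0 + treatment + treatment:group, data = df))
# each column corresponds to one model coefficient; each row shows which
# coefficients apply to that observation's treatment/group cell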
Help would be appreciated.
Also, let's say I ran a fourth treatment which is actually a combination of two treatments, say t1+t3, where I don't know in advance whether their combined effect is additive, subtractive, or synergistic. Is there any way this can be incorporated into the model?

The interaction term tells you that the difference between groups is dependent on treatment, that is, that the difference between affected and control is not the same for t1, t2 and t3.
I would keep the intercept in the model, though.
lm(response ~ group + treatment + group:treatment, data=df)
After finding a significant interaction term, I would use t tests to investigate further and to help with interpretation.
As can be seen below, the interaction is driven by the larger effect of t2 relative to the other treatments.
library(ggplot2)

set.seed(1)
df <- data.frame(
  response = c(rnorm(5, 10, 1), rnorm(5, 10, 1), rnorm(5, 10, 1),
               rnorm(5, 7, 1), rnorm(5, 5, 1), rnorm(5, 10, 1)),
  group = as.factor(c(rep("control", 15), rep("affected", 15))),
  treatment = as.factor(rep(c(rep("t1", 5), rep("t2", 5), rep("t3", 5)), 2))
)
# t tests of the desired comparisons to see if there is a difference and get 95% confidence intervals
t.test(df$response[df$treatment=="t1"] ~ df$group[df$treatment=="t1"])
t.test(df$response[df$treatment=="t2"] ~ df$group[df$treatment=="t2"])
t.test(df$response[df$treatment=="t3"] ~ df$group[df$treatment=="t3"])
# plot 95% C.I. of the group difference for each treatment
ci_plot <- matrix(nrow = 3, ncol = 3)
ci_plot <- as.data.frame(ci_plot)
colnames(ci_plot) <- c("treatment", "lci", "uci")
ci_plot[, 1] <- c("t1", "t2", "t3")
ci_plot[, 2] <- c(t.test(df$response[df$treatment=="t1"] ~ df$group[df$treatment=="t1"])$conf.int[1],
                  t.test(df$response[df$treatment=="t2"] ~ df$group[df$treatment=="t2"])$conf.int[1],
                  t.test(df$response[df$treatment=="t3"] ~ df$group[df$treatment=="t3"])$conf.int[1])
ci_plot[, 3] <- c(t.test(df$response[df$treatment=="t1"] ~ df$group[df$treatment=="t1"])$conf.int[2],
                  t.test(df$response[df$treatment=="t2"] ~ df$group[df$treatment=="t2"])$conf.int[2],
                  t.test(df$response[df$treatment=="t3"] ~ df$group[df$treatment=="t3"])$conf.int[2])
ggplot(ci_plot, aes(x = treatment)) +
  geom_errorbar(aes(ymin = lci, ymax = uci), width = 0.5) +
  xlab("Treatment") +
  ylab("Change in mean relative to control (95% C.I.)") +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        axis.text.x = element_text(angle = 90, hjust = 1))
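If you want a single overall test of whether the group difference depends on treatment before moving to the pairwise t tests, one option (a minimal sketch using the same df as above) is an F test comparing the models with and without the interaction:
fit_full <- lm(response ~ group + treatment + group:treatment, data = df)
fit_main <- lm(response ~ group + treatment, data = df)
anova(fit_main, fit_full)  # F test of the group:treatment interaction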

Your first specification is fine.
lm(response ~ 0 + treatment * group, data = df)
Call:
lm(formula = response ~ 0 + treatment * group, data = df)
Coefficients:
treatmentt1 treatmentt2 treatmentt3
7.460 5.081 9.651
groupcontrol treatmentt2:groupcontrol treatmentt3:groupcontrol
2.670 2.384 -2.283
The first coefficient, 7.460, is the estimated mean response for participants who are treated with t1 and are in the affected group. Going from left to right, the second coefficient, 5.081, is the estimated mean for participants treated with t2 in the affected group, and so on.
So, for example, the estimated mean for a participant treated with t2 in the control group is 5.081 + 2.670 + 2.384 (the t2/affected mean, plus the control offset, plus the additional t2-specific control offset).
If I were doing this analysis, I would keep the intercept.
Call:
lm(formula = response ~ treatment * group, data = df)
Coefficients:
(Intercept) treatmentt2 treatmentt3
7.460 -2.378 2.192
groupcontrol treatmentt2:groupcontrol treatmentt3:groupcontrol
2.670 2.384 -2.283
Now the second coefficient, going from left to right, represents the effect for participants treated with t2 and affected relative to participants treated with t1 and affected. To see this, notice that 7.460 - 2.378 = 5.081 (the second coefficient in the first specification, up to rounding). I like this approach because it makes the relative effects easier to interpret.
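If what you ultimately want are the affected-vs-control differences under each treatment (as in the original question), a small variation on this, sketched here rather than taken from the output above, is to make control the reference level of group so that groupaffected and its interactions read directly as affected-minus-control effects:
df$group <- relevel(df$group, ref = "control")
fit <- lm(response ~ treatment * group, data = df)
coef(fit)
# groupaffected             : affected - control difference under t1
# treatmentt2:groupaffected : how much that difference changes under t2
# treatmentt3:groupaffected : how much that difference changes under t3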
That all being said, @MrFlick is right: this is really a question for Cross Validated.

Related

LMM on ChickWeight data

I would like to write a model with random intercepts and random slopes with respect to time. I am not sure if my code is correct.
model4<-lmer(weight~Time + Diet + Time*Diet + (1+Time|Chick), data = Data, REML = TRUE)
summary(model4)
Yes, that is the correct specification for those random effects. You can check this by fitting a similar model, temporarily removing the fixed effect of diet and the interaction between time and diet:
library(lme4)
library(ggplot2)

model4 <- lmer(weight ~ Time + (1 + Time | Chick), data = ChickWeight, REML = TRUE)
Column-bind the predictions from this simpler model onto the original data, and select five random chicks to plot:
weight_hat = predict(model4)
cw = cbind(ChickWeight,weight_hat)
random_chicks = sample(unique(cw$Chick),5)
ggplot(cw[cw$Chick %in% random_chicks,], aes(Time, color=Chick)) +
geom_point(aes(y=weight), size=2) +
geom_line(aes(y=weight_hat), size=1.5) +
theme(legend.position="bottom")+
guides(color=guide_legend(nrow=1))
You can see that the intercept and slope for each Chick differ.
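As a further check (a small sketch, assuming the model4 fitted above), the per-chick intercepts and slopes can also be inspected directly rather than plotted:
head(coef(model4)$Chick)  # fixed effects combined with each chick's random deviations
VarCorr(model4)           # variance components of the random intercept and slope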

Accelerated Failure Time modelling: plotting survival probabilities with CIs (example provided)

As the proportional hazards assumption is violated in my real data, I am using an AFT model and trying to calculate adjusted survival probabilities for the study groups of interest. The example below uses the kidney data, and I tried to follow the ciTools vignette.
library(tidyverse)
library(ciTools)
library(here)
library(survival)
library(survminer)
#data
kidney
Model
fit1 = survreg(Surv(time, censored) ~ sex + age + disease, data = kidney)
Call:
survreg(formula = Surv(time, censored) ~ sex + age + disease,
data = kidney)
Coefficients:
(Intercept) sexfemale age diseaseGN diseaseAN diseasePKD
8.41830937 -0.93959839 -0.01578812 -0.25274448 -0.38306425 -0.32830433
Scale= 1.642239
Loglik(model)= -122.1 Loglik(intercept only)= -122.7
Chisq= 1.33 on 5 degrees of freedom, p= 0.931
n= 76
Adding survival probabilities for both sexes for surviving at least 365 days
probs = ciTools::add_probs(kidney, fit1, q = 365,
name = c("prob", "lcb", "ucb"),
comparison = ">")
probs
Trying to plot one-year survival probabilities for both sexes, but there are multiple point estimates for geom_point. It seems to me that these point estimates are given for each age value. Can I edit the prediction so that it is made at the mean or median age?
probs %>% ggplot(aes(x = prob, y = sex)) +
ggtitle("1-year survival probability") +
xlim(c(0,1)) +
theme_bw() +
geom_point(aes(x = prob), alpha = 0.5, colour = "red")+
geom_linerange(aes(xmin = lcb, xmax = ucb), alpha = 0.5)
However, this approach seems to work with a simple model
fit2 = survreg(Surv(time, censored) ~ sex, data = kidney)
probs2 = ciTools::add_probs(kidney, fit2, q = 365,
name = c("prob", "lcb", "ucb"),
comparison = ">")
probs2 %>% ggplot(aes(x = prob, y = sex)) +
ggtitle("1-year survival probability") +
xlim(c(0,1)) +
theme_bw() +
geom_point(aes(x = prob), alpha = 0.5, colour = "red")+
geom_linerange(aes(xmin = lcb, xmax = ucb), alpha = 0.5)
Questions:
1. How can I get adjusted survival probabilities for both sexes? If this is impossible, what would the alternatives be? Code would help with alternatives.
2. If I want adjusted survival probabilities for both sexes at different time points, should I simply change the "q" value in the ciTools::add_probs() function (for example, q = 30 for one month, q = 90 for three months, etc.), or should I run a separate model for each time period?
The way you have set up these models, the predictions are being returned for each individual in the kidney data set, based on the covariate values included in the model.
In your first model, you have included sex + age + disease so that you get a prediction for each combination of those 3 covariate values in your data set.
In the second model, you have only included sex as a predictor, so you only get predictions based on sex.
Survival model prediction functions allow you to specify a set of covariate values from which to predict in a new data frame. According to the manual for add_probs.survreg, you do so by specifying a new data frame with specified covariate values in the df argument to the function. You used the kidney data frame there, so you got predictions for all those cases.
I'm not familiar with ciTools::add_probs specifically, but such software typically will (without warning) accept your values for the covariates you specify and then use some type of "average" value for the covariates that you don't specify. As "averages" don't have much meaning for categorical covariates like disease, it's usually better to specify a complete useful set of values for all covariates yourself.
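For example, a minimal sketch of that idea with ciTools (the factor levels "male", "female" and "Other" below are assumptions about how sex and disease are coded in your fit1; adjust them to match your data):
newdat <- expand.grid(sex = c("male", "female"),   # assumed factor levels
                      age = median(kidney$age),    # hold age at its median
                      disease = "Other")           # assumed reference level
probs_adj <- ciTools::add_probs(newdat, fit1, q = 365,
                                name = c("prob", "lcb", "ucb"),
                                comparison = ">")
probs_adj
# For other horizons (your second question), call add_probs() again on the same
# fit with a different q, e.g. q = 30 or q = 90; no new model fit is needed.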
The functions in the rms package in R often do a better job at providing useful predictions, as they choose typical rather than "average" values for unspecified covariates, based on an initial evaluation of the data set by a datadist() function whose output you then specify as a system option. The learning curve for this package is a bit steep but well worth it if you will be doing a lot of survival or other regression modeling.

How do I interpret my categorical coefficient in my mixed-effects linear model in R?

I would like to know how to interpret my coefficient 'Diet' in this multi-level model. The 'Diet' category is 1-4, and refers to what diet the chick is on. Time is in days, and weight is in grams. So the chicks all increase in weight over time, but at different rates due to different diets. 'Chick' is a chick's unique ID.
Using the code below, you should get the estimated coefficients: intercept = 23.018, Time = 8.443 and Diet = 2.979.
I can see that as Time increases by 1 unit, weight increases by 8.443. However, how can this be true for Diet, being a categorical variable, when diet 3 leads to a larger weight increase than diet 4? (I know this from plotting the data; see the code below.)
Perhaps it is a modelling problem and I'm doing something wrong. Does the Diet variable need to be text (a factor) so that R dummy-codes it?
Info about the data is here if you need it: http://vincentarelbundock.github.io/Rdatasets/doc/datasets/ChickWeight.html
Thanks.
library(tidyverse)
library(lme4)
library(lmerTest)
chickdiet <- read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/ChickWeight.csv')
chickm3 <- lmer(weight ~ Time + Diet + (Time | Chick), data = chickdiet)
summary(chickm3)
# From plotting the data with the code below, I can see that the diets, in ascending order of how much they increase the chicks' weight, are 1, 2, 4, 3
ggplot(chickdiet, aes(x = Time, y = weight, colour = as.factor(Diet))) + geom_point() +
stat_smooth(method = lm, se = F) + theme_minimal()
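A sketch of the fix suggested by that last question, assuming read_csv() imported Diet as a number: convert it to a factor so that R dummy-codes it, which gives one coefficient per diet (relative to diet 1) instead of a single numeric slope.
chickdiet$Diet <- as.factor(chickdiet$Diet)
chickm4 <- lmer(weight ~ Time + Diet + (Time | Chick), data = chickdiet)
summary(chickm4)  # now reports Diet2, Diet3, Diet4 contrasts against diet 1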

How to add weight to variable for GAM model?

I am running a GAM based on a large dataset with many variables. My response variable is the level of "recruitment" by a herd every fall/autumn, calculated as the fawn:female ratio each fall/autumn over a 60-year period.
My problem is that there are many years and study sites where only 1 to 10 females are recorded, so the ratio is not trustworthy. For example, if one female and one fawn are seen, recruitment is 100%, but if one more female is seen, it drops to 50%!
I need to tell the model that years/study sites with smaller sample sizes should be weighted less than those with larger sample sizes, as these small sample sizes are no doubt affecting the results.
[Table of females observed every year and accompanying histogram not shown.]
My model is as follows:
gamFIN <- gam(Fw.FratioFall
~ s(year)
+ s(percentage_woody_coverage)
+ s(kmRoads.km2)
+ s(WELLS_ACTIVEinsideD)
+ s(d3)
+ s(WT_DEER_springsurveys)
+ s(BadlandsCoyote.1000_mi)
+ s(Average_mintemp_winter, BadlandsCoyote.1000_mi)
+ s(BadlandsCoyote.1000_mi, WELLS_ACTIVEinsideD)
+ s(BadlandsCoyote.1000_mi, d3)
+ s(YEAR, bs = "re") + s(StudyArea, bs = "re"), method = "REML", select = T, data = mydata)
How might I tell the model to weight my response variable by the sample sizes it is based on?
Do not model this as a ratio for your outcome. Instead, model the fawn counts as your outcome and account for the female counts via an offset() term, using logged values, on the RHS of the formula. You should be offsetting with the log of the female count. So the formula would look like this:
Fawns
~ s(year)
+ all_those_smooth_terms
+ offset( lnFemale_counts)
With a Poisson family the GAM uses a log link, which is the reason for logging the female counts before putting them in the offset.
Edit (Gavin is correct: the default family for gam() is gaussian with an identity link, not a log link, so the Poisson family has to be specified explicitly):
gamFIN <- gam(FawnFall ~ s(year) + s(percentage_woody_coverage) + s(kmRoads.km2) +
              s(WELLS_ACTIVEinsideD) + s(d3) + s(WT_DEER_springsurveys) +
              s(BadlandsCoyote.1000_mi) + s(Average_mintemp_winter, BadlandsCoyote.1000_mi) +
              s(BadlandsCoyote.1000_mi, WELLS_ACTIVEinsideD) + s(BadlandsCoyote.1000_mi, d3) +
              s(YEAR, bs = "re") + s(StudyArea, bs = "re") + offset(log(FemaleFall)),
              family = "poisson", method = "REML", select = TRUE, data = mydata)

Plotting a multiple logistic regression for binary and continuous values in R

I have a data frame of mammal genera, where each row is a different genus. There are three columns: a column of each genus's geographic range size (a continuous variable), a column stating whether or not a genus is found inside or outside of river basins (a binary variable), and a column stating whether the genus is found in the fossil record (a binary variable).
I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.
Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[ ,"Basin"], family="binomial")
I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).
I know how to do this for a single-variable regression. For example, I know what plot to use if I wanted to see the relationship between just geographic range size and presence in the fossil record.
However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?
Thanks for the help!
Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line for predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package
##generating data
ssize <- 100
set.seed(12345)
dat <- data.frame(
Basin = rbinom(ssize, 1,.4),
GeographicRange = rnorm(ssize,10,2)
)
dat$FossilRecord = rbinom(ssize,1,(.3 + .1*dat$Basin + 0.04*dat$GeographicRange))
##fitting model
fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)
We can use the predict() function to obtain predicted response values for many GeographicRange values and for each Basin category.
##getting predicted response from model
plotting_dfm <- expand.grid(GeographicRange = seq(from=0, to = 20, by=0.1),
Basin = (0:1))
plotting_dfm$preds <- plogis( predict(fit , newdata=plotting_dfm))
Now you can plot the predicted results:
##plotting the predicted response on the two covariates
library(ggplot2)
pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
pl +
geom_point( ) +
ggtitle("Predicted FossilRecord by GeoRange and Basin") +
ggplot2::ylab("Predicted FossilRecord")
This produces a figure with one predicted probability curve for each Basin level.
You can plot a separate curve for each value of the categorical variable. You didn't provide sample data, so here's an example with another data set:
library(ggplot2)
# Data
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Model. gre is continuous. rank has four categories.
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
# Predict admit probability
newdata = expand.grid(gre=seq(200,800, length.out=100), rank=1:4)
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(gre, prob, color=factor(rank), group=rank)) +
geom_line()
UPDATE: To respond to @Provisional.Modulation's comment: There are lots of options, depending on what you want to highlight and what is visually clear enough to understand, given your particular data and model output.
Here's an example using the built-in mtcars data frame and a logistic regression with one categorical and two continuous predictor variables:
m1 = glm(vs ~ cyl + mpg + hp, data=mtcars, family=binomial)
Now we create a new data frame with the unique values of cyl, five quantiles of hp and a continuous sequence of mpg, which we'll put on the x-axis (you could also of course do quantiles of mpg and use hp as the x-axis variable). If you have many continuous variables, you may need to set some of them to a single value, say, the median, when you graph the relationships between other variables.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=20),
hp = quantile(hp)))
newdata$prob = predict(m1, newdata, type="response")
Here are three potential graphs, with varying degrees of legibility.
ggplot(newdata, aes(mpg, prob, colour=factor(cyl))) +
geom_line() +
facet_grid(. ~ hp)
ggplot(newdata, aes(mpg, prob, colour=factor(hp), linetype=factor(cyl))) +
geom_line()
ggplot(newdata, aes(mpg, prob, colour=factor(hp))) +
geom_line() +
facet_grid(. ~ cyl)
And here's another approach using geom_tile to include two continuous dimensions in each plot panel.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=100),
hp =seq(min(hp),max(hp),length=100)))
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(mpg, hp, fill=prob)) +
geom_tile() +
facet_grid(. ~ cyl) +
scale_fill_gradient2(low="red",mid="yellow",high="blue",midpoint=0.5,
limits=c(0,1))
If you're looking for a canned solution, the visreg package might work for you.
An example using @eipi10's data:
library(visreg)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
visreg(m1, "gre", by = "rank")
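Two options that are often useful here (a brief sketch; scale and overlay are standard visreg arguments) are plotting on the probability scale and overlaying the rank curves in a single panel:
visreg(m1, "gre", by = "rank", scale = "response", overlay = TRUE)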
Many more options are described in the documentation.
