Visualising a three way interaction between two continuous variables and one categorical variable in R - r

I have a model in R that includes a significant three-way interaction between two continuous independent variables (IVContinuousA and IVContinuousB) and one categorical variable, IVCategorical (with two levels: Control and Treatment). The dependent variable (DV) is continuous.
model <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical)
You can find the data here
I am trying to find out a way to visualise this in R to ease my interpretation of it (perhaps in ggplot2?).
Somewhat inspired by this blog post I thought that I could dichotomise IVContinuousB into high and low values (so it would be a two-level factor itself):
IVContinuousBHigh <- mean(IVContinuousB) + sd (IVContinuousB)
IVContinuousBLow <- mean(IVContinuousB) - sd (IVContinuousB)
I then planned to plot the relationship between DV and IVContinuousA and fit lines representing the slopes of this relationship for different combinations of IVCategorical and my new dichotomised IVContinuousB:
IVCategoricalControl and IVContinuousBHigh
IVCategoricalControl and IVContinuousBLow
IVCategoricalTreatment and IVContinuousBHigh
IVCategoricalTreatment and IVContinuousBLow
My first question is does this sound like a viable solution to producing an interpretable plot of this three-way-interaction? I want to avoid 3D plots if possible as I don't find them intuitive... Or is there another way to go about it? Maybe facet plots for the different combinations above?
If it is an OK solution, my second question is: how do I generate the data to predict the fit lines representing the different combinations above?
Third question - does anyone have any advice as to how to code this up in ggplot2?
I posted a very similar question on Cross Validated but because it is more code related I thought I would try here instead (I will remove the CV post if this one is more relevant to the community :) )
Thanks so much in advance,
Sarah
Note that there are NAs (left as blanks) in the DV column and the design is unbalanced - with slightly different numbers of datapoints in the Control vs Treatment groups of the variable IVCategorical.
FYI I have the code for visualising a two-way interaction between IVContinuousA and IVCategorical:
A<-ggplot(data=data,aes(x=AOTAverage,y=SciconC,group=MisinfoCondition,shape=MisinfoCondition,col = MisinfoCondition,))+geom_point(size = 2)+geom_smooth(method='lm',formula=y~x)
But what I want is to plot this relationship conditional on IVContinuousB....

Here are a couple of options for visualizing the model output in two dimensions. I'm assuming that the goal here is to compare Treatment to Control.
library(tidyverse)
library(readxl) # read_excel() is not attached by library(tidyverse)
theme_set(theme_classic() +
theme(panel.background=element_rect(colour="grey40", fill=NA)))
dat = read_excel("Some Data.xlsx") # I downloaded your data file
mod <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical, data=dat)
# Function to create prediction grid data frame
make_pred_dat = function(data=dat, nA=20, nB=5) {
  nCat = length(unique(data$IVCategorical))
  # nA values of IVContinuousA crossed with nB values of IVContinuousB,
  # repeated once for each level of IVCategorical
  d = with(data,
           data.frame(IVContinuousA=rep(seq(min(IVContinuousA), max(IVContinuousA), length=nA), nB*nCat),
                      IVContinuousB=rep(rep(seq(min(IVContinuousB), max(IVContinuousB), length=nB), each=nA), nCat),
                      IVCategorical=rep(unique(IVCategorical), each=nA*nB)))
  d$DV = predict(mod, newdata=d)
  return(d)
}
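As a hedged alternative to the rep() bookkeeping above, the same kind of grid can be sketched with expand.grid(), which builds every combination directly. The example below is only an illustration on the built-in mtcars data: mpg stands in for DV, wt and hp for the two continuous predictors, and am (relabelled Control/Treatment) for the categorical one. Those stand-ins and the helper name make_grid are assumptions, not part of the original data.

```r
# Stand-in data (assumption): mtcars columns play the roles of the question's variables
cars <- transform(mtcars, am = factor(am, labels = c("Control", "Treatment")))
fit <- lm(mpg ~ wt * hp * am, data = cars)

# Hypothetical helper: every combination of nA wt values, nB hp values and both groups
make_grid <- function(data, nA = 20, nB = 5) {
  expand.grid(
    wt = seq(min(data$wt), max(data$wt), length.out = nA),
    hp = seq(min(data$hp), max(data$hp), length.out = nB),
    am = levels(data$am)
  )
}

grid <- make_grid(cars)
grid$mpg <- predict(fit, newdata = grid)  # model predictions on the grid
head(grid)
```

The resulting data frame has one row per combination and can be passed to ggplot() in the same way as the output of make_pred_dat().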
IVContinuousA vs. DV by levels of IVContinuousB
The roles of IVContinuousA and IVContinuousB can of course be switched here.
ggplot(make_pred_dat(), aes(x=IVContinuousA, y=DV, colour=IVCategorical)) +
geom_line() +
facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinuousB") +
labs(colour="")
You can make a similar plot without faceting, but it gets difficult to interpret as the number of IVContinuousB levels increases:
ggplot(make_pred_dat(nB=3),
aes(x=IVContinuousA, y=DV, colour=IVCategorical, linetype=factor(round(IVContinuousB,2)))) +
geom_line() +
#facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinuousB") +
labs(colour="", linetype="IVContinuousB") +
scale_linetype_manual(values=c("1434","11","62")) +
guides(linetype=guide_legend(reverse=TRUE))
Heat map of the model-predicted difference, DV treatment - DV control on a grid of IVContinuousA and IVContinuousB values
Below, we look at the difference between treatment and control at each pair of IVContinuousA and IVContinuousB.
ggplot(make_pred_dat(nA=100, nB=100) %>%
group_by(IVContinuousA, IVContinuousB) %>%
arrange(IVCategorical) %>%
summarise(DV = diff(DV)),
aes(x=IVContinuousA, y=IVContinuousB)) +
geom_tile(aes(fill=DV)) +
scale_fill_gradient2(low="red", mid="white", high="blue") +
labs(fill=expression(Delta*DV~(Treatment - Control)))

If you really want to avoid 3-d plotting, you could indeed turn one of the continuous variables into a categorical one for visualization purposes.
For the purpose of the answer, I used the Duncan data set from the package car, as it is of the same form as the one you described.
library(car)
library(ggplot2) # needed for the plot at the end
# the data
data("Duncan")
# the fitted model; education and income are continuous, type is categorical
lm0 <- lm(prestige ~ education * income * type, data = Duncan)
# turning education into high and low values (you can extend this to more
# levels)
edu_high <- mean(Duncan$education) + sd(Duncan$education)
edu_low <- mean(Duncan$education) - sd(Duncan$education)
# the values below should be used for predictions, each combination of the
# categories must be represented:
prediction_mat <- data.frame(income = Duncan$income,
education = rep(c(edu_high, edu_low),each =
nrow(Duncan)),
type = rep(levels(Duncan$type), each =
nrow(Duncan)*2))
predicted <- predict(lm0, newdata = prediction_mat)
# rearranging the fitted values and the values used for predictions
df <- data.frame(predicted,
income = Duncan$income,
edu_group =rep(c("edu_high", "edu_low"),each = nrow(Duncan)),
type = rep(levels(Duncan$type), each = nrow(Duncan)*2))
# plotting the fitted regression lines
ggplot(df, aes(x = income, y = predicted, group = type, col = type)) +
geom_line() +
facet_grid(. ~ edu_group)
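Each line in such a plot is a "simple slope": the slope of one continuous predictor at a fixed value of the other, within one group. As a rough sketch of how those slopes follow from the coefficients, the example below uses the built-in mtcars data as a stand-in (mpg as the response, wt and hp as the continuous predictors, am relabelled Control/Treatment); these stand-ins and the helper pred_slope are assumptions for illustration only. It computes the slope of wt at mean(hp) - sd(hp) and mean(hp) + sd(hp) from the coefficients and checks it against the change in prediction for a one-unit step in wt.

```r
# Stand-in data (assumption): mtcars columns play the roles of DV and the IVs
cars <- transform(mtcars, am = factor(am, labels = c("Control", "Treatment")))
fit <- lm(mpg ~ wt * hp * am, data = cars)
b <- coef(fit)

hp_levels <- c(low  = mean(cars$hp) - sd(cars$hp),
               high = mean(cars$hp) + sd(cars$hp))

# Slope of wt in the reference group: b_wt + b_wt:hp * hp
slope_control <- b["wt"] + b["wt:hp"] * hp_levels
# The other group adds its two-way and three-way interaction terms
slope_treatment <- slope_control + b["wt:amTreatment"] +
  b["wt:hp:amTreatment"] * hp_levels

# Hypothetical helper: the same slope from predictions one unit of wt apart
pred_slope <- function(hp, am) {
  nd <- data.frame(wt = c(3, 4), hp = hp,
                   am = factor(am, levels = levels(cars$am)))
  unname(diff(predict(fit, newdata = nd)))
}

rbind(Control = slope_control, Treatment = slope_treatment)
```

If the four slopes differ noticeably across the two hp levels and the two groups, that pattern is what the three-way interaction is describing.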

Related

Kaplan-Meier Survival curve with 3 Plots

I am very new to R studio, and I am currently learning how to do Kaplan-Meier survival curves.
Here are the columns that are needed and the information the columns contain:
Group: “normal”, “high”
Response: “responder” “non-responder”
Days: this is the time variable
Outcome: “0”, “1” (0 = censored & 1 = event)
I’m trying to plot 3 curves on one survival plot (Kaplan-Meier curve). I want to plot Normal (regardless of responder status) vs High Responder vs High Non-responder. Is there a way to do this?
I tried making subsets of the data (one including only those that are “normal” and another containing only those that were “high”) so that I could plot normal (1 curve, regardless of responder status) and use the other subset to plot high responder vs high non-responder (2 curves), but then I got stuck on how to combine them.
After you've created the first plot you can add the other lines over top of the first plot. Here's an example doing that with the built-in cancer dataset. I make the people aged 60 or older all in one group, but split the people under 60 into two groups based on their sex.
library(survival)
dat_60plus <- cancer[cancer$age >= 60,]
dat_under60 <- cancer[cancer$age < 60,]
mod_60plus <- survfit(Surv(time, status) ~ 1, data = dat_60plus)
mod_under60sex <- survfit(Surv(time, status) ~ sex, data = dat_under60)
plot(mod_60plus, conf.int = FALSE)
lines(mod_under60sex, conf.int = FALSE, col = c("blue", "red"))
Alternatively you could create a new variable which identifies which group each individual is in and make a single model based on that variable. I think this is conceptually a bit simpler, so long as creating the variable isn't too complicated.
dat_all <- cancer
dat_all$myvar <- factor(ifelse(cancer$age >= 60,
"60plus",
ifelse(cancer$sex == 1,
"under60male",
"under60female")))
mod_all <- survfit(Surv(time, status) ~ myvar, data = dat_all)
plot(mod_all, conf.int = FALSE, col = c("black", "red", "blue"))
Created on 2023-01-25 with reprex v2.0.2
A potential downside of the first method is that if one of the groups has much longer survival time than the others, then you might have to do more adjusting of the graphical settings to make the plot look good. The second method should have more automatic checks for that.
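Applied to the columns described in the question (Group, Response, Days, Outcome), the single-variable approach might look roughly like the sketch below. The data frame is simulated purely for illustration, and the column name Curve is an assumption introduced here.

```r
library(survival)

# Simulated stand-in for the asker's data; column names follow the question
set.seed(1)
n <- 120
dat <- data.frame(
  Group    = sample(c("normal", "high"), n, replace = TRUE),
  Response = sample(c("responder", "non-responder"), n, replace = TRUE),
  Days     = round(rexp(n, rate = 1 / 300)) + 1,
  Outcome  = rbinom(n, 1, 0.7)   # 0 = censored, 1 = event
)

# One three-level variable: normal (any response), high responder, high non-responder
dat$Curve <- factor(
  ifelse(dat$Group == "normal", "normal", paste("high", dat$Response)),
  levels = c("normal", "high responder", "high non-responder")
)

fit <- survfit(Surv(Days, Outcome) ~ Curve, data = dat)
cols <- c("black", "blue", "red")   # one colour per level, in factor-level order
plot(fit, conf.int = FALSE, col = cols,
     xlab = "Days", ylab = "Survival probability")
legend("topright", legend = levels(dat$Curve), col = cols, lty = 1)
```

Setting the factor levels explicitly keeps the order of the curves, colours and legend entries aligned.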

compare different multi level regressions

I am struggling with the following:
My idea is to analyse the development (slope) of an outcome across different multilevel regressions.
The outcome is matched in my data with 2 different timepoints.
I have 3 predictors (senseofhumor, seriousness, friendlyness).
These predictors are measured for many people and groups.
I assume here that SenseofhumorHIGH (a special-value variable derived from "senseofhumor") might have an impact on the outcome if it is high within a group. I also assume the slope might first increase dramatically and then increase more slowly.
How can I compare slopes from different regressions with each other?
What is the best way to visualize these slopes?
The code would look something like this:
RandomslopeEC_t1 <- lme(criteriatimepoint1 ~ senseofhumor + seriousness + friendlyness, data = DATA, random = ~ SenseofhumorHIGH | group)
RandomslopeEC_t2 <- lme(criteriatimepoint2 ~ senseofhumor + seriousness + friendlyness, data = DATA, random = ~ SenseofhumorHIGH | group)
RandomslopeEC_t3 <- lme(criteriatimepoint3 ~ senseofhumor + seriousness + friendlyness, data = DATA, random = ~ SenseofhumorHIGH | group)
Thanks a lot in advance
It worked out after changing the format from wide to long.
I used:
DATAlong <- DATA %>%
gather(`criteriatimepoint1`, `criteriatimepoint2`, `criteriatimepoint3`, key = "timepoint", value = "criteriavalue")
for that.
Afterwards I used
RandomslopeEC <- lme(criteriavalue ~ senseofhumor*timepoint + seriousness*timepoint + friendlyness*timepoint, data = DATAlong, random = ~ 1|group/timepoint)
to fit the model.
I hope this might help others as well.
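For readers who want to try the wide-to-long idea end to end, here is a minimal sketch on simulated data. It uses pivot_longer(), the current tidyr replacement for gather(), and a simplified model with one predictor and a random intercept per group; the simulated values, the reduced formula and the object names are assumptions made for illustration, not the asker's actual analysis.

```r
library(tidyr)
library(nlme)

# Simulated wide data (assumption): 20 groups of 10 people, three timepoints
set.seed(42)
n_group <- 20
n_per <- 10
DATA <- data.frame(
  group = factor(rep(seq_len(n_group), each = n_per)),
  senseofhumor = rnorm(n_group * n_per)
)
group_effect <- rnorm(n_group)[as.integer(DATA$group)]
# The slope of senseofhumor is simulated to rise quickly, then more slowly
DATA$criteriatimepoint1 <- 0.2 * DATA$senseofhumor + group_effect + rnorm(nrow(DATA))
DATA$criteriatimepoint2 <- 0.8 * DATA$senseofhumor + group_effect + rnorm(nrow(DATA))
DATA$criteriatimepoint3 <- 1.0 * DATA$senseofhumor + group_effect + rnorm(nrow(DATA))

# Wide to long: one row per person and timepoint
DATAlong <- as.data.frame(pivot_longer(
  DATA,
  cols = starts_with("criteriatimepoint"),
  names_to = "timepoint",
  values_to = "criteriavalue"
))

# Simplified model: the senseofhumor:timepoint rows estimate how the slope
# at timepoints 2 and 3 differs from the slope at timepoint 1
fit <- lme(criteriavalue ~ senseofhumor * timepoint,
           data = DATAlong, random = ~ 1 | group)
summary(fit)$tTable
```

Fitting one model on the long data turns "comparing slopes from different regressions" into reading the interaction terms of a single regression.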

R: plotting actual vs observed with mixed effects model

I have a data set:
date=c(56,54,112,230,250,134,114)
species=c("pink","blue","pink","green","black","orange","purple")
year=c(1901,2000,1958,1978,1992,1992,1994)
loc=c("forest","river","river","cloud","cloud")
peop=c(1.0,-6.2,1.55,0.45,-2.8,3.45,4.1)
per=c(1,5,63,9,45,1,2)
tem=c(12,65,14,35,26,24,22)
high=c(2500,3400,2600,2800,2546,2148,3654)
From this data set I created a mixed effects model:
model<-lmer(date~(1|species)+ high + year + tem*peop + per)
I need a graph that plots the actual observed values for date vs the predicted ones by the model. Thanks!
Construct data (it's better to have data in a data frame rather than lying around in your global workspace):
dd <-
data.frame(date=c(56,54,112,230,250,134,114),
species=c("pink","blue","pink","green","black",
"orange","purple"),
year=c(1901,2000,1958,1978,1992,1992,1994),
## I added a couple of values here since the
## length didn't match the other variables
loc=c("forest","river","river","cloud","cloud","forest","forest"),
peop=c(1.0,-6.2,1.55,0.45,-2.8,3.45,4.1),
per=c(1,5,63,9,45,1,2),
tem=c(12,65,14,35,26,24,22),
high=c(2500,3400,2600,2800,2546,2148,3654))
This model can't actually be fit with a data set this short, so I replicated it (still very artificial, but OK for illustration)
dd <- do.call(rbind,replicate(10,dd,simplify=FALSE))
library("lme4")
model<-lmer(date~(1|species)+ high + year + tem*peop + per,dd)
Plot expected (x-axis) vs observed (y axis):
plot(fitted(model),dd$date)
abline(a=0,b=1) ## 1-to-1 line
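One detail worth knowing with mixed models is that fitted() includes the estimated random effects. A hedged sketch of the difference, using the sleepstudy data that ships with lme4 rather than the data above: predict() with re.form = NA gives predictions from the fixed effects only, which is usually what you get for a new, unseen group.

```r
library(lme4)

# sleepstudy ships with lme4; it is used here only as a stand-in data set
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

pred_conditional <- fitted(fit)                 # fixed effects plus subject intercepts
pred_population  <- predict(fit, re.form = NA)  # fixed effects only

op <- par(mfrow = c(1, 2))
plot(pred_conditional, sleepstudy$Reaction,
     xlab = "Predicted (with random effects)", ylab = "Observed")
abline(a = 0, b = 1)
plot(pred_population, sleepstudy$Reaction,
     xlab = "Predicted (fixed effects only)", ylab = "Observed")
abline(a = 0, b = 1)
par(op)
```

The left panel will typically hug the 1-to-1 line more closely, because the subject-level intercepts absorb part of the variation.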

Plot "regression line" from multiple regression in R

I ran a multiple regression with several continuous predictors, a few of which came out significant, and I'd like to create a scatterplot or scatter-like plot of my DV against one of the predictors, including a "regression line". How can I do this?
My plot looks like this
D = my.data; plot( D$probCategorySame, D$posttestScore )
If it were simple regression, I could add a regression line like this:
lmSimple <- lm( posttestScore ~ probCategorySame, data=D )
abline( lmSimple )
But my actual model is like this:
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
I would like to add a regression line that reflects the coefficient and intercept from the actual model instead of the simplified one. I think I'd be happy to assume mean values for all other predictors in order to do this, although I'm ready to hear advice to the contrary.
This might make no difference, but I'll mention just in case, the situation is complicated slightly by the fact that I probably will not want to plot the original data. Instead, I'd like to plot mean values of the DV for binned values of the predictor, like so:
D[,'probCSBinned'] = cut( my.data$probCategorySame, as.numeric( seq( 0,1,0.04 ) ), include.lowest=TRUE, right=FALSE, labels=FALSE )
D = aggregate( posttestScore~probCSBinned, data=D, FUN=mean )
plot( D$probCSBinned, D$posttestScore )
Just because it happens to look much cleaner for my data when I do it this way.
To plot the individual terms in a linear or generalised linear model (ie, fit with lm or glm), use termplot. No need for binning or other manipulation.
# plot everything on one page
par(mfrow=c(2,3))
termplot(lmMultiple)
# plot individual term
par(mfrow=c(1,1))
termplot(lmMultiple, terms="pretestScore")
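If the points should appear alongside the fitted term, termplot can also draw partial residuals. The following is a small sketch on the built-in mtcars data, used as a stand-in because the original data are not available; the model formula is an assumption for illustration.

```r
# Stand-in model (assumption): three continuous predictors from mtcars
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# One panel per term, with partial residuals and pointwise standard error bands
op <- par(mfrow = c(1, 3))
termplot(fit, partial.resid = TRUE, se = TRUE)
par(op)

# plot = FALSE returns the plotted values instead of drawing them
tp <- termplot(fit, plot = FALSE)
str(tp[[1]])
```

The list returned with plot = FALSE can be used to redraw a term with other plotting tools.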
You need to create a vector of x-values in the domain of your plot and predict their corresponding y-values from your model. To do this, you need to inject this vector into a dataframe comprised of variables that match those in your model. You stated that you are OK with keeping the other variables fixed at their mean values, so I have used that approach in my solution. Whether or not the x-values you are predicting are actually legal given the other values in your plot should probably be something you consider when setting this up.
Without sample data I can't be sure this will work exactly for you, so I apologize if there are any bugs below, but this should at least illustrate the approach.
# Setup
xmin = 0; xmax=10 # domain of your plot
D = my.data
plot( D$probCategorySame, D$posttestScore, xlim=c(xmin,xmax) )
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
# create a dummy dataframe where all variables = their mean value for each record
# except the variable we want to plot, which will vary incrementally over the
# domain of the plot. We need this object to get the predicted values we
# want to plot.
N=1e4
means = colMeans(D) # assumes every column of D is numeric
# repeat the row of means N times, keeping the column names
dummyDF = as.data.frame(matrix(means, nrow=N, ncol=length(means), byrow=TRUE,
dimnames=list(NULL, names(means))))
xv=seq(xmin,xmax, length.out=N)
# overwrite the predictor being plotted with the grid of x-values
dummyDF$probCategorySame = xv
# Getting and plotting predictions over our dummy data.
yv=predict(lmMultiple, newdata=dummyDF)
lines(xv, yv)
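Since no sample data were posted, here is a compact version of the same idea that runs on the built-in mtcars data. The variables are stand-ins (mpg for the DV, wt for the predictor of interest, hp and disp for the other predictors), chosen only for illustration.

```r
# Stand-in model (assumption): mpg is the DV, wt the predictor on the x-axis
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Vary wt over its observed range; hold the other predictors at their means
xv <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)
newdat <- data.frame(wt = xv,
                     hp = mean(mtcars$hp),
                     disp = mean(mtcars$disp))
yv <- predict(fit, newdata = newdat)

plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
lines(xv, yv)
```

Because the model is linear with no interactions, the slope of this line equals the wt coefficient from the multiple regression; holding the other predictors at different values would only shift the line up or down.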
Look at the Predict.Plot function in the TeachingDemos package for one option to plot one predictor vs. the response at a given value of the other predictors.

geom_smooth on a subset of data

Here is some data and a plot:
set.seed(18)
data = data.frame(y=c(rep(0:1,3),rnorm(18,mean=0.5,sd=0.1)),colour=rep(1:2,12),x=rep(1:4,each=6))
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+geom_point()+ geom_smooth(method='lm',formula=y~x,se=F)
As you can see the linear regression is highly influenced by the values where x=1.
Can I get linear regressions calculated for x >= 2 but still display the values for x=1 (where y equals either 0 or 1)?
The resulting graph would be exactly the same except for the linear regressions. They would not "suffer" from the influence of the values at abscissa x = 1.
It's as simple as geom_smooth(data=subset(data, x >= 2), ...). It's not important if this plot is just for yourself, but realize that something like this would be misleading to others if you don't include a mention of how the regression was performed. I'd recommend changing transparency of the points excluded.
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=subset(data, x >= 2)) + geom_point(data=subset(data, x < 2), alpha=.2) +
geom_smooth(data=subset(data, x >= 2), method='lm',formula=y~x,se=F)
The regular lm function has a weights argument which you can use to assign a weight to a particular observation. In this way you can play with the influence that the observation has on the outcome. I think this is a more general way of dealing with the problem than subsetting the data. Of course, assigning weights ad hoc does not bode well for the statistical soundness of the analysis. It is always best to have a rationale behind the weights, e.g. low-weight observations have a higher uncertainty.
I think under the hood ggplot2 uses the lm function so you should be able to pass the weights argument. You can add the weights through the aesthetic (aes), assuming that the weight is stored in a vector:
ggplot(data,aes(x=x,y=y,colour=factor(colour))) +
geom_point()+ stat_smooth(aes(weight = runif(nrow(data))), method='lm')
You could also put the weights in a column in the dataset:
ggplot(data,aes(x=x,y=y,colour=factor(colour))) +
geom_point()+ stat_smooth(aes(weight = weight), method='lm')
where the column is called weight.
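To connect the two approaches: in lm, a weight of 0 removes an observation from the fit, so 0/1 weights should reproduce the subset regression. A quick sketch with the data from the question, ignoring the colour grouping for simplicity (the object names fit_subset and fit_weight are introduced here for illustration):

```r
# Data from the question
set.seed(18)
data <- data.frame(y = c(rep(0:1, 3), rnorm(18, mean = 0.5, sd = 0.1)),
                   colour = rep(1:2, 12),
                   x = rep(1:4, each = 6))

# Weight 0 for the points at x = 1, weight 1 for everything else
data$weight <- as.numeric(data$x >= 2)

fit_subset <- lm(y ~ x, data = subset(data, x >= 2))
fit_weight <- lm(y ~ x, data = data, weights = weight)

# The two fits give the same intercept and slope
rbind(subset = coef(fit_subset), weighted = coef(fit_weight))
```

Weights strictly between 0 and 1 would instead shrink the influence of those points without removing them entirely.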
I tried @Matthew Plourde's solution, but subset did not work for me. Using the subset made no difference compared to the original data. I replaced subset with explicit row filtering and it worked:
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=data[data$x >= 2,]) + geom_point(data=data[data$x < 2,], alpha=.2) +
geom_smooth(data=data[data$x >= 2,], method='lm',formula=y~x,se=F)
