R: plotting observed vs predicted values with a mixed effects model

I have a data set:
date=c(56,54,112,230,250,134,114)
species=c("pink","blue","pink","green","black","orange","purple")
year=c(1901,2000,1958,1978,1992,1992,1994)
loc=c("forest","river","river","cloud","cloud")
peop=c(1.0,-6.2,1.55,0.45,-2.8,3.45,4.1)
per=c(1,5,63,9,45,1,2)
tem=c(12,65,14,35,26,24,22)
high=c(2500,3400,2600,2800,2546,2148,3654)
From this data set I created a mixed effects model:
model<-lmer(date~(1|species)+ high + year + tem*peop + per)
I need a graph that plots the actual observed values for date vs the predicted ones by the model. Thanks!

Construct data (it's better to have data in a data frame rather than lying around in your global workspace):
dd <-
data.frame(date=c(56,54,112,230,250,134,114),
species=c("pink","blue","pink","green","black",
"orange","purple"),
year=c(1901,2000,1958,1978,1992,1992,1994),
## I added a couple of values here since the
## length didn't match the other variables
loc=c("forest","river","river","cloud","cloud","forest","forest"),
peop=c(1.0,-6.2,1.55,0.45,-2.8,3.45,4.1),
per=c(1,5,63,9,45,1,2),
tem=c(12,65,14,35,26,24,22),
high=c(2500,3400,2600,2800,2546,2148,3654))
This model can't actually be fit with a data set this short, so I replicated it (still very artificial, but OK for illustration):
dd <- do.call(rbind,replicate(10,dd,simplify=FALSE))
library("lme4")
model<-lmer(date~(1|species)+ high + year + tem*peop + per,dd)
Plot expected (x-axis) vs observed (y axis):
plot(fitted(model),dd$date)
abline(a=0,b=1) ## 1-to-1 line
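For a mixed model, the "predicted" value can mean two things: a prediction that includes the estimated random effects (what fitted() returns) or a prediction from the fixed effects alone. The following is a minimal sketch of both, assuming the lme4 package is installed and using its built-in sleepstudy data rather than the artificial data above; the variable names cond and pop are only illustrative.

```r
library(lme4)

# Random-intercept model on the sleepstudy data shipped with lme4
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

obs  <- sleepstudy$Reaction
cond <- predict(m)                # includes estimated random intercepts; same as fitted(m)
pop  <- predict(m, re.form = NA)  # fixed effects only (population-level prediction)

# Side-by-side observed vs predicted plots with a 1-to-1 reference line
op <- par(mfrow = c(1, 2))
plot(cond, obs, xlab = "Predicted (with random effects)", ylab = "Observed")
abline(a = 0, b = 1)
plot(pop, obs, xlab = "Predicted (fixed effects only)", ylab = "Observed")
abline(a = 0, b = 1)
par(op)

# Root mean squared error of each kind of prediction
rmse <- c(conditional = sqrt(mean((obs - cond)^2)),
          population  = sqrt(mean((obs - pop)^2)))
rmse
```

The conditional predictions usually sit closer to the 1-to-1 line because they use the group-specific intercepts; which version is appropriate depends on whether you want to describe the fit to these groups or predictions for new ones.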

Related

Visualising a three way interaction between two continuous variables and one categorical variable in R

I have a model in R that includes a significant three-way interaction between two continuous independent variables (IVContinuousA and IVContinuousB) and one categorical variable, IVCategorical (with two levels: Control and Treatment). The dependent variable is continuous (DV).
model <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical)
You can find the data here
I am trying to find out a way to visualise this in R to ease my interpretation of it (perhaps in ggplot2?).
Somewhat inspired by this blog post, I thought that I could dichotomise IVContinuousB into high and low values (so it would be a two-level factor itself):
IVContinuousBHigh <- mean(IVContinuousB) + sd (IVContinuousB)
IVContinuousBLow <- mean(IVContinuousB) - sd (IVContinuousB)
I then planned to plot the relationship between DV and IVContinuousA and fit lines representing the slopes of this relationship for different combinations of IVCategorical and my new dichotomised IVContinuousB:
IVCategoricalControl and IVContinuousBHigh
IVCategoricalControl and IVContinuousBLow
IVCategoricalTreatment and IVContinuousBHigh
IVCategoricalTreatment and IVContinuousBLow
My first question is does this sound like a viable solution to producing an interpretable plot of this three-way-interaction? I want to avoid 3D plots if possible as I don't find them intuitive... Or is there another way to go about it? Maybe facet plots for the different combinations above?
If it is an OK solution, my second question is: how do I generate the data to predict the fit lines that represent the different combinations above?
Third question - does anyone have any advice as to how to code this up in ggplot2?
I posted a very similar question on Cross Validated but because it is more code related I thought I would try here instead (I will remove the CV post if this one is more relevant to the community :) )
Thanks so much in advance,
Sarah
Note that there are NAs (left as blanks) in the DV column and the design is unbalanced - with slightly different numbers of datapoints in the Control vs Treatment groups of the variable IVCategorical.
FYI, I have the code for visualising a two-way interaction between IVContinuousA and IVCategorical:
A<-ggplot(data=data,aes(x=AOTAverage,y=SciconC,group=MisinfoCondition,shape=MisinfoCondition,col=MisinfoCondition))+geom_point(size=2)+geom_smooth(method='lm',formula=y~x)
But what I want is to plot this relationship conditional on IVContinuousB....
Here are a couple of options for visualizing the model output in two dimensions. I'm assuming that the goal here is to compare Treatment to Control.
library(tidyverse)
library(readxl) # read_excel() is not attached by library(tidyverse)
theme_set(theme_classic() +
theme(panel.background=element_rect(colour="grey40", fill=NA)))
dat = read_excel("Some Data.xlsx") # I downloaded your data file
mod <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical, data=dat)
# Function to create prediction grid data frame
make_pred_dat = function(data=dat, nA=20, nB=5) {
nCat = length(unique(data$IVCategorical))
d = with(data,
data.frame(IVContinuousA=rep(seq(min(IVContinuousA), max(IVContinuousA), length=nA), nB*nCat),
IVContinuousB=rep(rep(seq(min(IVContinuousB), max(IVContinuousB), length=nB), each=nA), nCat),
IVCategorical=rep(unique(IVCategorical), each=nA*nB)))
d$DV = predict(mod, newdata=d)
return(d)
}
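The same kind of prediction grid can be built with expand.grid(), which generates every combination of the supplied values without manual rep() bookkeeping. This is a minimal sketch on the built-in mtcars data, not the questioner's file: wt and hp stand in for the two continuous predictors and am for the two-level factor, and the 10th/50th/90th percentile choice for hp is just an assumption for illustration.

```r
# Stand-in data: two continuous predictors (wt, hp) and a two-level factor (am)
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
mod <- lm(mpg ~ wt * hp * am, data = dat)

# Every combination of 20 wt values, 3 hp values and both factor levels
grid <- expand.grid(
  wt = seq(min(dat$wt), max(dat$wt), length.out = 20),
  hp = unname(quantile(dat$hp, c(0.1, 0.5, 0.9))),
  am = levels(dat$am)
)

# Model-predicted response at each grid point
grid$mpg <- predict(mod, newdata = grid)
head(grid)
```

Each row of grid is one point on one fitted line, so the data frame can be passed straight to ggplot() with the factor mapped to colour and the second continuous variable used for faceting.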
IVContinuousA vs. DV by levels of IVContinuousB
The roles of IVContinuousA and IVContinuousB can of course be switched here.
ggplot(make_pred_dat(), aes(x=IVContinuousA, y=DV, colour=IVCategorical)) +
geom_line() +
facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinuousB") +
labs(colour="")
You can make a similar plot without faceting, but it gets difficult to interpret as the number of IVContinuousB levels increases:
ggplot(make_pred_dat(nB=3),
aes(x=IVContinuousA, y=DV, colour=IVCategorical, linetype=factor(round(IVContinuousB,2)))) +
geom_line() +
#facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinuousB") +
labs(colour="", linetype="IVContinuousB") +
scale_linetype_manual(values=c("1434","11","62")) +
guides(linetype=guide_legend(reverse=TRUE))
Heat map of the model-predicted difference, DV treatment - DV control on a grid of IVContinuousA and IVContinuousB values
Below, we look at the difference between treatment and control at each pair of IVContinuousA and IVContinuousB.
ggplot(make_pred_dat(nA=100, nB=100) %>%
group_by(IVContinuousA, IVContinuousB) %>%
arrange(IVCategorical) %>%
summarise(DV = diff(DV)),
aes(x=IVContinuousA, y=IVContinuousB)) +
geom_tile(aes(fill=DV)) +
scale_fill_gradient2(low="red", mid="white", high="blue") +
labs(fill=expression(Delta*DV~(Treatment - Control)))
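The treatment-minus-control difference can also be computed by predicting twice on the same grid, once per factor level, and subtracting, which avoids relying on row order inside each group. This is a sketch under assumptions: it uses the built-in mtcars data as a stand-in (wt and hp as the continuous predictors, am as the two-level factor), and pred_for is a hypothetical helper name.

```r
# Stand-in data and model (not the questioner's data)
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
mod <- lm(mpg ~ wt * hp * am, data = dat)

# 25 x 25 grid over the two continuous predictors; wt varies fastest
grid <- expand.grid(
  wt = seq(min(dat$wt), max(dat$wt), length.out = 25),
  hp = seq(min(dat$hp), max(dat$hp), length.out = 25)
)

# Hypothetical helper: predictions on the grid with the factor fixed at one level
pred_for <- function(level) {
  nd <- grid
  nd$am <- factor(level, levels = levels(dat$am))
  predict(mod, newdata = nd)
}

# Difference surface: manual minus automatic at every (wt, hp) pair
grid$diff <- pred_for("manual") - pred_for("automatic")

# Base-graphics heat map; rows of z index wt, columns index hp
z <- matrix(grid$diff, nrow = 25)
image(unique(grid$wt), unique(grid$hp), z, xlab = "wt", ylab = "hp")
```

Because the difference is stored as an ordinary column, the same grid can be handed to geom_tile() if you prefer ggplot2.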
If you really want to avoid 3-d plotting, you could indeed turn one of the continuous variables into a categorical one for visualization purposes.
For the purpose of the answer, I used the Duncan data set from the package car, as it is of the same form as the one you described.
library(car)
library(ggplot2) # needed for the plot at the end
# the data
data("Duncan")
# the fitted model; education and income are continuous, type is categorical
lm0 <- lm(prestige ~ education * income * type, data = Duncan)
# turning education into high and low values (you can extend this to more
# levels)
edu_high <- mean(Duncan$education) + sd(Duncan$education)
edu_low <- mean(Duncan$education) - sd(Duncan$education)
# the values below should be used for predictions, each combination of the
# categories must be represented:
prediction_mat <- data.frame(income = Duncan$income,
education = rep(c(edu_high, edu_low),each =
nrow(Duncan)),
type = rep(levels(Duncan$type), each =
nrow(Duncan)*2))
predicted <- predict(lm0, newdata = prediction_mat)
# rearranging the fitted values and the values used for predictions
df <- data.frame(predicted,
income = Duncan$income,
edu_group =rep(c("edu_high", "edu_low"),each = nrow(Duncan)),
type = rep(levels(Duncan$type), each = nrow(Duncan)*2))
# plotting the fitted regression lines
ggplot(df, aes(x = income, y = predicted, group = type, col = type)) +
geom_line() +
facet_grid(. ~ edu_group)

Combining 3 separate line plots into one line plot display

We are looking at pattern recognitions and making different variables
unusualsubjects <- rtaverages$subject_id[rtaverages$count < 5] # make a list of subjects without enough data.
rtaverages <- filter(rtaverages,!(subject_id %in% unusualsubjects)) # only include data from good subjects. ! = not. Put data from acceptable subjects right back in the same data frame
# Another example of filtering subjects: let's say we only wanted to analyze subjects with accuracies over 95%
accurateSubjects <- averages$subject_id[averages$accuracy > .95] #returns all of the subject_ids for subjects meeting an accuracy criterion
length(accurateSubjects) # tells us how many accurate subjects there are
goodSubjectdata<-filter(data,subject_id %in% accurateSubjects) # make a new data frame that contains only the data from accurate subjects
Code to conduct the actual ANOVA of the Response Times results
model<-ezANOVA(data=Data,dv=rt,within=c(set_size,target_presence,task),wid=subject_id) # You need to fill in the XXXs with the correct variable names within the variable containing all of the correct RTs. conduct a repeated measures ANOVA - dv = dependent variable. within = a list of all of the within subject variables. wid = variable that is used to group data by subject
model # show results of the ANOVA model
table1 <- tapply(X=Data$rt,INDEX=list(Data$task,Data$set_size),FUN=mean,trim=0.1)#find breakdown just of setsize and task - less broken down than the above tapply code, obtained just by deleting one item from the INDEX list "INDEX=list(rtaverages$target_presence,rtaverages$set_size,rtaverages$task)" above
table1 #show means so that one can begin to interpret the data. You'll break down rtaverages in different ways to get the different mean RTs that you need for your report
par(mar = c(4,4,4,0),mfrow=c(1, 2) ) # mfrow=c(1,2) creates two plots side by side
lineplot.CI(data=filter(rtaverages,task=="conjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Conjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
lineplot.CI(data=filter(rtaverages,task=="disjunctive"),x.factor=set_size,group=target_presence,x.cont=TRUE,response=rt,ylim=c(0,4000),x.leg=2,xlab="Disjunctive Set Size",ylab="RT") # produces a line graph with confidence intervals
Currently attempting to put 3 lines onto one plot the following way:
# The next bit of code is to reproduce Treisman and Gelade's Figure 1, including best lines of fit
rtaverages$set_size_num<-sizes[rtaverages$set_size] # added a new column to rtaverage data frame which is the numeric/continuous version of the nominal/categorical set_size factor which will be useful for predicting RT from set_size
bySetSize<-group_by(rtaverages,set_size_num,task,target_presence) #collapse even more, so all subjects' data are combined together
collapsed<-summarize(bySetSize,rt=mean(rt,trim=0.1)) # make RT summary
collapsed # show what collapsed data look like. Note that there are now only 4 (set sizes) X 2 (tasks) X 2 (present/absent trials)=16 rows
cp<-filter(collapsed,task=="conjunctive" & target_presence=="present") # plot each of the four lines separated, filtering by the right type each time
cpf<-lm(data=cp,rt ~ set_size_num) # use a linear model to predict RT from set size. Use this to get out best fitting slope (estimate for set size) and intercept
summary(cpf) # make a summary of the linear regression model. cpf stands for: conjunctive, present fit
cp3<-filter(collapsed,task=="conjunctive" & target_presence=="absent")
caf<-lm(data=cp3,rt ~ set_size_num)
summary(caf)
cp1<-filter(collapsed,task=="disjunctive" & target_presence=="present")
dpf<-lm(data=cp1,rt ~ set_size_num)
summary(dpf)
cp2<-filter(collapsed,task=="disjunctive" & target_presence=="absent")
daf<-lm(data=cp2,rt ~ set_size_num)
summary(daf)
plot(cp$set_size_num,cp$rt,ylim=c(0,4000),xlim=c(0,30),pch=19,col="green",xlab="Set Size",ylab="Response Time (msec.)") # use a big enough range to capture all of the data
abline(cpf, col="green") # add the line with the slope and intercept derived from linear model
lines(cp3$set_size_num,cp3$rt,col="green")
abline(caf, col="green")
lines(cp1$set_size_num,cp1$rt,col="red")
abline(dpf, col="red")
lines(cp2$set_size_num,cp2$rt,col="red")
abline(daf, col="red")
legend(x=0,y=4000,pch=c(19,1,19,1),col=c("green","green","red","red"),cex=0.7,legend=c("Conjunctive present","Conjunctive absent","Disjunctive present","Disjunctive absent")) #Legend only should be plotted once, pch sets 4 symbols, and col sets 4 colors. cex < 1 so that legend box isn't too big
I got them to combine, but now the lines lost their format:
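One possible cause, offered as a guess since the resulting figure is not shown: lines() draws connecting segments without plotting symbols, so only the first series (drawn by plot()) keeps its pch. Below is a minimal sketch of drawing all four conditions with points() plus abline() from one style table. The data are simulated with made-up slopes, and the styles table is a hypothetical helper, not part of the original code.

```r
set.seed(1)

# One row per condition: plotting style plus a made-up slope (msec per item)
styles <- data.frame(
  task            = c("conjunctive", "conjunctive", "disjunctive", "disjunctive"),
  target_presence = c("present", "absent", "present", "absent"),
  col             = c("green", "green", "red", "red"),
  pch             = c(19, 1, 19, 1),
  lty             = c(1, 2, 1, 2),
  slope           = c(30, 65, 3, 6),
  stringsAsFactors = FALSE
)

# Simulated stand-in for the collapsed means: 4 set sizes x 4 conditions
# (merge() with no shared columns returns every combination of rows)
collapsed <- merge(styles[c("task", "target_presence", "slope")],
                   data.frame(set_size_num = c(1, 5, 15, 30)))
collapsed$rt <- 450 + collapsed$slope * collapsed$set_size_num +
  rnorm(nrow(collapsed), sd = 40)

# Empty frame first, then add points and a fitted line per condition
plot(collapsed$set_size_num, collapsed$rt, type = "n",
     xlim = c(0, 30), ylim = c(0, 4000),
     xlab = "Set Size", ylab = "Response Time (msec.)")
fits <- vector("list", nrow(styles))
for (i in seq_len(nrow(styles))) {
  s <- styles[i, ]
  d <- collapsed[collapsed$task == s$task &
                 collapsed$target_presence == s$target_presence, ]
  points(d$set_size_num, d$rt, col = s$col, pch = s$pch)  # symbols, no segments
  fits[[i]] <- lm(rt ~ set_size_num, data = d)
  abline(fits[[i]], col = s$col, lty = s$lty)             # best-fitting line
}
legend("topleft", legend = paste(styles$task, styles$target_presence),
       col = styles$col, pch = styles$pch, lty = styles$lty, cex = 0.7)
```

Keeping colour, symbol and line type in one table means the legend is guaranteed to match what was drawn.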

R: How to read Nomograms to predict the desired variable

I am using RStudio. I have created nomograms using the function nomogram from package rms with the following code (copied from the example code of the documentation):
library(rms)
n <- 1000 # define sample size
set.seed(17) # so can reproduce the results
age <- rnorm(n, 50, 10)
blood.pressure <- rnorm(n, 120, 15)
cholesterol <- rnorm(n, 200, 25)
sex <- factor(sample(c('female','male'), n,TRUE))
# Specify population model for log odds that Y=1
L <- .4*(sex=='male') + .045*(age-50) +
(log(cholesterol - 10)-5.2)*(-2*(sex=='female') + 2*(sex=='male'))
# Simulate binary y to have Prob(y=1) = 1/[1+exp(-L)]
y <- ifelse(runif(n) < plogis(L), 1, 0)
ddist <- datadist(age, blood.pressure, cholesterol, sex)
options(datadist='ddist')
f <- lrm(y ~ lsp(age,50)+sex*rcs(cholesterol,4)+blood.pressure)
nom <- nomogram(f, fun=function(x)1/(1+exp(-x)), # or fun=plogis
fun.at=c(.001,.01,.05,seq(.1,.9,by=.1),.95,.99,.999),
funlabel="Risk of Death")
#Instead of fun.at, could have specified fun.lp.at=logit of
#sequence above - faster and slightly more accurate
plot(nom, xfrac=.45)
Result:
This code produces a nomogram but there is no line connecting each scale (called isopleth) to help predict the desired variable ("Risk of Death") from the plot. Usually, nomograms have the isopleth for prediction (example from wikipedia). But here, how do I predict the variable value?
EDIT:
From the documentation:
The nomogram does not have lines representing sums, but it has a
reference line for reading scoring points (default range 0--100). Once
the reader manually totals the points, the predicted values can be
read at the bottom.
I don't understand this. It seems that predicting is supposed to be done without the isopleth, from the scale of points. but how? Can someone please elaborate with this example on how I can read the nomograms to predict the desired variable? Thanks a lot!
EDIT 2 (FYI):
In the description of the bounty, I am talking about the isopleth. When starting the bounty, I did not know that nomogram function does not provide isopleth and has points scale instead.
From the documentation, the nomogram is used to manually obtain a prediction:
At the top of the plot (above Total Points),
you draw a vertical line to the Points scale for each of the variables of your patient (for example age=40, cholesterol=220 (with sex=male), blood.pressure=172).
Then you sum up the three values you read on the Points scale (40+60+3=103) to obtain Total Points.
Finally, you draw a vertical line from the Total Points scale (103) to read the Risk of Death (0.55).
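The arithmetic behind the reading steps above can be sketched in base R. Each predictor's contribution to the linear predictor is rescaled so that the predictor with the widest range of contributions spans 0 to 100 points; the total is then mapped back to the linear predictor and through the logistic function. This is an illustrative reconstruction on a simpler simulated model with two linear predictors, not the internals of rms::nomogram, and the names lowest, span and lp_per_point are assumptions.

```r
set.seed(17)
n <- 1000
d <- data.frame(age = rnorm(n, 50, 10), bp = rnorm(n, 120, 15))
d$y <- rbinom(n, 1, plogis(0.045 * (d$age - 50) + 0.02 * (d$bp - 120)))

fit <- glm(y ~ age + bp, family = binomial, data = d)
b <- coef(fit)

# Smallest contribution each predictor makes over its observed range;
# that value is assigned 0 points
lowest <- c(age = min(b[["age"]] * range(d$age)),
            bp  = min(b[["bp"]]  * range(d$bp)))

# Width of each predictor's contribution; the widest one is worth 100 points
span <- c(age = abs(b[["age"]]) * diff(range(d$age)),
          bp  = abs(b[["bp"]])  * diff(range(d$bp)))
lp_per_point <- max(span) / 100

# Step 1: read the points for each of the patient's values
patient <- c(age = 40, bp = 150)
pts <- (b[names(patient)] * patient - lowest) / lp_per_point

# Step 2: add them up
total_points <- sum(pts)

# Step 3: convert total points back to the linear predictor, then to a risk
lp <- b[["(Intercept)"]] + sum(lowest) + total_points * lp_per_point
risk <- plogis(lp)

pts
total_points
risk
```

The risk obtained this way agrees with predict(fit, newdata = data.frame(age = 40, bp = 150), type = "response"), which is why summing points by hand gives the model's prediction.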
These are regression nomograms, and they work differently from classic nomograms. A classic nomogram performs a full calculation. For these nomograms, you draw a line from each predictor to the Points scale and add up your results.
The only way to have a classic 'isopleth' nomogram working on a regression model would be (1) to have just two predictors or (2) to have a complex multi-step nomogram.

Obtaining confidence interval for npreg as values, not as plot

I am using the well known "np" package of Hayfield & Racine for non-parametric regressions. It allows plotting confidence bands for the estimated coefficient based on bootstrap procedures. See the code below for an example.
Question: I am wondering how to obtain these confidence intervals in numerical form. One, but not the only, reason for this question is that I really don't like the presentation of the CIs. More generally speaking, I would like to use and further process the confidence band within my analysis.
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
plot(mynp, plot.errors.method="bootstrap")
When you execute plot, the call is dispatched to the plot method of the np package, which is the function npplot.
npplot accepts an argument plot.behavior, which defaults to "plot": it plots the results and returns NULL. If you set plot.behavior = "plot-data", the function will plot and also return the data of the object.
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then the values on the line can be accessed through dat$r1$mean, and the values to be added to the mean to get the upper and lower CI can be accessed through dat$r1$merr.
Notice that not all values are plotted, only half of them (every other value and then the last).
Read the help on npplot for more options.
Below is an example of the use of the code and the results:
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then recreating the results:
z <- unlist(dat$r1$eval,use.names = F)
CI.up = as.numeric(dat$r1$mean)+as.numeric(dat$r1$merr[,2])
CI.dn = as.numeric(dat$r1$mean)+as.numeric(dat$r1$merr[,1])
plot(dat$r1$mean~z, cex=1.5,xaxt='n', ylim=c(1.0,3.5),xlab='',ylab='lalala!', main='blahblahblah',col='blue',pch=16)
arrows(z,CI.dn,z,CI.up,code=3,length=0.2,angle=90,col='red')
we will get:
As you can see, the results are the same (only I have calculated the intervals for each point and not only for half of them).
Note the plot.errors.type argument of npplot, which takes "standard" or "quantiles" and defaults to "standard". When you specify "standard", dat$r1$merr will hold the standard errors and the plot will include mean+std err as intervals. Otherwise, the plot will include the quantiles as the intervals and the quantiles will be saved in dat$r1$merr. Which quantiles to use is specified by plot.errors.quantiles, which is only relevant if plot.errors.type = "quantiles".
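To make the difference between standard-error intervals and quantile intervals concrete, here is a sketch that builds both from bootstrap replicates and stores them as plain numbers. It deliberately swaps in base R's ksmooth() as the smoother instead of npreg so that it runs without the np package; the bandwidth, grid size and number of replicates are arbitrary assumptions.

```r
set.seed(42)
x <- 1:100 + rnorm(100) / 2
y <- (1:100)^(0.25) + rnorm(100) / 2

# Evaluate the smoother on a fixed grid so that replicates are comparable
grid <- seq(min(x), max(x), length.out = 50)
smooth_at <- function(xs, ys) {
  ksmooth(xs, ys, kernel = "normal", bandwidth = 10, x.points = grid)$y
}

# Resample (x, y) pairs with replacement and re-smooth; one column per replicate
boot <- replicate(200, {
  i <- sample(length(x), replace = TRUE)
  smooth_at(x[i], y[i])
})

bands <- data.frame(
  x       = grid,
  fit     = smooth_at(x, y),
  se      = apply(boot, 1, sd, na.rm = TRUE),                       # "standard" style
  lower_q = apply(boot, 1, quantile, probs = 0.025, na.rm = TRUE),  # "quantiles" style
  upper_q = apply(boot, 1, quantile, probs = 0.975, na.rm = TRUE)
)

# Standard-error band as fit +/- 1.96 * se, for comparison with the quantile band
bands$lower_se <- bands$fit - 1.96 * bands$se
bands$upper_se <- bands$fit + 1.96 * bands$se
head(bands)
```

Once the band is a data frame like this, it can be plotted in any style or passed on to later steps of an analysis.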

plot multiple fit and predictions for logistic regression

I am running a logistic regression multiple times over more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create vector replacing the League (A or N) by 1 or 0.
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model whether the player is in league A as a function of Hits, Runs, Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
sp=sample(seq(1:length(PA)),30,replace=FALSE)
fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here.
PA is a subset of baseball$League but the model is constructed on columns from the whole baseball data frame, i.e. they do not match.
PA is treated as a continuous response when using the default family (gaussian), it should be changed to a factor and binomial family.
prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm
You must save the true class labels for the predictions otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
sp <- sample(nrow(dataset), 30, replace=FALSE)
fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
family="binomial", data=dataset[sp[1:15],])
preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
truths[[k]] <- dataset$League[sp[16:30]] # true labels of the held-out rows that were predicted
}
plot(unlist(truths), unlist(preds))
The model performs poorly but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N, i.e. ideally the left box should be close to 0 and the right close to 1.
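Beyond the box plot, the pooled held-out predictions can be summarised with a confusion table. The sketch below uses simulated data in place of the baseball set so that it runs with base R alone; the column names hits, runs and league, the sample sizes and the 0.5 cut-off are all assumptions for illustration.

```r
set.seed(123)
n <- 300
d <- data.frame(hits = rnorm(n), runs = rnorm(n))
# Simulated two-level outcome; glm() models the probability of the second level, "N"
d$league <- factor(ifelse(runif(n) < plogis(1.5 * d$hits - d$runs), "N", "A"))

# Five repetitions: fit on 100 rows, predict 100 different held-out rows
results <- do.call(rbind, lapply(1:5, function(k) {
  sp    <- sample(nrow(d), 200)
  train <- d[sp[1:100], ]
  test  <- d[sp[101:200], ]
  fit   <- glm(league ~ hits + runs, family = binomial, data = train)
  data.frame(run    = k,
             truth  = test$league,
             prob_N = predict(fit, newdata = test, type = "response"))
}))

# Predicted probability of "N" by true class, pooled over repetitions
boxplot(prob_N ~ truth, data = results,
        xlab = "True league", ylab = "Predicted P(league = N)")

# Confusion table and accuracy at a 0.5 cut-off
predicted <- factor(ifelse(results$prob_N > 0.5, "N", "A"),
                    levels = levels(d$league))
conf <- table(truth = results$truth, predicted = predicted)
accuracy <- sum(diag(conf)) / sum(conf)
conf
accuracy
```

Keeping run as a column also makes it easy to compute the accuracy per repetition, for example with tapply().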
