Labelling the residuals on diagnostic plots - r

I have made a linear regression model in R with 3 continuous independent variables and one continuous dependent variable. I have generated the diagnostic plots.
I would now like to label/colour the data points for each residual on my diagnostic plots according to the binary categorical independent variable that was not included in the model;
i.e. when this variable = A, I want a blue dot on my diagnostic plot,
and when this variable = B, I want a red dot, so there will be red and blue dots on my diagnostic plots.
I would love some advice on how to do this.

[You don't specify what diagnostic plots you're trying to do this to. You also haven't given a minimal reproducible example, which makes it difficult to alter what you were doing to do what you want.]
I'll give an example of the kind of command that does what you need and you may be able to adapt it to whatever displays you need.
library(MASS)
catsmdl <- lm(Hwt~Bwt,cats)
plot(residuals(catsmdl)~fitted(catsmdl), col=cats$Sex)
abline(h=0, col=8, lty=3)
which gives a plot of residuals against fitted values, with the points coloured by sex.
This even works with plot.lm, because it has a ... argument to pass information along to the lower level plotting functions. So for example:
opar <- par(no.readonly=TRUE)  # save only the settable parameters, so par(opar) restores without warnings
par(mfrow=c(2,2))
plot(catsmdl,col=c("blue","darkorange")[as.numeric(cats$Sex)])
par(opar)
If you replace c("blue","darkorange") with whatever colours you like, it should work. (There are a variety of ways to specify colours in R.)
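Closer to the situation in the question (three continuous predictors, plus a binary variable that is not in the model), here is a minimal sketch. It assumes the built-in mtcars data as a stand-in, with am relabelled as A/B; the names mdl, grp and pt_col are made up for illustration. It also assumes no rows were dropped for missing values, so the colour vector lines up with the residuals.

```r
# mtcars ships with R; am (0/1) plays the role of the binary variable
# that was left out of the model. Relabel it A/B to match the question.
mdl <- lm(mpg ~ wt + hp + disp, data = mtcars)
grp <- factor(mtcars$am, labels = c("A", "B"))

# One colour per observation: blue for A, red for B
pt_col <- unname(c(A = "blue", B = "red")[as.character(grp)])

opar <- par(mfrow = c(2, 2))   # par() returns the old value of what it changes
plot(mdl, col = pt_col)        # col is passed on through plot.lm's ... argument
par(opar)
```

If the model was fitted with rows removed (for example because of NAs), the grouping variable would need to be subset to the same rows before building the colour vector.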

Related

plotting custom confidence intervals around curve fits R

Say I have some relationships between x and some outcome variable, for three different groups, A,B and C:
x<-c(0:470)/1000
#3 groups, each has a different v-max parameter value.
v.A<-5
v.B<-4
v.C<-3
C<- (v.C*x)/(0.02+x)
B<- (v.B*x)/(0.02+x)
A<-(v.A*x)/(0.02+x)
d.curve<-data.frame(x,A,B,C)
The estimates of the v. parameter also have associated errors:
err.A<-0.24
err.B<-0.22
err.C<-0.29
I'd like to plot these curve fits, as well as shaded error regions around each curve, based on the uncertainty in the v. parameter. So, the shaded region would be +/- one error value. I can generate the plot of the 3 curves easily enough:
limx<-c(0,0.47)
limy<-c(0,5.5)
plot(A~x,data=d.curve,xlim=limx,ylim=limy,col=NA)
lines(smooth.spline(d.curve$x,d.curve$A),col='black',lwd=3)
par(new=T)
plot(B~x,data=d.curve,xlim=limx,ylim=limy,col=NA,ylab=NA,xlab=NA,axes=F)
lines(smooth.spline(d.curve$x,d.curve$B),col='black',lwd=3,lty=2)
par(new=T)
plot(C~x,data=d.curve,xlim=limx,ylim=limy,col=NA,ylab=NA,xlab=NA,axes=F)
lines(smooth.spline(d.curve$x,d.curve$C),col='black',lwd=3,lty=3)
But how can I add custom shaded regions around them, based on specified error terms?
You can add the following code to your current code. The calculation of the error of the line is based on the error (assumed standard error) of the coefficients. You can change the calculation for the error of the line to something else, if desired. The order of plotting might need to be changed to make the polygons appear behind the lines.
# calculating the standard error of each line based on the standard error of v.A, v.B, v.C
# could substitute another calculation
se.line.A <- ((x)/(0.02+x))*err.A
se.line.B <- ((x)/(0.02+x))*err.B
se.line.C <- ((x)/(0.02+x))*err.C
# polygon() is in the graphics package, which is attached by default,
# so this call is harmless but not strictly needed
library(graphics)
# plotting polygons
# colors can be changed
# polygons will be drawn over the existing lines
# may change the order of plotting for the shaded regions to be behind line
polygon(c(x,rev(x))
,c(A+se.line.A,rev(A-se.line.A))
,col='gray'
,density=100)
polygon(c(x,rev(x))
,c(B+se.line.B,rev(B-se.line.B))
,col='blue'
,density=100)
polygon(c(x,rev(x))
,c(C+se.line.C,rev(C-se.line.C))
,col='green'
,density=100)
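To get the shaded regions behind the lines, one option is to set up an empty plot, draw the polygons first, and add the curves last. The following is a sketch only: it rebuilds the curves from the question's values, and the semi-transparent fills via adjustcolor() are my choice, not part of the answer above. The vector names v, err, shape, fill and lty are illustrative.

```r
x <- (0:470) / 1000
v     <- c(A = 5, B = 4, C = 3)            # v-max estimates from the question
err   <- c(A = 0.24, B = 0.22, C = 0.29)   # their errors
shape <- x / (0.02 + x)                    # curve is v * shape
fill  <- c(A = "gray", B = "blue", C = "green")
lty   <- c(A = 1, B = 2, C = 3)

# Empty plot that only sets up the axes
plot(x, v[["A"]] * shape, type = "n",
     xlim = c(0, 0.47), ylim = c(0, 5.5), xlab = "x", ylab = "y")

# Shaded bands first: curve evaluated at v + err and v - err
for (g in names(v)) {
  upper <- (v[[g]] + err[[g]]) * shape
  lower <- (v[[g]] - err[[g]]) * shape
  polygon(c(x, rev(x)), c(upper, rev(lower)),
          col = adjustcolor(fill[[g]], alpha.f = 0.3), border = NA)
}

# Curves last, so they sit on top of the bands
for (g in names(v)) {
  lines(x, v[[g]] * shape, lwd = 3, lty = lty[[g]])
}
```

Because the curve is linear in v, (v + err) * shape is the same band as the curve plus se.line computed above.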

Plot which parameter where in R?

So... I'm looking at an example in a book that goes something like this:
library(daewr)
mod1 <- aov(height ~ time, data=bread)
summary(mod1)
...
par(mfrow=c(2,2))
plot(mod1, which=5)
plot(mod1, which=1)
plot(mod1, which=2)
plot(residuals(mod1) ~ loaf, main="Residuals vs Exp. Units", font.main=1, data=bread)
abline(h = 0, lty = 2)
That all works... but the text is a little vague about the purpose of the parameter 'which='. I dug around in the help (in RStudio) on plot() and par(), looked around online... found some references to a different 'which()'... but nothing really referring me to the purpose/syntax for the parameter 'which=' inside plot().
A bit later (next page, figures) I found a mention of using names(mod1) to view the list of quantities calculated by aov... which I presume is what which= is referring to, i.e. which item in the list to plot where in the 2x2 matrix of plots. Yay. Now where the heck is that buried in the docs?!?
The which argument selects which of the six diagnostic plots are displayed:
1) A plot of residuals against fitted values
2) A normal Q-Q plot
3) A Scale-Location plot of sqrt(| residuals |) against fitted values
4) A plot of Cook's distances versus row labels
5) A plot of residuals against leverages
6) A plot of Cook's distances against leverage/(1-leverage)
By default, plots 1, 2, 3 and 5 are provided.
Check ?plot.lm in R for more details.
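As a small illustration of why the documentation sits under plot.lm rather than plot: an aov fit also carries the class "lm", so plot() dispatches to the lm method. This sketch assumes the built-in PlantGrowth data in place of the book's bread data; mod is an illustrative name.

```r
# One-way ANOVA on a data set that ships with R
mod <- aov(weight ~ group, data = PlantGrowth)
class(mod)   # an aov object also inherits from "lm"

opar <- par(mfrow = c(1, 2))
plot(mod, which = c(1, 2))   # residuals vs fitted, then the normal Q-Q plot
par(opar)
```

which can be a single number, as in the book's code, or a vector of numbers from the list above.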

lines() not properly displaying quadratic fit

I'm simply trying to display the fit I've generated using lm(), but the lines function is giving me a weird result in which there are multiple lines coming out of one point.
Here is my code:
library(ISLR)
data(Wage)
lm.mod<-lm(wage~poly(age, 4), data=Wage)
Wage$lm.fit<-predict(lm.mod, Wage)
plot(Wage$age, Wage$wage)
lines(Wage$age, Wage$lm.fit, col="blue")
I've tried resetting my plot with dev.off(), but I've had no luck. I'm using RStudio. FWIW, the line shows up perfectly fine if I make the regression linear only, but as soon as I make it quadratic or higher (using I(age^2) or poly()), I get a weird graph. Also, the points() function works fine with poly().
Thanks for the help.
Because you forgot to order the points by age first, the lines are going to random ages. This is happening for the linear regression too; the reason it looks fine there is that travelling between any set of points on a straight line, in any order, stays on the line!
plot(Wage$age, Wage$wage)
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue')
Consider increasing the line width for a better view:
lines(sort(Wage$age), Wage$lm.fit[order(Wage$age)], col = 'blue', lwd = 3)
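The same idea can be written with a single index vector, which keeps x and the fitted values in step. This is a sketch on the built-in mtcars data (an assumption, so that it runs without the ISLR package); fit and o are illustrative names.

```r
# Quadratic fit on data whose rows are not sorted by the predictor
fit <- lm(mpg ~ poly(hp, 2), data = mtcars)

# order() returns the row positions that sort hp; reuse it for both vectors
o <- order(mtcars$hp)

plot(mtcars$hp, mtcars$mpg, xlab = "hp", ylab = "mpg")
lines(mtcars$hp[o], fitted(fit)[o], col = "blue", lwd = 2)
```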
Just to add another more general tip on plotting model predictions:
An often used strategy is to create a new data set (e.g. newdat) which contains a sequence of values for your predictor variables across a range of possible values. Then use this data to show your predicted values. In this data set, you have a good spread of predictor variable values, but this may not always be the case. With the new data set, you can ensure that your line represents evenly distributed values across the variable's range:
Example
newdat <- data.frame(age=seq(min(Wage$age), max(Wage$age), length.out=1000))
newdat$pred <- predict(lm.mod, newdata=newdat)
plot(Wage$age, Wage$wage, col=8, ylab="Wage", xlab="Age")
lines(newdat$age, newdat$pred, col="blue", lwd=2)

Plotting histograms with R; y axis keeps changing to frequency from proportion/probability

I am trying to overlay two histograms in the same plane, but the option probability=TRUE (relative frequencies) in hist() has no effect with the code below. This is a problem because the two samples have very different sizes (length(cl1)=9 and length(cl2)=339) and, with this script, I cannot visualize differences between the histograms because each shows frequencies. How can I overlap two histograms with the same bin width, showing relative frequencies?
c1<-hist(dataList[["cl1"]],xlim=range(minx,maxx),breaks=seq(minx,maxx,pasx),col=rgb(1,0,0,1/4),main=paste(paramlab,"Group",groupnum,"cl1",sep=" "),xlab="",probability=TRUE)
c2<-hist(dataList[["cl2"]],xlim=range(minx,maxx),breaks=seq(minx,maxx,pasx),col=rgb(0,0,1,1/4),main=paste(paramlab,"Group",groupnum,"cl2",sep=" "),xlab="",probability=TRUE)
plot(c1, col=rgb(1,0,0,1/4), xlim=c(minx,maxx), main=paste(paramlab,"Group",groupnum,sep=" "),xlab="")# first histogram
plot(c2, col=rgb(0,0,1,1/4), xlim=c(minx,maxx), add=T)
cl1Col <- rgb(1,0,0,1/4)
cl2Col <- rgb(0,0,1,1/4)
legend('topright',c('Cl1','Cl2'),
fill = c(cl1Col , cl2Col ), bty = 'n',
border = NA)
Thanks in advance for your help!
When you call plot on an object of class histogram (like c1), it calls the S3 method for the histogram. Namely, plot.histogram. You can see the code for this function if you type graphics:::plot.histogram and you can see its help under ?plot.histogram. The help file for that function states:
freq logical; if TRUE, the histogram graphic is to present a
representation of frequencies, i.e, x$counts; if FALSE, relative
frequencies (probabilities), i.e., x$density, are plotted. The default
is true for equidistant breaks and false otherwise.
So, when plot renders a histogram it doesn't use the previously specified probability or freq arguments; it tries to figure it out for itself. The reason becomes obvious if you dig around inside c1: it contains all of the data necessary for the plot, but does not specify how it should be rendered.
So, the solution is to reiterate the argument freq=FALSE when you run the plot functions. Notably, freq=FALSE works whereas probability=TRUE does not because plot.histogram does not have a probability option. So, your plot code will be:
plot(c1, col=rgb(1,0,0,1/4), xlim=c(minx,maxx), main=paste(paramlab,"Group",groupnum,sep=" "),xlab="",freq=FALSE)# first histogram
plot(c2, col=rgb(0,0,1,1/4), xlim=c(minx,maxx), add=T, freq=FALSE)
This all seems like an oversight/idiosyncratic decision (or lack thereof) on the part of the R devs. To their credit it is appropriately documented and is not "unexpected behavior" (although I certainly didn't expect it). I wonder where such oddness should be reported, if it should be reported at all.
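For a version that can be run without the asker's dataList, here is a sketch with simulated samples of the same sizes (9 and 339). The data, break width and names (cl1, cl2, brks, h1, h2) are assumptions for illustration; the histograms are computed with plot=FALSE and then drawn with freq=FALSE.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
cl1 <- rnorm(9,   mean = 10)      # small sample
cl2 <- rnorm(339, mean = 12)      # large sample

# Shared, equally spaced breaks covering both samples
brks <- seq(floor(min(cl1, cl2)), ceiling(max(cl1, cl2)), by = 0.5)

# Compute the histogram objects without drawing them
h1 <- hist(cl1, breaks = brks, plot = FALSE)
h2 <- hist(cl2, breaks = brks, plot = FALSE)

# Draw densities; ylim must cover the taller of the two
ymax <- max(h1$density, h2$density)
plot(h1, freq = FALSE, col = rgb(1, 0, 0, 1/4),
     xlim = range(brks), ylim = c(0, ymax), main = "cl1 vs cl2", xlab = "")
plot(h2, freq = FALSE, col = rgb(0, 0, 1, 1/4), add = TRUE)
legend("topright", c("cl1", "cl2"),
       fill = c(rgb(1, 0, 0, 1/4), rgb(0, 0, 1, 1/4)), bty = "n", border = NA)
```

With freq=FALSE each histogram's bars have total area 1, so the two samples are comparable despite the difference in size.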

R - logistic curve plot with aggregate points

Let's say I have the following dataset
bodysize=rnorm(20,30,2)
bodysize=sort(bodysize)
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat=as.data.frame(cbind(bodysize,survive))
I'm aware that the glm plot function has several nice plots to show you the fit,
but I'd nevertheless like to create an initial plot with:
1) raw data points
2) the logistic curve, and both
3) predicted points
4) and aggregate points for a number of predictor levels
library(Hmisc)
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
All fine up to here.
Now I want to plot the observed survival rates for given levels of the predictor (body size)
dat$bd<-cut2(dat$bodysize,g=5,levels.mean=T)
AggBd<-aggregate(dat$survive,by=list(dat$bd),data=dat,FUN=mean)
plot(AggBd,add=TRUE)
#Doesn't work
I've tried to match AggBd to the dataset used for the model and all sort of other things but I simply can't plot the two together. Is there a way around this?
I basically want to superimpose the last plot along the same axes.
Besides this specific task, I often wonder how to superimpose different plots that show different variables but have a similar scale/range on two-dimensional plots. I would really appreciate your help.
The first column of AggBd is a factor; you need to convert its levels to numeric before you can add the points to the plot.
AggBd$size <- as.numeric (levels (AggBd$Group.1))[AggBd$Group.1]
To add the points to the existing plot, use points:
points (AggBd$size, AggBd$x, pch = 3)
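The whole sequence can be sketched end to end without Hmisc. Note the substitution: this uses base R's cut() with quantile breaks in place of cut2(), and takes the mean body size within each bin as the x position. The seed and the names brks, agg_x and agg_y are assumptions for illustration.

```r
set.seed(1)                                   # arbitrary seed, not in the question
bodysize <- sort(rnorm(20, 30, 2))
survive  <- c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat <- data.frame(bodysize, survive)

g <- glm(survive ~ bodysize, family = binomial, data = dat)

# Five bins from quantile breaks (base-R stand-in for cut2(..., g = 5))
brks <- quantile(dat$bodysize, probs = seq(0, 1, by = 0.2))
dat$bin <- cut(dat$bodysize, breaks = brks, include.lowest = TRUE)

# Numeric x (mean body size) and y (observed survival rate) per bin
agg_x <- as.numeric(tapply(dat$bodysize, dat$bin, mean))
agg_y <- as.numeric(tapply(dat$survive,  dat$bin, mean))

plot(dat$bodysize, dat$survive, xlab = "Body size", ylab = "Probability of survival")
curve(predict(g, data.frame(bodysize = x), type = "response"), add = TRUE)
points(dat$bodysize, fitted(g), pch = 20)     # predicted points
points(agg_x, agg_y, pch = 3, col = "red")    # aggregate points on the same axes
```

Since points() draws on the existing coordinate system, no second plot() call or par(new=TRUE) is needed.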
You are best off specifying your y-axis explicitly. You could also use par(new=TRUE) to overlay a second plot:
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
#then
par(new=TRUE)
#
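# Group.1 is a factor, so convert it to numeric and reuse the first plot's axis ranges
plot(as.numeric(as.character(AggBd$Group.1)),AggBd$x,pch=3,xlim=range(bodysize),ylim=c(0,1))
To prevent overlap, remove or change the axis ticks and labels on the overlaid plot, e.g. use this call instead:
plot(as.numeric(as.character(AggBd$Group.1)),AggBd$x,pch=3,xlim=range(bodysize),ylim=c(0,1),xaxt="n",yaxt="n",xlab="",ylab="")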