Logistic Regression in R using a loop to save code - r

Using R to do some logistic regression on the Boston crime dataset.
This code works just fine:
#################################
library(MASS)
head(Boston)
?Boston
plot(Boston$zn, Boston$crim) #gives scatter plot
lm(formula=Boston$crim~Boston$zn, data=Boston) #gives slope and intercept of best fit line
lm.Boston <-lm(formula=Boston$crim~Boston$zn, data=Boston) #saves information as lm.Boston
abline(lm.Boston) #plots best fit line (adds to existing plot)
abline(v=mean(Boston$zn),col='red') #plots mean for zn
abline(h=mean(Boston$crim),col='red') #plots mean for crim
summary(Boston$zn)
###############################
But I have to replace the $zn with 13 other variables, and I am trying to do it in a loop to avoid repeating the code block 13 times!
Trying this, but I get an error:
for (i in 2:ncol(Boston)){
clname <- colnames(Boston)[i]
predictor <- paste('Boston$',clname,sep="")
print(predictor)
plot(eval(predictor), Boston$crim) #gives scatter plot
# lm(formula=Boston$crim~predictor, data=Boston) #gives slope and intercept of best fit line
# lm.Boston <-lm(formula=Boston$crim~predictor, data=Boston) #saves information as lm.Boston
# abline(lm.Boston) #plots best fit line Adds on to existing plot
# abline(v=mean(predictor),col='red') #plots mean for the predictor
# abline(h=mean(Boston$crim),col='red') #plots mean for crim
}
The predictor variable seems to be correct when I print it out, but the first plot statement gives an error (I commented out the rest of the code to try and isolate this error).
Here is the error I get:
[1] "Boston$zn" Error in xy.coords(x, y, xlabel, ylabel, log) : 'x'
and 'y' lengths differ

The problem is that eval() applied to the character string "Boston$zn" just returns the string itself (length 1), not the column, so plot() sees 'x' and 'y' of different lengths. Index the data frame by name instead: you can store the column names in a separate vector and iterate over it, or use it directly in the for loop. I have stored them here in a separate vector.
After that, you add the proper axis labels with xlab and ylab:
library(MASS)
columns_boston <- colnames(Boston)[2:ncol(Boston)]
for (i in columns_boston){
  predictor <- Boston[, i]
  plot(predictor, Boston$crim, xlab = i, ylab = "crim") #gives scatter plot
  lm.Boston <- lm(Boston$crim ~ predictor) #saves slope and intercept of best fit line
  abline(lm.Boston) #plots best fit line (adds to existing plot)
  abline(v = mean(predictor), col = 'red') #plots mean for the predictor
  abline(h = mean(Boston$crim), col = 'red') #plots mean for crim
}
Sample output for the last column: [plot omitted]
You can remove the ylab if you want.
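If you also want to keep each fit instead of overwriting lm.Boston on every pass, one option is to store the models in a named list. A minimal sketch, assuming the columns_boston vector from above (reformulate() builds the crim ~ <column> formula from strings):
# One model per predictor, keyed by column name
fits <- setNames(vector("list", length(columns_boston)), columns_boston)
for (i in columns_boston) {
  fits[[i]] <- lm(reformulate(i, response = "crim"), data = Boston)
}
coef(fits[["zn"]]) # intercept and slope for crim ~ zn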

What you want to do (13 different charts) can be done like this:
library(tidyverse)
library(MASS)
plotVar <- function(data, name) data %>%
  ggplot(aes(crim, val)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "glm") +
  ylab(name)

Boston %>% pivot_longer(
  -crim, names_to = "var", values_to = "val"
) %>%
  group_by(var) %>%
  nest() %>%
  group_map(~ plotVar(.x$data[[1]], .y))
[First and last plots omitted]
This, however, is not a logistic regression!
You have to specify exactly what you want to achieve.
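For reference, an actual logistic regression needs a binary (or proportion) response and family = binomial in glm(). A minimal sketch, assuming you dichotomize crim at its median (the high_crim variable is made up for illustration):
library(MASS)
# Hypothetical binary outcome: is a tract's crime rate above the median?
Boston$high_crim <- as.integer(Boston$crim > median(Boston$crim))
logit_fit <- glm(high_crim ~ zn, data = Boston, family = binomial)
summary(logit_fit) # coefficients are on the log-odds scale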

Related

Plot standard error in base r scatterplot [duplicate]

Marked as a duplicate of: How can I plot data with confidence intervals?
I am well aware of how to plot the standard error of a regression using ggplot. As an example with the iris dataset, this can be easily done with this code:
library(tidyverse)
iris %>%
ggplot(aes(x=Sepal.Width,
y=Sepal.Length))+
geom_point()+
geom_smooth(method = "lm",
se=T)
I also know that a regression using base R scatterplots can be achieved with this code:
#### Scatterplot ####
plot(iris$Sepal.Width,
iris$Sepal.Length)
#### Fit Regression ####
fit <- lm(iris$Sepal.Length ~ iris$Sepal.Width)
#### Fit Line to Plot ####
abline(fit, col="red")
I've tried looking up how to plot standard error in base R scatterplots, but all I have found on SO and Google uses error bars. I would instead like to shade the error region the way ggplot does above. How can one accomplish this?
Edit
To manually obtain the standard error of this regression, I believe you would calculate it like so:
#### Derive Standard Error ####
fit <- lm(Sepal.Length ~ Sepal.Width,
iris)
n <- nrow(iris) # number of observations (length(iris) would count columns instead)
df <- n - 2 # degrees of freedom
y.hat <- fitted(fit)
res <- resid(fit)
sq.res <- res^2
ss.res <- sum(sq.res)
se <- sqrt(ss.res/df)
So if this may allow one to fit it into a base R plot, I'm all ears.
Here's a slightly fiddly approach using broom::augment to generate a dataset with predictions and standard errors. You could also do it in base R with predict if you don't want to use broom, but that's a couple of extra lines.
Note: I was puzzled as to why the intervals in my graph are narrower than your ggplot interval in the question. But a look at the geom_smooth documentation suggests that the se=TRUE option adds a 95% confidence interval rather than +-1 se as you might expect. So it's probably better to generate your own intervals rather than letting the graphics package do it! (A sketch that widens the se band to a 95% interval follows the code below.)
#### Fit Regression (note use of `data` argument) ####
fit <- lm(data=iris, Sepal.Length ~ Sepal.Width)
#### Generate predictions and se ####
dat <- broom::augment(fit, se_fit = TRUE)
#### Alternative using `predict` instead of broom ####
dat <- cbind(iris,
             .fitted = predict(fit, newdata = iris),
             .se.fit = predict(fit, newdata = iris, se.fit = TRUE)$se.fit)
#### Now sort the dataset in the x-axis order
dat <- dat[order(dat$`Sepal.Width`),]
#### Plot with predictions and standard errors
with(dat, {
plot(Sepal.Width,Sepal.Length)
polygon(c(Sepal.Width, rev(Sepal.Width)), c(.fitted+.se.fit, rev(.fitted-.se.fit)), border = NA, col = hsv(1,1,1,0.2))
lines(Sepal.Width,.fitted, lwd=2)
lines(Sepal.Width,.fitted+.se.fit, col="red")
lines(Sepal.Width,.fitted-.se.fit, col="red")
})
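If you would rather match ggplot's default band, a minimal sketch (assuming the fit and dat objects from above, and that the plot above is still open) widens the +-1 se band to a 95% interval using a t quantile:
# 95% confidence band: fitted value +/- t quantile * standard error
t_crit <- qt(0.975, df = fit$df.residual)
with(dat, {
  lines(Sepal.Width, .fitted + t_crit * .se.fit, col = "blue", lty = 2)
  lines(Sepal.Width, .fitted - t_crit * .se.fit, col = "blue", lty = 2)
})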

Find Cook's Distance on Predicted Values for LM

Problem
I would like to use Cook's distance to identify outliers in my predicted data.
Background
I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance() (illustrated in Example 1 below).
More Explanation of Problem
When I fit new data with that model (using predict()), I can't see how to get the Cook's distance of the new points, since cooks.distance() only operates on a model object. I understand that it is calculated by a leave-one-out method, iteratively rebuilding the model, so perhaps it doesn't make sense to calculate it on fitted values, but I was hoping I was missing something simple about how one might approach this (a closed-form sketch for the training data follows Example 2 below).
Desired Output
In Example 2 below I show the predicted values in which I'd like to highlight outliers by their Cook's D; since I didn't know how to do that, I used the residuals instead to illustrate something close to my desired output.
Example 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers (avoid `c` as a name; it masks base::c)
cd <- cooks.distance(m)
# visualize outliers with Cook's D
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
     plot(mpg ~ disp,
          col = pal[1 + round(100 * scale(cd, min(cd), max(cd)))],
          pch = 19,
          main = "Color by Cook's D")); abline(m)
Example 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
plot(mpg ~ disp,
pch = 19,
col = pal[1 + round(100 * scale(resid, min(resid), max(resid)))],
main = "Color by Residual")); abline(m)
Created on 2022-03-10 by the reprex package (v2.0.1)
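As an aside, on the training data Cook's distance has a closed form built from each point's residual and leverage, which shows why it is not defined for new points from predict(): those points have no leverage in the fit. A sketch, assuming the model m from Example 1:
# Cook's D: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
h <- hatvalues(m)        # leverage of each training point
p <- length(coef(m))     # number of estimated coefficients
s2 <- summary(m)$sigma^2 # residual variance estimate
d_manual <- resid(m)^2 / (p * s2) * h / (1 - h)^2
all.equal(unname(d_manual), unname(cooks.distance(m))) # TRUE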

Plotting a Regression Line Without Extracting Each Coefficient Separately [duplicate]

Marked as a duplicate of: How to plot a comparison of two fixed categorical values for linear regression of another continuous variable
For my Stats class, we are using R to compute all of our statistics, and we are working with numeric data that also has a categorical factor. The way we currently plot fitted lines is to run lm(), grab the coefficients manually from the summary, create a mesh, and then use the lines() function. I would like an easier way to do this. I have seen the predict() function, but not how to use it along with categories.
For example, the data set found here has 2 numerical variables and one categorical one. I want to be able to plot the line of best fit for men and women in this set without having to extract each coefficient individually, as in my current code below.
bank<-read.table("http://www.uwyo.edu/crawford/datasets/bank.txt",header=TRUE)
fit <-lm(salary~years*gender,data=bank)
summary(fit)
yearhat <- seq(0, max(bank$years), length = 1000) # mesh over years (not salary)
salaryfemalehat = fit$coefficients[1] + fit$coefficients[2]*yearhat
salarymalehat = (fit$coefficients[1] + fit$coefficients[3]) + (fit$coefficients[2] + fit$coefficients[4])*yearhat
Using what you have, you can get the same predicted values with
yearhat <- seq(0, max(bank$years), length = 1000)
salaryfemalehat <- predict(fit, data.frame(years = yearhat, gender = "Female"))
salarymalehat <- predict(fit, data.frame(years = yearhat, gender = "Male"))
To supplement MrFlick, in case of more levels we can try:
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)
fit <- lm(mpg ~ disp*cyl, data = dat)
plot(dat$disp, dat$mpg)
with(dat, {
  xs <- sort(unique(disp)) # sort x so lines() draws cleanly instead of zigzagging
  for (i in levels(cyl)) {
    lines(xs, predict(fit, newdata = data.frame(disp = xs, cyl = i)),
          col = which(levels(cyl) == i))
  }
})

Add raw data points to jp.int (sjPlot)

For my manuscript, I plotted an lme with an interaction of two continuous variables:
Create data
mydata <- data.frame(SID = sample(1:150, 400, replace = TRUE),
                     age = sample(50:70, 400, replace = TRUE),
                     sex = sample(c("Male", "Female"), 400, replace = TRUE),
                     time = seq(0.7, 6.2, length.out = 400),
                     Vol = rnorm(400),
                     HCD = rnorm(400))
mydata$time <- as.numeric(mydata$time)
Run the model:
library(nlme)
model <- lme(HCD ~ age*time + sex*time + Vol*time, random = ~time|SID, data = mydata)
Make the plot:
library(sjPlot)
sjp.int(model, swap.pred = TRUE, show.ci = TRUE, mdrt.values = "meansd")
The reviewer now wants me to add the raw data points to this plot. How can I do this? I tried adding geom_point() referring to mydata, but that did not work.
Any ideas?
Update:
I thought that maybe I could extract the random slope of HCD, residualize HCD and Vol on the covariates, and plot those against each other to make things easier (then I could plot the points in a 2D plot).
So I tried to extract the slopes and use them to fit a linear regression, but the results are different (in the reproducible example less significant; in my data the interaction became non-significant, although it was significant in the lme). I'm not sure what that means, or whether it just shows that I should not try to plot it this way.
Get the slopes:
model <- lme(HCD ~ time, random=~time|SID, data=mydata)
slopes <- rbind(row.names(model$coefficients$random$SID), model$coef$random$SID[,2])
slopes2 <- data.frame(matrix(unlist(slopes), nrow=144, byrow=T))
names(slopes2)[1] <- "SID"
names(slopes2)[2] <- "slopes"
(I save slopes2 to a file and reopen it, because otherwise R somehow treats the slopes column as a factor.)
Then create a cross-sectional dataframe and merge the slopes:
mydata$time2 <- round(mydata$time)
new <- reshape(mydata, idvar = "SID", timevar = "time2", direction = "wide")
newdata <- dplyr::left_join(new, slopes2, by = "SID")
The lm:
modelw <- lm(slopes ~ age.1 + sex.1 + Vol.1, data = newdata)
Vol now has a p-value of 0.8 (previously it was 0.14).
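As an aside, a less error-prone way to pull per-subject slopes out of an lme fit is coef(), which returns the combined fixed-plus-random coefficients with one row per grouping level. A sketch, assuming the random-slope model above:
library(nlme)
# One row per SID; the "time" column holds each subject's slope
subj_coef <- coef(model)
slopes2 <- data.frame(SID = rownames(subj_coef),
                      slopes = subj_coef[, "time"])
head(slopes2)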

Conditionally colour data points outside of confidence bands in R

I need to colour data points that fall outside the confidence bands on the plot below differently from those within the bands. Should I add a separate column to my dataset to record whether each data point is within the confidence bands? Can you provide an example, please?
Example dataset:
## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
# Take a look at the data
plot(
diseasesev~temperature,
data=severity,
xlab="Temperature",
ylab="% Disease Severity",
pch=16,
pty="s",
xlim=c(0,30),
ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
par(new=TRUE) # don't start a new plot
## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
lm(diseasesev~temperature),
xRange,
level=0.95,
interval="confidence"
)
## Plot lines derived from best fit line and confidence band datapoints
matplot(
xRange,
pred4plot,
lty=c(1,2,2), #vector of line types and widths
type="l", #type of plot for each column of y
xlim=c(0,30),
ylim=c(0,30),
xlab="",
ylab=""
)
Well, I thought that this would be pretty easy with ggplot2, but now I realize that I have no idea how the confidence limits for stat_smooth/geom_smooth are calculated.
Consider the following:
library(ggplot2)
pred <- as.data.frame(predict(severity.lm,level=0.95,interval="confidence"))
dat <- data.frame(diseasesev, temperature,
                  in_interval = diseasesev <= pred$upr & diseasesev >= pred$lwr, pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm') + geom_point(aes(colour=in_interval)) +
geom_line(aes(y=lwr),colour=I('red')) + geom_line(aes(y=upr),colour=I('red'))
This produces:
[plot: http://ifellows.ucsd.edu/pmwiki/uploads/Main/strangeplot.jpg]
I don't understand why the confidence band calculated by stat_smooth is inconsistent with the band calculated directly from predict (i.e. the red lines). Can anyone shed some light on this?
Edit:
Figured it out: ggplot2 uses 1.96 * standard error to draw the intervals for all smoothing methods.
pred <- as.data.frame(predict(severity.lm,se.fit=TRUE,
level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature,
in_interval = diseasesev <=pred$fit.upr & diseasesev >=pred$fit.lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm') +
geom_point(aes(colour=in_interval)) +
geom_line(aes(y=fit.lwr),colour=I('red')) +
geom_line(aes(y=fit.upr),colour=I('red')) +
geom_line(aes(y=fit.fit-1.96*se.fit),colour=I('green')) +
geom_line(aes(y=fit.fit+1.96*se.fit),colour=I('green'))
The easiest way is probably to calculate a vector of TRUE/FALSE values that indicate if a data point is inside of the confidence interval or not. I'm going to reshuffle your example a little bit so that all of the calculations are completed before the plotting commands are executed; this provides a clean separation in the program logic that could be exploited if you were to package some of this into a function.
The first part is pretty much the same, except I replaced the additional call to lm() inside predict() with the severity.lm variable; there is no need to use additional computing resources to recalculate the linear model when we already have it stored:
## Dataset from
# apsnet.org/education/advancedplantpath/topics/
# RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
severity.lm,
xRange,
level=0.95,
interval="confidence"
)
Now we'll calculate the confidence intervals for the original data points and run a test to see if the points are inside the interval:
modelConfInt <- predict(
severity.lm,
level = 0.95,
interval = "confidence"
)
insideInterval <- modelConfInt[,'lwr'] < severity[['diseasesev']] &
severity[['diseasesev']] < modelConfInt[,'upr']
Then we'll do the plot: first the high-level plotting function plot(), as you used it in your example, but plotting only the points inside the interval. We will then follow up with the low-level function points(), which plots all the points outside the interval in a different color. Finally, matplot() will be used to fill in the confidence intervals as you used it. However, instead of calling par(new=TRUE), I prefer to pass the argument add=TRUE to high-level functions to make them act like low-level functions.
Using par(new=TRUE) is like playing a dirty trick on a plotting function, which can have unforeseen consequences. The add argument is provided by many functions to make them add information to an existing plot rather than redraw it; I would recommend exploiting this argument whenever possible and falling back on par() manipulations only as a last resort.
# Take a look at the data- those points inside the interval
plot(
diseasesev~temperature,
data=severity[ insideInterval,],
xlab="Temperature",
ylab="% Disease Severity",
pch=16,
pty="s",
xlim=c(0,30),
ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
# Add points outside the interval, color differently
points(
diseasesev~temperature,
pch = 16,
col = 'red',
data = severity[ !insideInterval,]
)
# Add regression line and confidence intervals
matplot(
xRange,
pred4plot,
lty=c(1,2,2), #vector of line types and widths
type="l", #type of plot for each column of y
add = TRUE
)
I liked the idea and tried to make a function for it. Of course, it's far from perfect; your comments are welcome.
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
# Function to plot the linear regression and overlay the confidence intervals
ci.lines <- function(model, conf = .95, interval = "confidence"){
  x <- model$model[[2]] # predictor column of the model frame (was model[[12]][[2]])
  y <- model$model[[1]] # response column
  xm <- mean(x)
  n <- length(x)
  ssx <- sum((x - mean(x))^2)
  s.t <- qt(1 - (1 - conf)/2, n - 2)
  xv <- seq(min(x), max(x), (max(x) - min(x))/100)
  yv <- coef(model)[1] + coef(model)[2]*xv
  se <- switch(interval,
               confidence = summary(model)$sigma * sqrt(1/n + (xv - xm)^2/ssx),
               prediction = summary(model)$sigma * sqrt(1 + 1/n + (xv - xm)^2/ssx)
  )
  ci <- s.t*se
  uyv <- yv + ci
  lyv <- yv - ci
  limits1 <- min(c(x, y))
  limits2 <- max(c(x, y))
  predictions <- predict(model, level = conf, interval = interval)
  insideCI <- predictions[, 'lwr'] < y & y < predictions[, 'upr']
  x_name <- rownames(attr(terms(model), "factors"))[2]
  y_name <- rownames(attr(terms(model), "factors"))[1]
  plot(x[insideCI], y[insideCI],
       pch = 16, pty = "s", xlim = c(limits1, limits2), ylim = c(limits1, limits2),
       xlab = x_name,
       ylab = y_name,
       main = paste("Graph of ", y_name, " vs ", x_name, sep = ""))
  abline(model)
  points(x[!insideCI], y[!insideCI], pch = 16, col = 'red')
  lines(xv, uyv, lty = 2, col = 3)
  lines(xv, lyv, lty = 2, col = 3)
}
Use it like this:
ci.lines(severity.lm, conf= .95 , interval = "confidence")
ci.lines(severity.lm, conf= .85 , interval = "prediction")
