Plot "regression line" from multiple regression in R - r

I ran a multiple regression with several continuous predictors, a few of which came out significant, and I'd like to create a scatterplot or scatter-like plot of my DV against one of the predictors, including a "regression line". How can I do this?
My plot looks like this
D = my.data; plot( D$probCategorySame, D$posttestScore )
If it were simple regression, I could add a regression line like this:
lmSimple <- lm( posttestScore ~ probCategorySame, data=D )
abline( lmSimple )
But my actual model is like this:
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
I would like to add a regression line that reflects the coefficient and intercept from the actual model instead of the simplified one. I think I'd be happy to assume mean values for all other predictors in order to do this, although I'm ready to hear advice to the contrary.
This might make no difference, but I'll mention just in case, the situation is complicated slightly by the fact that I probably will not want to plot the original data. Instead, I'd like to plot mean values of the DV for binned values of the predictor, like so:
D[,'probCSBinned'] = cut( my.data$probCategorySame, as.numeric( seq( 0,1,0.04 ) ), include.lowest=TRUE, right=FALSE, labels=FALSE )
D = aggregate( posttestScore~probCSBinned, data=D, FUN=mean )
plot( D$probCSBinned, D$posttestScore )
Just because it happens to look much cleaner for my data when I do it this way.

To plot the individual terms in a linear or generalised linear model (i.e., one fit with lm or glm), use termplot. No need for binning or other manipulation.
# plot everything on one page
par(mfrow=c(2,3))
termplot(lmMultiple)
# plot individual term
par(mfrow=c(1,1))
termplot(lmMultiple, terms="pretestScore")
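As a self-contained sketch of the same idea, the following uses the built-in mtcars data as a stand-in for the questioner's data (the model and variable names are illustrative only, not from the question). Setting partial.resid = TRUE overlays the partial residuals on the term's line, and plot = FALSE should return the values that would have been plotted instead of drawing them.

```r
# Stand-in model on a built-in dataset; variables are illustrative only
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# One term, with partial residuals and pointwise standard-error bands
termplot(fit, terms = "wt", partial.resid = TRUE, se = TRUE)

# Return the plotted values (one list element per term) instead of drawing
term_data <- termplot(fit, terms = "wt", plot = FALSE)
str(term_data[[1]])
```

Note that termplot centres each term, so the vertical axis shows the term's contribution rather than the raw response.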

You need to create a vector of x-values in the domain of your plot and predict their corresponding y-values from your model. To do this, you need to inject this vector into a dataframe comprised of variables that match those in your model. You stated that you are OK with keeping the other variables fixed at their mean values, so I have used that approach in my solution. Whether or not the x-values you are predicting are actually legal given the other values in your plot should probably be something you consider when setting this up.
Without sample data I can't be sure this will work exactly for you, so I apologize if there are any bugs below, but this should at least illustrate the approach.
# Setup
xmin = 0; xmax=10 # domain of your plot
D = my.data
plot( D$probCategorySame, D$posttestScore, xlim=c(xmin,xmax) )
lmMultiple <- lm( posttestScore ~ pretestScore + probCategorySame + probDataRelated + practiceAccuracy + practiceNumTrials, data=D )
# create a dummy dataframe where all variables = their mean value for each record
# except the variable we want to plot, which will vary incrementally over the
# domain of the plot. We need this object to get the predicted values we
# want to plot.
N <- 1e4
# one row of means, restricted to the variables used in the model, repeated N times
means <- colMeans(D[, all.vars(formula(lmMultiple))], na.rm=TRUE)
dummyDF <- as.data.frame(t(means))[rep(1, N), ]
xv <- seq(xmin, xmax, length.out=N)
# overwrite the predictor we want to plot with the grid of x-values
dummyDF$probCategorySame <- xv
# Getting and plotting predictions over our dummy data.
# predict() ignores the posttestScore column, so it does not need to be dropped.
yv <- predict(lmMultiple, newdata=dummyDF)
lines(xv, yv)
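The approach can also be tried end to end on data that ships with R. The sketch below assumes mtcars as a stand-in for the questioner's data frame (all variable names are illustrative), holds the other predictors at their means, and overlays the resulting line on binned means of the response, as the question describes.

```r
# Stand-in data and model; mtcars replaces the questioner's data frame
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Grid over the predictor of interest, other predictors held at their means
grid <- data.frame(
  wt   = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  hp   = mean(mtcars$hp),
  qsec = mean(mtcars$qsec)
)
grid$mpg_hat <- predict(fit, newdata = grid)

# Mean predictor and mean response within each bin of the predictor
bins   <- cut(mtcars$wt, breaks = 6, include.lowest = TRUE)
binned <- aggregate(mtcars[, c("wt", "mpg")], by = list(bin = bins), FUN = mean)

plot(binned$wt, binned$mpg, xlab = "wt (bin mean)", ylab = "mean mpg")
lines(grid$wt, grid$mpg_hat)

# The slope of the plotted line is the model's coefficient for wt
coef(fit)["wt"]
```

Because the binned means are not adjusted for the other predictors, the line need not pass through them; it shows the model's partial effect, not a fit to the plotted points.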

Look at the Predict.Plot function in the TeachingDemos package for one option to plot one predictor vs. the response at a given value of the other predictors.

Related

Obtaining confidence interval for npreg as values, not as plot

I am using the well known "np" package of Hayfield & Racine for non-parametric regressions. It allows plotting confidence bands for the estimated coefficient based on bootstrap procedures. See the code below for an example.
Question: I am wondering how to obtain these confidence intervals in numerical form. One reason, though not the only one, is that I really don't like how the CIs are presented. More generally, I would like to use and further process the confidence band within my analysis.
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
plot(mynp, plot.errors.method="bootstrap")
When you call plot, it dispatches to the np package's plot method, which is the function npplot.
npplot accepts an argument plot.behavior, which defaults to "plot": the results are plotted and NULL is returned. Set plot.behavior = "plot-data" instead, and the function will both plot and return the data of the object.
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then the values on the line can be accessed through dat$r1$mean, and the offsets to add to the mean to get the upper and lower CI through dat$r1$merr.
Notice that not all values are plotted, only half of them (every other value, and then the last).
Read the help on npplot for more options.
Below is an example of the use of the code and the results:
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then recreating the results:
z <- unlist(dat$r1$eval,use.names = F)
CI.up = as.numeric(dat$r1$mean)+as.numeric(dat$r1$merr[,2])
CI.dn = as.numeric(dat$r1$mean)+as.numeric(dat$r1$merr[,1])
plot(dat$r1$mean~z, cex=1.5,xaxt='n', ylim=c(1.0,3.5),xlab='',ylab='lalala!', main='blahblahblah',col='blue',pch=16)
arrows(z,CI.dn,z,CI.up,code=3,length=0.2,angle=90,col='red')
we will get:
As you can see, the results are the same (except that I have calculated the intervals for every point and not only for half of them).
Note the plot.errors.type argument of npplot, which takes "standard" or "quantiles" and defaults to "standard". When you specify "standard", dat$r1$merr will hold the standard errors and the plot will show mean+std err as intervals. Otherwise the plot will show the quantiles as the intervals, and the quantiles will be saved in dat$r1$merr. Which quantiles to use is specified by plot.errors.quantiles, which is only relevant if plot.errors.type = "quantiles".
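For further processing it can help to collect the band into one data frame. The helper below is a sketch that assumes only the fields described in this answer (eval, mean, and a two-column merr of offsets); bands_to_df is a hypothetical name, and the mock list is hand-made to show the shape rather than real np output.

```r
# Convert one element of the returned list (e.g. dat$r1) into a data frame.
# Field names follow the answer above: eval, mean, merr (lower, upper offsets).
bands_to_df <- function(r) {
  data.frame(
    x     = unlist(r$eval, use.names = FALSE),
    fit   = as.numeric(r$mean),
    lower = as.numeric(r$mean) + as.numeric(r$merr[, 1]),
    upper = as.numeric(r$mean) + as.numeric(r$merr[, 2])
  )
}

# Hand-made stand-in with the same field names, used only for illustration
mock_r1 <- list(
  eval = data.frame(x = 1:5),
  mean = c(1.0, 1.2, 1.5, 1.7, 1.8),
  merr = cbind(rep(-0.2, 5), rep(0.2, 5))
)
bands <- bands_to_df(mock_r1)
bands
```

With a real fit, the call would be bands_to_df(dat$r1), provided the returned object has the structure described above.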

How to draw my function to plot with data in R

I have data on the response time of a web site as a function of the number of users hitting it at the same time.
For example:
10 users hitting at the same time have an (average) response time of 300ms
20 users -> 450ms etc
I import the data into R and plot the two columns (users, response time).
I also use the loess function to draw a line through those points on the plot.
Here's the code that I have written:
users <- seq(5,250, by=5)
responseTime <- c(179.5,234.0,258.5,382.5,486.0,679.0,594.0,703.5,998.0,758.0,797.0,812.0,804.5,890.5,1148.5,1182.5,1298.0,1422.0,1413.5,1209.5,1488.0,1632.0,1715.0,1632.5,2046.5,1860.5,2910.0,2836.0,2851.5,3781.0,2725.0,3036.0,2862.0,3266.0,3175.0,3599.0,3563.0,3375.0,3110.0,2958.0,3407.0,3035.5,3040.0,3378.0,3493.0,3455.5,3268.0,3635.0,3453.0,3851.5)
data1 <- data.frame(users,responseTime)
data1
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
loess_fit <- loess(responseTime ~ users, data1)
lines(data1$users, predict(loess_fit), col = "green")
Here's my plot's image:
My questions are:
How do I draw my nonlinear function on the same plot to compare it with the other lines?
Example: response_time (f(x)) = 30*users^2.
Also, how do I make predictions from the loess line and from my function and show them on the plot? For example, if I have data up to 250 users, predict up to 500 users.
If you know the equation of the line that you want to draw, then just define a variable for your prediction:
predictedResponseTime <- 30 * users ^ 2
lines(users, predictedResponseTime)
If the problem is that you want to fit a line, then you need to call a modelling function.
Since loess is a non-parametric model, it isn't appropriate to use it to make predictions outside the range of your data.
In this case, a simple (ordinary least squares) linear regression using lm provides a reasonable fit.
model <- lm(responseTime ~ users)
prediction <- data.frame(users = 1:500)
prediction$responseTime <- predict(model, prediction)
with(prediction, lines(users, responseTime))
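When extrapolating this far, it is worth showing how uncertain the prediction is. The sketch below uses made-up data (not the questioner's measurements) and the interval argument of predict for lm objects to draw a 95% prediction band out to 500 users.

```r
set.seed(1)
# Made-up data for illustration only
users        <- seq(5, 250, by = 5)
responseTime <- 15 * users + rnorm(length(users), sd = 300)

model     <- lm(responseTime ~ users)
new_users <- data.frame(users = seq(5, 500, by = 5))

# Matrix with columns fit, lwr, upr
pred <- predict(model, newdata = new_users, interval = "prediction", level = 0.95)

plot(users, responseTime, xlim = c(0, 500), ylim = range(responseTime, pred),
     xlab = "Users", ylab = "Response Time (ms)")
lines(new_users$users, pred[, "fit"])
lines(new_users$users, pred[, "lwr"], lty = 2)
lines(new_users$users, pred[, "upr"], lty = 2)
```

The band widens as the predictions move away from the observed range, which is a visual reminder that the straight-line assumption is untested there.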
Another way to plot your curve, when you know the underlying function, is the curve function.
In your example of f(x)=30x^2:
plot(data1, xlab="Users", ylab="Response Time (ms)")
lines(data1)
lines(data1$users, predict(loess_fit), col = "green")
curve(30*x^2,col="red", add=TRUE) #Don't forget the add parameter.

geom_smooth on a subset of data

Here is some data and a plot:
library(ggplot2)
set.seed(18)
data = data.frame(y=c(rep(0:1,3),rnorm(18,mean=0.5,sd=0.1)),colour=rep(1:2,12),x=rep(1:4,each=6))
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+geom_point()+ geom_smooth(method='lm',formula=y~x,se=F)
As you can see the linear regression is highly influenced by the values where x=1.
Can I get the linear regressions calculated for x >= 2 only, but still display the values for x=1 (where y equals either 0 or 1)?
The resulting graph would be exactly the same except for the linear regressions, which would no longer "suffer" from the influence of the values at abscissa = 1.
It's as simple as geom_smooth(data=subset(data, x >= 2), ...). It's not important if this plot is just for yourself, but realize that something like this would be misleading to others if you don't include a mention of how the regression was performed. I'd recommend changing transparency of the points excluded.
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=subset(data, x >= 2)) + geom_point(data=subset(data, x < 2), alpha=.2) +
geom_smooth(data=subset(data, x >= 2), method='lm',formula=y~x,se=F)
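To see numerically what restricting the data does, the per-colour fits can be reproduced with plain lm. This is a sketch using the question's data; fit_by_colour is a hypothetical helper name introduced here.

```r
set.seed(18)
data <- data.frame(
  y      = c(rep(0:1, 3), rnorm(18, mean = 0.5, sd = 0.1)),
  colour = rep(1:2, 12),
  x      = rep(1:4, each = 6)
)

# One row of (intercept, slope) per colour group
fit_by_colour <- function(d) {
  t(sapply(split(d, d$colour), function(g) coef(lm(y ~ x, data = g))))
}

coef_all <- fit_by_colour(data)                  # all points
coef_sub <- fit_by_colour(subset(data, x >= 2))  # points at x = 1 excluded

coef_all
coef_sub
```

Comparing the two tables shows how much the points at x = 1 pull each line.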
The regular lm function has a weights argument which you can use to assign a weight to a particular observation. In this way you can play with the influence that the observation has on the outcome. I think this is a more general way of dealing with the problem than subsetting the data. Of course, assigning weights ad hoc does not bode well for the statistical soundness of the analysis. It is always best to have a rationale behind the weights, e.g. low-weight observations have a higher uncertainty.
I think under the hood ggplot2 uses the lm function so you should be able to pass the weights argument. You can add the weights through the aesthetic (aes), assuming that the weight is stored in a vector:
ggplot(data,aes(x=x,y=y,colour=factor(colour))) +
geom_point()+ stat_smooth(aes(weight = runif(nrow(data))), method='lm')
you could also put weight in a column in the dataset:
ggplot(data,aes(x=x,y=y,colour=factor(colour))) +
geom_point()+ stat_smooth(aes(weight = weight), method='lm')
where the column is called weight.
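One weighting scheme with a clear interpretation is a weight of 0 for the points to be ignored and 1 otherwise. A minimal sketch with plain lm, using the question's data and pooling both colours for brevity, suggests that this reproduces the fit on the subset:

```r
set.seed(18)
data <- data.frame(
  y      = c(rep(0:1, 3), rnorm(18, mean = 0.5, sd = 0.1)),
  colour = rep(1:2, 12),
  x      = rep(1:4, each = 6)
)

# Weight 0 for the points at x = 1, weight 1 for everything else
data$weight <- as.numeric(data$x >= 2)

fit_weighted <- lm(y ~ x, data = data, weights = weight)
fit_subset   <- lm(y ~ x, data = subset(data, x >= 2))

all.equal(coef(fit_weighted), coef(fit_subset))
```

Intermediate weights between 0 and 1 then give a way to down-weight, rather than fully exclude, the points at x = 1.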
I tried @Matthew Plourde's solution, but subset did not work for me: the result was no different from using the original data. I replaced subset with bracket indexing to filter the rows and it worked:
ggplot(data,aes(x=x,y=y,colour=factor(colour)))+
geom_point(data=data[data$x >= 2,]) + geom_point(data=data[data$x < 2,], alpha=.2) +
geom_smooth(data=data[data$x >= 2,], method='lm',formula=y~x,se=F)

grouped loess scores

I've got this data.frame: http://sprunge.us/TMGS, and I'd like to calculate the loess of Intermediate.MAP.Score ~ x, so I get one curve from the whole dataset. But every group (by name) should have the same weight as every other group, and I'm not sure what happens if I call loess over the whole data.frame. Do I need to call it once per group and combine the results? If yes, how do I do that?
If you want to average over all of the values in 'loess.fits' produced in my earlier answer to a different question, you will get one answer. If you want to just get a loess fit on the entire dataset (which would not fit your "equal weighting" spec at least as I interpret that phrase), you will get another answer.
This would produce averaged 'yhat' values at the 51 equally spaced data values for 'x' in the range of [0,1]. Because of missing values, it may not be exactly "equally weighted" but only at the extremes. The estimates are dense elsewhere:
apply( as.data.frame(loess.fits), 1, mean, na.rm=TRUE)
Earlier answer:
I would have titled the question "loess scores split by group":
plot(dat$x, dat$Intermediate.MAP.Score, col=as.numeric(factor(dat$name)) )
If you proceed with loess(Intermediate.MAP.Score ~ x, data=dat) you will get an overall average with no distinction among groups. And loess doesn't accept factor or character arguments in its formula. You need to split by 'name' and calculate separately. The other gotcha to avoid is plotting on the default limits, which will be driven by varying data ranges:
loess.fits <- lapply(split(dat, dat$name), function(xdf) {
    list( yhat=predict(
        loess(Intermediate.MAP.Score ~ x,
              data=xdf[ complete.cases(
                          xdf[ , c("Intermediate.MAP.Score", "x") ] ), ] ),
        newdata=data.frame(x=seq(0,1,by=0.02)) ) ) })
plot(dat$x, dat$Intermediate.MAP.Score,
     col=as.numeric(factor(dat$name)),
     ylim=c(0.2,1) )
lapply(loess.fits, function(xdf) {
    par(new=TRUE)
    # so the plots can be compared to predictions
    plot(x=seq(0,1,by=0.02), y=xdf$yhat,
         ylab="", xlab="",
         ylim=c(0.2,1), axes=FALSE) })
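The difference between averaging per-group fits and fitting the pooled data can be sketched on made-up data. Everything below (group sizes, the score column, the noise level) is an assumption for illustration and not the questioner's data set.

```r
set.seed(42)
# Made-up data: three groups of deliberately unequal size
sizes <- c(a = 40, b = 80, c = 25)
dat <- data.frame(name = rep(names(sizes), times = sizes))
dat$x     <- runif(nrow(dat))
dat$score <- 0.3 + 0.5 * dat$x + rnorm(nrow(dat), sd = 0.05)

# Common grid of 51 x-values on [0, 1]
grid <- data.frame(x = seq(0, 1, by = 0.02))

# One loess fit per group, predicted on the common grid (one column per group)
fits <- sapply(split(dat, dat$name), function(g) {
  predict(loess(score ~ x, data = g), newdata = grid)
})

# Equal weight per group: average across columns, skipping NA predictions
group_avg <- rowMeans(fits, na.rm = TRUE)

# For comparison: one fit on the pooled data, where larger groups dominate
pooled <- predict(loess(score ~ x, data = dat), newdata = grid)

plot(dat$x, dat$score, col = as.numeric(factor(dat$name)))
lines(grid$x, group_avg, lwd = 2)
lines(grid$x, pooled, lty = 2)
```

Grid points outside a group's observed x-range come back as NA from predict, which is why the averaging skips them.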

Create function to automatically create plots from summary(fit <- lm( y ~ x1 + x2 +... xn))

I am running the same regression several times with small alterations of the x variables. My aim, after having determined the fit and significance of each variable in this linear regression model, is to view all the major plots. Instead of having to create each plot one by one, I want a function to loop through my variables (x1...xn) from the following model.
fit <- lm( y ~ x1 + x2 + ... + xn )
The plots I want to create for all x are
1) 'x versus y' for all x in the function above
2) 'x versus predicted y'
3) 'x versus residuals'
4) 'x versus time', where time is not a variable used in the regression but provided in the dataframe the data comes from.
I know how to access the coefficients from fit, however I am not able to use the coefficient names from the summary and reuse them in a function for creating the plots, as the names are characters.
I hope my question has been clearly described and hasn't been asked already.
Thanks!
Create some mock data
dat <- data.frame(x1=rnorm(100), x2=rnorm(100,4,5), x3=rnorm(100,8,27),
x4=rnorm(100,-6,0.1), t=(1:100)+runif(100,-2,2))
dat <- transform(dat, y=x1+4*x2+3.6*x3+4.7*x4+rnorm(100,3,50))
Make the fit
fit <- lm(y~x1+x2+x3+x4, data=dat)
Compute the predicted values
dat$yhat <- predict(fit)
Compute the residuals
dat$resid <- residuals(fit)
Get a vector of the variable names
vars <- names(coef(fit))[-1]
A plot can be made using this character representation of the name if you use it to build a string version of a formula and translate that. The four plots are below, and they are wrapped in a loop over all the vars. Additionally, this is surrounded by setting ask to TRUE so that you get a chance to see each plot. Alternatively, you can arrange multiple plots on the screen, or write them all to files to review later.
opar <- par(ask=TRUE)
for (v in vars) {
    plot(as.formula(paste("y~",v)), data=dat)
    plot(as.formula(paste("yhat~",v)), data=dat)
    plot(as.formula(paste("resid~",v)), data=dat)
    plot(as.formula(paste("t~",v)), data=dat)
}
par(opar)
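The "write them all to files" variant can be sketched as follows. The mock data, the helper name plot_predictor, and the output file name are assumptions made for this example; reformulate builds the formula from character names, as an alternative to pasting a string together.

```r
set.seed(1)
# Small mock data set, for illustration only
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100, 4, 5), t = 1:100)
dat$y <- dat$x1 + 4 * dat$x2 + rnorm(100)

fit       <- lm(y ~ x1 + x2, data = dat)
dat$yhat  <- predict(fit)
dat$resid <- residuals(fit)

# Four plots for one predictor, given its name as a character string
plot_predictor <- function(v, data) {
  for (resp in c("y", "yhat", "resid", "t")) {
    plot(reformulate(v, response = resp), data = data,
         main = paste(resp, "vs", v))
  }
}

# One PDF page per predictor, four panels per page
out_file <- file.path(tempdir(), "predictor_plots.pdf")
pdf(out_file)
par(mfrow = c(2, 2))
for (v in attr(terms(fit), "term.labels")) plot_predictor(v, dat)
dev.off()
```

For plain numeric predictors, attr(terms(fit), "term.labels") gives the same names as names(coef(fit))[-1].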
The coefficients are stored in the fit objects as you say, but you can access them generically in a function by referring to them this way:
x <- 1:10
y <- x*3 + rnorm(10) # one noise value per observation
plot(x,y)
fit <- lm(y~x)
fit$coefficients[1] # intercept
fit$coefficients[2] # slope
str(fit) # a lot of info, but you can see how the fit is stored
My guess is that when you say you know how to access the coefficients, you are getting them from summary(fit), which is a bit harder to access than taking them directly from the fit. By using fit$coefficients[1] etc., you don't need the name of the variable in your function.
Three options to directly answer what I think was the question: How to access the coefficients using character arguments:
x <- 1:10
y <- x*3 + rnorm(10) # one noise value per observation
fit <- lm(y~x)
# 1
fit$coefficients["x"]
# 2
coefname <- "x"
fit$coefficients[coefname]
# 3
coef(fit)[coefname]
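If the significance of each variable is also needed, the same character-based indexing works on the coefficient table of the summary. This is a small sketch on made-up data; the column names are those of the matrix returned by coef(summary(fit)) for an lm fit.

```r
set.seed(1)
# Made-up data, for illustration only
x <- 1:10
y <- 3 * x + rnorm(10)
fit <- lm(y ~ x)

# Matrix with one row per coefficient and columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coef_table <- coef(summary(fit))

coefname <- "x"
coef_table[coefname, "Estimate"]   # same value as coef(fit)[coefname]
coef_table[coefname, "Pr(>|t|)"]   # p-value for this coefficient
```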
If the question was how to plot the various functions then you should supply a sufficiently complex construction (in R) to allow demonstration of methods with a well-specified set of objects.
