Plotting a Regression Line Without Extracting Each Coefficient Separately [duplicate] - r

This question already has answers here:
How to plot a comparisson of two fixed categorical values for linear regression of another continuous variable
For my stats class, we use R to compute all of our statistics, and we are working with numeric data that also has a categorical factor. We currently plot fitted lines by calling lm(), reading the coefficients off the summary by hand, building a grid of x values, and then calling lines(). I would like an easier way to do this. I have seen the predict() function, but not how to use it with categories.
For example, the data set found here has two numerical variables and one categorical variable. I want to be able to plot the line of best fit for men and women in this set without having to extract each coefficient individually, as in my current code below.
bank<-read.table("http://www.uwyo.edu/crawford/datasets/bank.txt",header=TRUE)
fit <-lm(salary~years*gender,data=bank)
summary(fit)
yearhat<-seq(0,max(bank$years),length=1000) # grid over years (the predictor), not salary
salaryfemalehat=fit$coefficients[1]+fit$coefficients[2]*yearhat
salarymalehat=(fit$coefficients[1]+fit$coefficients[3])+(fit$coefficients[2]+fit$coefficients[4])*yearhat

Using what you have, you can get the same predicted values with
yearhat<-seq(0,max(bank$years),length=1000) # grid over years (the predictor), not salary
salaryfemalehat <- predict(fit, data.frame(years=yearhat, gender="Female"))
salarymalehat <- predict(fit, data.frame(years=yearhat, gender="Male"))
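As a rough check that predict() reproduces the hand-built coefficient arithmetic, here is a minimal sketch on simulated data. The data frame bank_sim and all the numbers in it are made up for illustration; only the column names years, salary and gender mirror the question, and the "Female"/"Male" level names are assumed from the answer above.

```r
# Simulated stand-in for the bank data (hypothetical values, not the real file)
set.seed(1)
bank_sim <- data.frame(
  years  = runif(60, 0, 30),
  gender = factor(rep(c("Female", "Male"), each = 30))
)
bank_sim$salary <- 30000 + 800 * bank_sim$years +
  ifelse(bank_sim$gender == "Male", 5000 + 200 * bank_sim$years, 0) +
  rnorm(60, sd = 3000)

fit_sim <- lm(salary ~ years * gender, data = bank_sim)

# Grid over the predictor, then one predict() call per group
yearhat    <- seq(0, max(bank_sim$years), length.out = 100)
female_hat <- predict(fit_sim, data.frame(years = yearhat, gender = "Female"))
male_hat   <- predict(fit_sim, data.frame(years = yearhat, gender = "Male"))

# The same lines built by hand from the four coefficients
b <- coef(fit_sim)
manual_female <- b[1] + b[2] * yearhat
manual_male   <- (b[1] + b[3]) + (b[2] + b[4]) * yearhat

# Largest discrepancy between the two approaches
max(abs(male_hat - manual_male))

# Scatter plot with one fitted line per group
plot(bank_sim$years, bank_sim$salary, col = as.integer(bank_sim$gender),
     xlab = "years", ylab = "salary")
lines(yearhat, female_hat, col = 1)
lines(yearhat, male_hat, col = 2)
```

The predict() route keeps working unchanged if the model formula gains more terms, whereas the manual version has to be rewritten each time.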

To supplement MrFlick, in case of more levels we can try:
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)
fit <- lm(mpg ~ disp*cyl, data = dat)
plot(dat$disp, dat$mpg)
with(dat,
  for (i in levels(cyl)) {
    lines(disp, predict(fit, newdata = data.frame(disp = disp, cyl = i)),
          col = which(levels(cyl) == i))
  }
)
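A variation on the same idea, sketched under the assumption that an evenly spaced grid is wanted rather than the observed disp values: expand.grid() builds every combination of grid point and factor level, so a single predict() call covers all groups. The names grid, mpg_hat and lev are illustrative only.

```r
dat <- mtcars
dat$cyl <- factor(dat$cyl)
fit <- lm(mpg ~ disp * cyl, data = dat)

# Every combination of a disp grid point and a cyl level
grid <- expand.grid(
  disp = seq(min(dat$disp), max(dat$disp), length.out = 50),
  cyl  = levels(dat$cyl)
)
grid$mpg_hat <- predict(fit, newdata = grid)

# Points coloured by group, then one fitted line per level
plot(dat$disp, dat$mpg, col = as.integer(dat$cyl), xlab = "disp", ylab = "mpg")
for (lev in levels(dat$cyl)) {
  rows <- grid$cyl == lev
  lines(grid$disp[rows], grid$mpg_hat[rows],
        col = match(lev, levels(dat$cyl)))
}
```

Note that each line is drawn across the full disp range here, so it extrapolates beyond the data observed for some groups.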

Related

Predict Future values using polynomial regression in R

I was trying to predict future values of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y= 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the graph is plotted for future values, the resulting y values are completely different from what was expected. Instead of a wave pattern, I get a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predicted results for futurY were drastically different from what was expected.
The following sample R code was used to get the results:
dfram <- data.frame('x'=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram$x,dfram$y,type="l", lwd=3)
pred <- data.frame('x'=c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x,5), data=dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[,1],data.frame(newdata)[,1],type="l",col="red", lwd=3)
Is this the correct technique for predicting the unknown future y values, or should I be using other techniques such as forecasting?
# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
From your graph I read off the phase and period of the signal. There are better ways of calculating these automatically.
# Phase and period
fase = 1
per = 10
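One possible way to estimate the period automatically, offered only as a rough sketch and not as the method used in this answer: shift the series against itself and pick the lag with the smallest mean absolute mismatch. The names mismatch and per_est are made up for this example, and the phase is still set by hand.

```r
y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
n <- length(y)

# For each candidate lag k, compare y[t] with y[t + k] over the overlapping part
lags <- 1:(n - 1)
mismatch <- sapply(lags, function(k) mean(abs(y[(k + 1):n] - y[1:(n - k)])))

# The lag with the smallest mismatch is taken as the period estimate
per_est <- lags[which.min(mismatch)]
per_est  # 10 for this noise-free sample
```

With noisy data the largest lags rest on very few overlapping points, so it would be safer to restrict the search to lags that leave a reasonable overlap.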
In the linear model formula I've put the triangular signal equations.
fit <- lm(y ~ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * (x-fase)%%(per/2))
+ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * ((per/2)-((x-fase)%%(per/2))))
,data=dfram)
# Predict the old data
p_olddata <- predict(fit,type="response")
# Predict the new data
newdata <- data.frame('x'=c(17,18,19,20,21,22))
p_newdata <- predict(fit,newdata,type="response")
# Plotting old and new data
plot(x=c(dfram$x,newdata$x),
y=c(p_olddata,p_newdata),
col=c(rep("blue",length(p_olddata)),rep("green",length(p_newdata))),
xlab="x",
ylab="y")
lines(dfram)
Where the black line is the original signal, the blue circles are the prediction for the original points and the green circles are the prediction for the new data.
The graph shows a perfect fit for the model because there's no noise in the data. A real dataset will usually contain noise, so the fit will not look as clean as this.

How to subset a range of values in lm()

The help file for lm() doesn't go into the syntax for the subset argument. I am not sure how to get it to find the line of best fit for only a portion of my data set. This question is similar, but I wasn't able to solve my particular problem using it. How does the subset argument work in the lm() function?
Here is my code:
with(dat[dat$SIZE <7 & dat$SIZE > 0.8 ,], plot(SP.RICH~SIZE, log="x",
xlim=c(1,9), ylim=c(60,180), ylab="plant species richness",
xlab="log area (ha)", type="n"))
with(dat[dat$SIZE <7 & dat$SIZE > 0.8 ,], points(SP.RICH~SIZE, pch=20, cex=1))
fit=lm(SP.RICH~SIZE, subset=c(1:7))
I would like to make sure that the regression line is drawn only for the values that I subset above in the plot() and points() commands.
The subset argument in lm() and other model-fitting functions takes an expression that is evaluated in the environment of the data frame. It is usually a logical vector with one element per row, although any vector that can index rows (such as row numbers) also works. So, if I understand you correctly, I would use the following:
fit <- lm(SP.RICH~SIZE, data=dat, subset=(SIZE>0.8 & SIZE<7))
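To illustrate, here is a small sketch on the built-in mtcars data (standing in for dat, which is not available here) showing that the subset argument gives the same fit as filtering the data frame first, and how the fitted line can be drawn only across the range of the subset. The names dat_demo, keep and x_rng are illustrative.

```r
dat_demo <- mtcars
keep <- dat_demo$wt > 2 & dat_demo$wt < 4

# Same condition expressed through subset= and through explicit row filtering
fit_subset   <- lm(mpg ~ wt, data = dat_demo, subset = (wt > 2 & wt < 4))
fit_filtered <- lm(mpg ~ wt, data = dat_demo[keep, ])
all.equal(coef(fit_subset), coef(fit_filtered))

# Plot only the subset and draw the line over the subset's x range
plot(mpg ~ wt, data = dat_demo[keep, ], pch = 20)
x_rng <- range(dat_demo$wt[keep])
lines(x_rng, predict(fit_subset, newdata = data.frame(wt = x_rng)))
```

Using lines() with predict() over a chosen range, rather than abline(fit), keeps the line from extending past the subset.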
But the above solution does not help if you want to run one lm for each group in your data. Let's say that you have different countries as a column and you want to understand the relationship between richness and size within each country.
For that I recommend following the help page for the by function in R, http://astrostatistics.psu.edu/su07/R/html/base/html/by.html:
require(stats)
attach(warpbreaks)
by(warpbreaks[, 1:2], tension, summary)
by(warpbreaks[, 1], list(wool = wool, tension = tension), summary)
by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))
## now suppose we want to extract the coefficients by group
tmp <- by(warpbreaks, tension, function(x) lm(breaks ~ wool, data = x))
sapply(tmp, coef)
From the list tmp you can extract any lm parameters you like.
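An equivalent sketch that avoids attach(), assuming the same built-in warpbreaks data: split() cuts the data frame into one piece per group and lapply() fits a model to each piece. The names fits and coefs are illustrative.

```r
# One data frame per tension level, then one lm per piece
fits <- lapply(split(warpbreaks, warpbreaks$tension),
               function(d) lm(breaks ~ wool, data = d))

# Coefficients side by side, one column per group
coefs <- sapply(fits, coef)
coefs

# Any other quantity can be pulled out the same way, e.g. R-squared
sapply(fits, function(f) summary(f)$r.squared)
```

The result is an ordinary named list, so fits[["M"]] returns the full lm object for that group.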

order random effects plot by size of parameter estimates (using nlme)

I want to make a random effects plot using the plot method for ranef objects (plot.ranef.lme).
library(nlme)
x <- Orthodont
# change factor to unordered for this example
x$Subject <- factor(x$Subject, ordered=FALSE)
m <- lme(distance ~ age, x, random = ~ 1 | Subject)
re <- ranef(m)
plot(re)
Above, the order of the factor on the y-axis follows the order of the factor levels.
Now, I want the order of the levels to correspond to the size of the random effect parameters. The best I could come up with is to reorder the factor levels by using the random effects parameters after estimating the model, reorder the factor and estimate the model again. This is clumsy to say the least, but I was unable to get this done via some arguments in the plot method (I am not very familiar with lattice).
o <- order(re[, 1])
x$Subject <- factor(x$Subject, levels=levels(x$Subject)[o])
m <- lme(distance ~ age, x, random = ~ 1 | Subject)
re <- ranef(m)
plot(re)
This is what I want but without using the clumsy approach above.
How can I do this in a more sensible way?
I don't think there is a parameter that can be used to change the order of the levels. You have to do it by hand.
That said, you can draw your own dotplot from the re object and use reorder to order the factor.
library(lattice)
dat = data.frame(x= row.names(re),y=re[,attr(re,'effectName')])
dotplot(reorder(x,y)~y,data=dat)

Plot each predictor variable from multivariate GLM versus response (other predictors held constant)

I can plot one predictor variable (from a multivariate logistic, binomial GLM) versus the predicted response. I do it like this:
library(ggplot2) # needed for the ggplot() call below
m3 <- mtcars # example with mtcars
model = glm(vs~cyl+mpg+wt+disp+drat,family=binomial, data=m3)
newdata <- m3
newdata$cyl <- mean(m3$cyl)
newdata$mpg <- mean(m3$mpg)
newdata$wt <- mean(m3$wt)
newdata$disp <- mean(m3$disp)
newdata$drat <- m3$drat
newdata$vs <- predict(model, newdata = newdata, type = "response")
ggplot(newdata, aes(x = drat, y = vs)) + geom_line()
Above, drat is plotted against vs with all other predictors held constant. However, I would like to do this for each of the predictor variables, and repeating the above process each time seems tedious. Is there a smarter way to do this? I'd like to visualize the response to each of the different predictors and eventually, perhaps, at different constants.
Check the response.plot2 function in the biomod2 package. It was developed to create response curves for species distribution models, but it essentially does what you need: it generates a multi-panel plot with responses for each variable used in your model. It also outputs the data into a data structure that can then be used to plot in whichever way you like.
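If an extra package is not wanted, the manual approach from the question can be wrapped in a loop. The following base R sketch is one possible way to do that; response_curve is a hypothetical helper written for this example (it is not part of biomod2), and it holds the other predictors at their means as in the question. The glm call may emit warnings on this small data set.

```r
m3 <- mtcars
model <- glm(vs ~ cyl + mpg + wt + disp + drat, family = binomial, data = m3)
predictors <- c("cyl", "mpg", "wt", "disp", "drat")

# Hypothetical helper: vary one predictor over its range, fix the rest at their means
response_curve <- function(model, data, vary, predictors, n = 50) {
  base <- as.data.frame(lapply(data[predictors], function(col) rep(mean(col), n)))
  base[[vary]] <- seq(min(data[[vary]]), max(data[[vary]]), length.out = n)
  data.frame(predictor = vary,
             value     = base[[vary]],
             prob      = predict(model, newdata = base, type = "response"))
}

# One curve per predictor, stacked into a single long data frame
curves <- do.call(rbind, lapply(predictors, function(p)
  response_curve(model, m3, p, predictors)))

# One panel per predictor
op <- par(mfrow = c(2, 3))
for (p in predictors) {
  d <- curves[curves$predictor == p, ]
  plot(d$value, d$prob, type = "l", ylim = c(0, 1), xlab = p, ylab = "P(vs = 1)")
}
par(op)
```

To hold the other predictors at different constants, the mean() call inside the helper could be swapped for a median or a user-supplied value.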

Linear Regression in R with variable number of explanatory variables [duplicate]

Possible Duplicate:
Specifying formula in R with glm without explicit declaration of each covariate
how to succinctly write a formula with many variables from a data frame?
I have a vector of Y values and a matrix of X values that I want to perform a multiple regression on (i.e. Y = X[column 1] + X[column 2] + ... X[column N])
The problem is that the number of columns in my matrix (N) is not prespecified. I know in R, to perform a linear regression you have to specify the equation:
fit = lm(Y~X[,1]+X[,2]+X[,3])
But how do I do this if I don't know how many columns are in my X matrix?
Thanks!
Three ways, in increasing level of flexibility.
Method 1
Run your regression with the matrix itself on the right-hand side:
fit <- lm( Y ~ X )
Method 2
Put all your data in one data.frame, not two:
dat <- cbind(data.frame(Y=Y),as.data.frame(X))
Then run your regression using the formula notation:
fit <- lm( Y~. , data=dat )
Method 3
Another way is to build the formula yourself:
model1.form.text <- paste("Y ~",paste(xvars,collapse=" + "),collapse=" ")
model1.form <- as.formula( model1.form.text )
model1 <- lm( model1.form, data=dat )
In this example, xvars is a character vector containing the names of the variables you want to use.
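As a small sketch of Method 3, the formula can also be assembled with reformulate() from the stats package instead of paste() and as.formula(). The data below are simulated purely for illustration, and the column names x1 to x4 are assumptions.

```r
# Simulated Y vector and X matrix with an arbitrary number of columns
set.seed(42)
X <- matrix(rnorm(200), ncol = 4, dimnames = list(NULL, paste0("x", 1:4)))
Y <- drop(X %*% c(1, -2, 0.5, 0)) + rnorm(50, sd = 0.1)

# One data frame holding the response and all predictors
dat <- cbind(data.frame(Y = Y), as.data.frame(X))

# Build Y ~ x1 + x2 + ... from whatever columns are present
xvars <- setdiff(names(dat), "Y")
form  <- reformulate(xvars, response = "Y")
form

fit_formula <- lm(form, data = dat)
fit_dot     <- lm(Y ~ ., data = dat)

# Both routes fit the same model
all.equal(coef(fit_formula), coef(fit_dot))
```

Building the formula explicitly is mainly useful when only some of the columns should enter the model, since xvars can be any subset of the names.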
