Residuals from R regression of a data subset - r

I would like to get regression residuals, but only for a subset of the data.
My R code:
reg = lm(Y ~ X1+X2+.....+Xn,data=fic)
step_reg = step(reg, direction= "both")
summary(step_reg)
fic is a dataframe with n columns called X1, X2, ...Xn.
To get all residuals: step_reg$residuals
But I would like to get the residuals only for rows that meet a criterion, for example X1 == 'xxxx'.
What could be the solution, please?

You can subset the residuals using the same data frame that was passed to the regression:
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
step_reg <- step(reg, direction= "both")
step_reg$residuals[iris$Species=="setosa"]
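If there are missing values, lm() drops those rows, so the residual vector is shorter than the data. Index the data by the names of the residuals instead: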
x <- iris
x[1,2] <- NA
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=x)
reg$residuals[x[names(reg$residuals), "Species"] == "setosa"]
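As an alternative for the missing-value case, here is a minimal sketch. Fitting with na.action = na.exclude makes residuals() return a vector padded with NA to the full length of the data, so a plain logical index on the data frame lines up again. The names reg_ex, res_all and res_setosa are only illustrative.

```r
# Copy of iris with one missing predictor value (row 1 is a setosa flower)
x <- iris
x[1, 2] <- NA

# na.exclude drops the incomplete row for fitting but remembers its position
reg_ex <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
             data = x, na.action = na.exclude)

# residuals() (unlike reg_ex$residuals) is padded with NA up to nrow(x)
res_all <- residuals(reg_ex)

# Logical indexing on the data now lines up row by row
res_setosa <- res_all[x$Species == "setosa"]
```

The residual for the incomplete row comes back as NA, so drop it with na.omit() or !is.na() if only fitted rows are wanted.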

Related

Loop for multiple linear regression

Hi, I'm starting to use R and am stuck on analyzing my data. I have a data frame with 80 columns. Column 1 is the dependent variable, and columns 2 to 80 are the independent variables. I want to perform 78 multiple linear regressions, keeping the first independent variable (column 2) fixed in every model, and save all the regressions in a list so that I can later compare the models using AIC scores. How can I do it?
Here is my loop
data.frame
for(i in 2:80)
{
Regressions <- lm(data.frame$column1 ~ data.frame$column2 + data.frame [,i])
}
Using the iris dataset as an example you can do:
lapply(seq_along(iris)[-c(1:2)], function(x) lm(data = iris[,c(1:2, x)]))
[[1]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Length
2.2491 0.5955 0.4719
[[2]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Width
3.4573 0.3991 0.9721
[[3]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Speciesversicolor Speciesvirginica
2.2514 0.8036 1.4587 1.9468
This works because, when you pass a data frame to lm() without a formula, it applies the function DF2formula() under the hood, which treats the first column as the response and all other columns as predictors.
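To check which formula lm() ends up with, one can call formula() on the data frame directly. This is a small illustration rather than part of the original answer, and the name f is arbitrary.

```r
# formula() on a data frame uses the first column as the response
# and the remaining columns as predictors
f <- formula(iris[, c(1:2, 3)])
f

# The variables in the formula, response first
all.vars(f)
```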
With a for loop, we can initialize a list to store the output
nm1 <- names(df1)[3:80]  # column2 is the fixed predictor, so the varying ones start at column 3
Regressions <- vector('list', length(nm1))
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(reformulate(c("column2", nm1[i]), "column1"), data = df1)
}
Or use paste0 with as.formula instead of reformulate
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(as.formula(paste0("column1 ~ column2 + ",
nm1[i])), data = df1)
}
Using a reproducible example
nm2 <- names(iris)[3:5]
Regressions2 <- vector('list', length(nm2))
for(i in seq_along(Regressions2)) {
Regressions2[[i]] <- lm(reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"), data = iris)
}
Regressions2[[1]]
#Call:
#lm(formula = reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"),
# data = iris)
#Coefficients:
# (Intercept) Sepal.Width Petal.Length
# 2.2491 0.5955 0.4719
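Since the goal in the question is to compare the models by AIC, here is a hedged sketch of that last step on the iris example. It uses lapply() rather than the for loop so that the list can be named in one go; the names aic_scores and best are illustrative.

```r
# Fit Sepal.Length ~ Sepal.Width + <one extra predictor> for each remaining column
nm2 <- names(iris)[3:5]
Regressions2 <- lapply(nm2, function(v) {
  lm(reformulate(c("Sepal.Width", v), response = "Sepal.Length"), data = iris)
})
names(Regressions2) <- nm2

# One AIC value per model; lower is better
aic_scores <- sapply(Regressions2, AIC)
aic_scores[order(aic_scores)]

# Name of the extra predictor whose model has the lowest AIC
best <- names(which.min(aic_scores))
```

Naming the list by the varying predictor keeps the AIC vector readable, which helps because the printed Call of each model only shows the reformulate() expression.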

Extracting a list of R2 from within lm() based on variable in multiple regression in R

I've performed a multiple regression analysis on a dataset in R using lm() and I am able to extract the coefficients for each day of year using the function below. I would also like to extract the R2 for each day of year but this doesn't seem to work in the same way.
This is pretty much the same question as:
Print R-squared for all of the models fit with lmList
but when I try this I get 'Error: $ operator is invalid for atomic vectors'. I would also like to include it in the same function if possible. How can I extract the R2 for each doy in this way?
#Create MR function for extracting coefficients
getCoef <- function(df) {
coefs <- lm(y ~ T + P + L + T * L + P * L, data = df)$coef
names(coefs) <- c("intercept", "T", "P", "L", "T_L", "P_L")
coefs
}
#Extract coefficients for each doy
coefs.MR_uM <- ddply(MR_uM, ~ doy, getCoef)
The point is that r.squared is stored in summary(lm(...)), not in lm(...). Here is another version of your function that also extracts R2:
library(plyr)
df <- iris
#Create MR function for extracting coefficients and R2
getCoef <- function(df) {
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = df)
coefs <- model$coef
names(coefs) <- c("intercept", "Sepal.Width", "Petal.Length", "Petal.Width")
R2 <- summary(model)$r.squared
names(R2) <- c("R2")
c(coefs, R2)
}
#Extract coefficients and R2 for each Species
coefs.MR_uM <- ddply(df, ~ Species, getCoef)
coefs.MR_uM # output
#      Species intercept Sepal.Width Petal.Length Petal.Width        R2
# 1     setosa  2.351890   0.6548350    0.2375602   0.2521257 0.5751375
# 2 versicolor  1.895540   0.3868576    0.9083370  -0.6792238 0.6050314
# 3  virginica  0.699883   0.3303370    0.9455356  -0.1697527 0.7652193
As suggested by Parfait, you don't need plyr::ddply(); you can use do.call(rbind, by(df, df$Species, getCoef)) instead.
Hope this helps!
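For reference, the base R variant mentioned above can be sketched as follows. This assumes the same model as in the answer; the helper is redefined here so the snippet runs without plyr, and it keeps R's default coefficient names instead of renaming them.

```r
# Coefficients plus R-squared for one group of rows
getCoef <- function(df) {
  model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = df)
  c(coef(model), R2 = summary(model)$r.squared)
}

# by() splits iris by Species and applies getCoef; rbind stacks the results
res <- do.call(rbind, by(iris, iris$Species, getCoef))
res
```

The result is a numeric matrix with one row per species rather than a data frame, so wrap it in as.data.frame() if the grouping variable is needed as a column.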

How to plot decision boundaries for a Linear Discriminant Analysis plot in R using 3 input variables

I would like to plot the decision boundaries of LDA for a matrix with 3 input variables and 2 classes. I could find some code for plotting the boundaries if only 2 input variables are given to LDA, but the code I found for 3 input variables gives an incorrect boundary.
# With 2 input variables
library(MASS)  # provides lda()
attach(iris)
index=Species!="versicolor"
iris=iris[index,]
LDA <- lda(Species ~ Sepal.Length + Sepal.Width, data=iris)
GS <- 500
x1 <- seq(min(Sepal.Length), max(Sepal.Length), len=GS)
x2 <- seq(min(Sepal.Width), max(Sepal.Width), len=GS)
x <- expand.grid(x1, x2)
newdat <- data.frame(Sepal.Length=x[,1], Sepal.Width=x[,2])
lda.Ghat <- as.numeric(predict(LDA, newdata=newdat)$class)
plot(Sepal.Length,Sepal.Width,col=Species)
contour(x1, x2, matrix(lda.Ghat, GS,GS),
levels=c(1,2),add=TRUE,drawlabels=FALSE, col="red")
legend("topright",legend=c('setosa','virginica'),fill=c("black","green"))
# With 3 input variables
LDA <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length,data=iris)
GS <- 500
x1 <- seq(min(Sepal.Length), max(Sepal.Length), len=GS)
x2 <- seq(min(Sepal.Width), max(Sepal.Width), len=GS)
x <- expand.grid(x1, x2)
newdat <-data.frame(Sepal.Length=x[,1],Sepal.Width=x[,2],Petal.Length=mean(Petal.Length))
lda.Ghat <- as.numeric(predict(LDA, newdata=newdat)$class)
plot(Sepal.Length,Sepal.Width,col=Species)
contour(x1,x2,matrix(lda.Ghat,GS,GS),levels=c(1,2),add=TRUE,drawlabels=FALSE,col="red")
legend("topright",legend=c('setosa','virginica'),fill=c("black","green"))
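One possible reading of the "incorrect boundary", offered as a sketch and not as an answer from the original thread: with three inputs the LDA boundary is a plane, and the contour drawn above is only the slice of that plane at Petal.Length = mean(Petal.Length), while the plotted points have many different petal lengths. The snippet below, which uses illustrative names (iris2, fit3, share_setosa), shows how strongly the predicted classes on the same grid depend on the value at which Petal.Length is held.

```r
library(MASS)  # provides lda()

# Two-class subset; droplevels() removes the empty "versicolor" level
iris2 <- droplevels(iris[iris$Species != "versicolor", ])
fit3 <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris2)

# Grid over the two variables that appear on the plot axes
grid <- expand.grid(
  Sepal.Length = seq(min(iris2$Sepal.Length), max(iris2$Sepal.Length), length.out = 50),
  Sepal.Width  = seq(min(iris2$Sepal.Width),  max(iris2$Sepal.Width),  length.out = 50)
)

# Share of grid points predicted as "setosa" when Petal.Length is held at pl
share_setosa <- function(pl) {
  newdat <- cbind(grid, Petal.Length = pl)
  mean(predict(fit3, newdata = newdat)$class == "setosa")
}

# Compare the slice at the setosa mean with the slice at the virginica mean
pl_setosa    <- mean(iris2$Petal.Length[iris2$Species == "setosa"])
pl_virginica <- mean(iris2$Petal.Length[iris2$Species == "virginica"])
shares <- c(at_setosa_mean    = share_setosa(pl_setosa),
            at_virginica_mean = share_setosa(pl_virginica))
shares
```

If the two shares differ a lot, a single 2D contour cannot represent the boundary for all points at once.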

constructing a model on a subset of a dataframe

Consider, for example, the "iris" data frame, which ships with the base installation of R:
names(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
levels(iris$Species)
# [1] "setosa" "versicolor" "virginica"
Now I construct three models without attaching "iris":
t1=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris)
t2=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris[iris$Species=="setosa",])
t3=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris , subset = (iris$Species=="setosa"))
I expected t2 = t3 <> t1, but R says t1 = t2 <> t3. Why am I wrong?
Now I construct my models again, but this time after attaching "iris":
attach(iris)
t1=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris)
t2=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris[iris$Species=="setosa",])
t3=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris , subset = (iris$Species=="setosa"))
Now R and I both think t2 = t3 <> t1. But I'm still confused by the effect of attaching on the models: I expected the first set of models to be equivalent to the second set, but R says no. Thanks.
It's a scoping issue. If you do:
t1=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris)
t2=lm(Sepal.Length ~ Sepal.Width + Petal.Length , data=iris[iris$Species=="setosa",])
t3=lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length , data=iris , subset = (iris$Species=="setosa"))
You get the desired result.
coef(t1) == coef(t2)
(Intercept) iris$Sepal.Width iris$Petal.Length
FALSE FALSE FALSE
coef(t2) == coef(t3)
(Intercept) Sepal.Width Petal.Length
TRUE TRUE TRUE
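When you write iris$Sepal.Length, R takes that vector directly from the iris object in the calling environment, so whatever is passed to data is bypassed. That is why the original t2 was fitted on all 150 rows and matched t1. The subset argument is still applied, because it is evaluated against those same full-length vectors, which is why t3 used only the setosa rows. As mentioned in the comments, there is no need to use foo$bar when data = foo is supplied, and this situation is a good example of why not to do so.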
Two methods for conducting a linear model on a subset:
Creating the subset manually
setosa <- subset(iris, subset = Species == "setosa")
t1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=setosa)
Using the subset argument in lm()
t2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris, subset = Species == "setosa")
t1 and t2 are equivalent. However, if you use iris$ in the lm() call, R ignores what is passed to data, since you are explicitly giving full-length vectors to the function rather than naming columns of the data frame. This is an incorrect way to use lm().
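The effect can be checked by counting the observations each fit actually used. This is a small sketch with illustrative object names (t_bad, t_ok, t_sub).

```r
setosa <- iris[iris$Species == "setosa", ]

# The formula refers to iris$..., so the 50-row `data` argument is bypassed
t_bad <- lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = setosa)

# Bare column names are looked up in `data`, as intended
t_ok <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = setosa)

# `subset` is still honoured even with iris$ in the formula
t_sub <- lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length,
            data = iris, subset = iris$Species == "setosa")

nobs(t_bad)  # 150: all rows of iris
nobs(t_ok)   # 50: setosa rows only
nobs(t_sub)  # 50: setosa rows only
```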

Easily performing the same regression on different datasets

I'm performing the same regression on several different datasets (same dependent and independent variables). However, there are many independent variables, and I often want to test adding/removing different variables. I'd like to avoid making all these changes to different lines of code, just because they use different datasets. Can I instead just copy the formula that was used to create some object, and then create a new object using a different dataset? For example, something like:
fit1 <- lm(y ~ x1 + x2 + x3 + ..., data = dataset1)
fit2 <- lm(fit1$call, data = dataset2) # this doesn't work
fit3 <- lm(fit1$call, data = dataset3) # this doesn't work
This way, if I want to update numerous regressions, I just update the first one and then rerun them all.
Can this be done? Preferably without using a loop or paste().
Thanks!
Or use update
(fit <- lm(mpg ~ wt, data = mtcars))
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.285 -5.344
update(fit, data = mtcars[mtcars$hp < 100, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[mtcars$hp < 100, ])
#
# Coefficients:
# (Intercept) wt
# 39.295 -5.379
update(fit, data = mtcars[1:10, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[1:10, ])
#
# Coefficients:
# (Intercept) wt
# 33.774 -4.285
Collect your datasets into a list and then use lapply. E.g.:
dsets <- list(dataset1,dataset2,dataset3)
lapply(dsets, function(x) lm(y ~ x1 + x2, data=x) )
I'm not entirely sure this is what you want, but you can store the formula in an object and reuse it:
formula <- y ~ x1 + x2 + x3 + ...
fit1 <- lm(formula, data = dataset1)
fit2 <- lm(formula, data = dataset2)
fit3 <- lm(formula, data = dataset3)
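Closer to what the question asks for, the formula can also be pulled back out of an existing fit with formula(). fit1$call does not work for this because it stores the whole call, not just the formula. The sketch below uses mtcars split by cyl as a hypothetical stand-in for dataset1 to dataset3.

```r
# Hypothetical stand-ins for dataset1..dataset3: mtcars split by cylinder count
dsets <- split(mtcars, mtcars$cyl)

# Fit once, with the formula written in a single place
fit1 <- lm(mpg ~ wt + hp, data = dsets[[1]])

# formula() extracts the model formula from the fitted object
fits <- lapply(dsets, function(d) lm(formula(fit1), data = d))

# One column of coefficients per dataset
sapply(fits, coef)
```

Changing the formula in the fit1 line and rerunning is then enough to update every model.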

Resources