Auto VIF (variance inflation factor) for variable analysis - R

I want to calculate an "auto" VIF for variables, one vs. the others. For example, in the iris dataset, I want Sepal.Width to act as the target variable and the others as explanatory variables.
First I remove the Species column, so only the numeric variables remain. Then I want to loop over each variable and test it against the others. Finally, I want the VIF result to be stored in a list.
This is what I have tried:
library(car)
library(dplyr)
library(tidyr)

iris_clean <- iris %>%
  select(-Species)

col_names <- colnames(iris)

i <- 1
for(col in col_names) {
  regr <- lm(col ~ ., data = iris_clean)
  list[i] <- vif(regr)
  i <- i + 1
}
For some reason I get an error:
Error in model.frame.default(formula = col ~ ., data = iris_clean, drop.unused.levels = TRUE) :
variable lengths differ (found for 'Sepal.Length')
I don't understand this, because the variables all have the same length. Any help will be greatly appreciated.

You can try this. There are two problems in your code: col_names is built from iris rather than iris_clean, so it still includes Species, and col ~ . builds a formula whose response is the length-one character vector col itself, not the column it names, which is why model.frame() complains that the variable lengths differ (a 1-element response against 150-row predictors). Build the formula from the value of the column name instead:
library(car)
library(dplyr)
library(tidyr)

iris_clean <- iris %>% select(-Species)
col_names <- colnames(iris_clean)

result <- vector('list', length(col_names))
for(i in seq_along(col_names)) {
  regr <- lm(paste0(col_names[i], ' ~ .'), data = iris_clean)
  result[[i]] <- vif(regr)
}
result
#[[1]]
# Sepal.Width Petal.Length  Petal.Width
#    1.270815    15.097572    14.234335
#
#[[2]]
#Sepal.Length Petal.Length  Petal.Width
#    4.278282    19.426391    14.089441
#
#[[3]]
#Sepal.Length  Sepal.Width  Petal.Width
#    3.415733     1.305515     3.889961
#
#[[4]]
#Sepal.Length  Sepal.Width Petal.Length
#    6.256954     1.839639     7.557780
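A small optional addition (not in the original answer): name the list elements after the response of each regression so the output is self-describing.
names(result) <- col_names
result$Sepal.Width  # VIFs from the model with Sepal.Width as the response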

Related

Loop for multiple linear regression

Hi, I'm starting to use R and am stuck on analyzing my data. I have a dataframe that has 80 columns. Column 1 is the dependent variable, and columns 2 through 80 are the independent variables. I want to perform 78 multiple linear regressions, keeping the first independent variable (column 2) fixed in every model, and create a list where I can save all the regressions so I can later compare the models using AIC scores. How can I do it?
Here is my loop:
for(i in 2:80)
{
  Regressions <- lm(data.frame$column1 ~ data.frame$column2 + data.frame[,i])
}
Using the iris dataset as an example, you can do:
lapply(seq_along(iris)[-c(1:2)], function(x) lm(data = iris[, c(1:2, x)]))
[[1]]

Call:
lm(data = iris[, c(1:2, x)])

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length
      2.2491        0.5955        0.4719


[[2]]

Call:
lm(data = iris[, c(1:2, x)])

Coefficients:
(Intercept)  Sepal.Width  Petal.Width
     3.4573       0.3991       0.9721


[[3]]

Call:
lm(data = iris[, c(1:2, x)])

Coefficients:
      (Intercept)        Sepal.Width  Speciesversicolor   Speciesvirginica
           2.2514             0.8036             1.4587             1.9468
This works because, when you pass a dataframe to lm() without a formula, it applies the function DF2formula() under the hood, which treats the first column as the response and all other columns as predictors.
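As a minimal check of this behavior, the two calls below fit the same model:
# DF2formula() turns the data frame into Sepal.Length ~ Sepal.Width
m1 <- lm(data = iris[, c("Sepal.Length", "Sepal.Width")])
m2 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
all.equal(coef(m1), coef(m2))
#[1] TRUE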
With a for loop, we can initialize a list to store the output. Note that nm1 should hold the varying predictors in columns 3 to 80; column 2 stays fixed in every model, giving the 78 regressions you asked for:
nm1 <- names(df1)[3:80]
Regressions <- vector('list', length(nm1))
for(i in seq_along(Regressions)) {
  Regressions[[i]] <- lm(reformulate(c("column2", nm1[i]), "column1"), data = df1)
}
Or use paste0() instead of reformulate():
for(i in seq_along(Regressions)) {
  Regressions[[i]] <- lm(as.formula(paste0("column1 ~ column2 + ", nm1[i])), data = df1)
}
Using a reproducible example:
nm2 <- names(iris)[3:5]
Regressions2 <- vector('list', length(nm2))
for(i in seq_along(Regressions2)) {
  Regressions2[[i]] <- lm(reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"), data = iris)
}
Regressions2[[1]]
#Call:
#lm(formula = reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"),
#    data = iris)
#
#Coefficients:
# (Intercept)   Sepal.Width  Petal.Length
#      2.2491        0.5955        0.4719
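Since the original goal was to compare the models with AIC, you can collect the scores from the list afterwards; sapply() and AIC() are base R:
names(Regressions2) <- nm2   # label each model by its varying predictor
sapply(Regressions2, AIC)    # named numeric vector of AIC values, lower is better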

Extracting a list of R2 from within lm() based on variable in multiple regression in R

I've performed a multiple regression analysis on a dataset in R using lm(), and I am able to extract the coefficients for each day of year using the function below. I would also like to extract the R2 for each day of year, but this doesn't seem to work in the same way.
This is pretty much the same question as:
Print R-squared for all of the models fit with lmList
but when I try this I get 'Error: $ operator is invalid for atomic vectors'. I would also like to include it in the same function if possible. How can I extract the R2 for each doy in this way?
#Create MR function for extracting coefficients
getCoef <- function(df) {
  coefs <- lm(y ~ T + P + L + T * L + P * L, data = df)$coef
  names(coefs) <- c("intercept", "T", "P", "L", "T_L", "P_L")
  coefs
}

#Extract coefficients for each doy
coefs.MR_uM <- ddply(MR_uM, ~ doy, getCoef)
The point is that r.squared is stored in summary(lm(...)), not in lm(...). Here is another version of your function that also extracts R2:
library(plyr)

df <- iris

#Create MR function for extracting coefficients and R2
getCoef <- function(df) {
  model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = df)
  coefs <- model$coef
  names(coefs) <- c("intercept", "Sepal.Width", "Petal.Length", "Petal.Width")
  R2 <- summary(model)$r.squared
  names(R2) <- c("R2")
  c(coefs, R2)
}

#Extract coefficients and R2 for each Species
coefs.MR_uM <- ddply(df, ~ Species, getCoef)
coefs.MR_uM # output
#     Species intercept Sepal.Width Petal.Length Petal.Width        R2
#1     setosa  2.351890   0.6548350    0.2375602   0.2521257 0.5751375
#2 versicolor  1.895540   0.3868576    0.9083370  -0.6792238 0.6050314
#3  virginica  0.699883   0.3303370    0.9455356  -0.1697527 0.7652193
As suggested by Parfait, you don't need plyr::ddply(); you can use do.call(rbind, by(df, df$Species, getCoef)) instead.
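For reference, here is that base R alternative written out, reusing the getCoef() defined above; by() splits df by Species and do.call(rbind, ...) stacks the per-species vectors into a matrix:
res <- do.call(rbind, by(df, df$Species, getCoef))
res  # one row per species, columns are the coefficients plus R2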
Hope this helps!

Comparing percent change of model coefficients

I am working through step 3 of purposeful model-building from Hosmer-Lemeshow, which suggests comparing the percent change in coefficients between a full model [Iris.mod1] and a reduced model [Iris.mod2]. I would like to automate this step if possible.
Right now I have the following code:
#Make Species a binomial DV
iris = subset(iris, iris$Species != 'virginica')
iris$Species = as.numeric(ifelse(iris$Species == 'setosa', 1, 0))

#Build models
Iris.mod1 = glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris, family = binomial())
Iris.mod2 = glm(Species ~ Sepal.Length + Petal.Length, data = iris, family = binomial())
The dataset I am actually using has about 93 variables and 1.7 million rows. But I am using the iris data just for this example.
#Try to see if any coefficients changed by > 20%
paste(names(which((summary(Iris.mod1)$coefficients[2:
(nrow(summary(Iris.mod1)$coefficients)),1] -
(summary(Iris.mod2)$coefficients[2:
(nrow(summary(Iris.mod2)$coefficients)),1]/
(summary(Iris.mod1)$coefficients[2:nrow(summary(Iris.mod1)$coefficients)),1]
> 0.2 == TRUE)))))
However, this code is full of errors, and I am lost in a sea of parentheses.
Is there an efficient way to determine which variables' coefficients changed by more than 20%?
Thank you in advance.
The broom package is really nice for making data frames of model coefficients and terms. We can use it to get things into a workable format:
library(broom)
m_list = list(m1 = Iris.mod1, m2 = Iris.mod2)
t_list = lapply(m_list, tidy)

library(dplyr)
library(tidyr)

bind_rows(t_list, .id = "mod") %>%
  select(term, estimate, mod) %>%
  spread(key = mod, value = estimate) %>%
  mutate(p_change = (m2 - m1) / m1 * 100,
         p_change_gt_20 = p_change > 20)
#          term         m1        m2 p_change p_change_gt_20
#1  (Intercept)  -6.556265 -65.84266 904.2709           TRUE
#2 Petal.Length -19.053588 -49.04616 157.4117           TRUE
#3  Petal.Width -25.032928        NA       NA             NA
#4 Sepal.Length   9.878866  37.56141 280.2199           TRUE
#5  Sepal.Width   7.417640        NA       NA             NA
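If you just want the names of the terms that changed by more than 20%, store the comparison table and filter it. This is a small sketch continuing the pipeline above (the name changed is only illustrative):
changed <- bind_rows(t_list, .id = "mod") %>%
  select(term, estimate, mod) %>%
  spread(key = mod, value = estimate) %>%
  mutate(p_change = (m2 - m1) / m1 * 100)

changed %>%
  filter(abs(p_change) > 20) %>%  # abs() also flags large negative changes
  pull(term)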

Constructing a model on a subset of a dataframe

Consider, for example, the "iris" dataframe, which is included with the main setup of R:
names(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
levels(iris$Species)
# [1] "setosa" "versicolor" "virginica"
Now I construct three models without attaching "iris":
t1 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris)
t2 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris[iris$Species == "setosa",])
t3 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris, subset = (iris$Species == "setosa"))
Now I think t2 = t3 ≠ t1, but R says t1 = t2 ≠ t3. Why am I wrong?
Now I construct my models again, but this time after attaching "iris":
attach(iris)
t1 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
t2 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris[iris$Species == "setosa",])
t3 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris, subset = (iris$Species == "setosa"))
Now R and I both get t2 = t3 ≠ t1, but I'm still confused by the effect of attaching on the models! I thought the first set of models was equivalent to the second set, but R says no. Thanks.
It's a scoping issue. If you do:
t1 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris)
t2 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris[iris$Species == "setosa",])
t3 = lm(iris$Sepal.Length ~ iris$Sepal.Width + iris$Petal.Length, data = iris, subset = (iris$Species == "setosa"))
You get the desired result.
coef(t1) == coef(t2)
#      (Intercept) iris$Sepal.Width iris$Petal.Length
#            FALSE            FALSE             FALSE
coef(t2) == coef(t3)
# (Intercept) Sepal.Width Petal.Length
#        TRUE        TRUE         TRUE
When you write iris$Sepal.Length, R already knows where to look for that value, so the subset argument is redundant and R ignores it. As mentioned in the comments, there is no need to use foo$bar when data = foo is supplied, and this situation is a good example of why not to do so.
Two methods for fitting a linear model on a subset:
Creating the subset manually
setosa <- subset(iris, subset = Species == "setosa")
t1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=setosa)
Using the subset argument in lm()
t2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data=iris, subset = Species == "setosa")
t1 and t2 are equivalent. However, if you use iris$ inside the lm() call, R ignores what is passed to data (and possibly subset), since you are explicitly giving the vectors to the function rather than referring to columns of the dataframe. This is an incorrect way to use lm().
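As a quick sanity check on the two models above:
all.equal(coef(t1), coef(t2))
#[1] TRUE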

Using neural networks (neuralnet) in R to predict factor values

I am using the neuralnet package, with several inputs to predict an output.
Originally, my output is a factor variable, and I saw the error:
Error in neurons[[i]] %*% weights[[i]] :
  requires numeric/complex matrix/vector arguments
When I converted the output to a numeric variable, the error disappeared. Is there a way to fit a neural network with a factor output?
I adapted code that I found at this site, which uses the iris dataset with the neuralnet package to predict iris species from the morphological data.
Without a reproducible example, I'm not sure whether this applies to your case. The key here was to convert each level of the factor response into its own binary variable. Prediction also works a bit differently than with other models in R: you choose the factor level with the highest score.
Example code:
library(neuralnet)

# Make training and validation data
set.seed(1)
train <- sample(nrow(iris), nrow(iris)*0.5)
valid <- seq(nrow(iris))[-train]
iristrain <- iris[train,]
irisvalid <- iris[valid,]

# Binarize the categorical output
iristrain <- cbind(iristrain, iristrain$Species == 'setosa')
iristrain <- cbind(iristrain, iristrain$Species == 'versicolor')
iristrain <- cbind(iristrain, iristrain$Species == 'virginica')
names(iristrain)[6:8] <- c('setosa', 'versicolor', 'virginica')

# Fit model
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain,
  hidden = c(3)
)
plot(nn)

# Predict: pick the class with the highest output score
comp <- compute(nn, irisvalid[-5])
pred.weights <- comp$net.result
idx <- apply(pred.weights, 1, which.max)
pred <- c('setosa', 'versicolor', 'virginica')[idx]
table(pred, irisvalid$Species)
#pred         setosa versicolor virginica
#  setosa         23          0         0
#  versicolor      1         21         7
#  virginica       0          1        22
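If you want to reduce the confusion matrix to a single number, the overall validation accuracy is:
mean(pred == irisvalid$Species)  # proportion of correctly classified rows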
The neuralnet() call above may fail to converge and raise warnings. If it does, refit it with progress reporting turned on:
nn <- neuralnet(
  setosa + versicolor + virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iristrain, hidden = 3, lifesign = "full")
If the prediction step
comp <- compute(nn, irisvalid[-5])
does not work (for example, because compute() is masked by another loaded package such as dplyr), use the namespaced call instead:
comp <- neuralnet::compute(nn, irisvalid[, 1:4])
