I'm trying to compare different linear models to determine whether one is better than another. I have several models, so I want to create a list of models and then refer to them by index. Is that possible?
Models <- list(lm(y~a),lm(y~b),lm(y~c))
Models2 <- list(lm(y~a+b),lm(y~a+c),lm(y~b+c))
anova(Models2[1],Models[1])
Thank you for your help!
If you have two lists of models, and you want to compare each pair of models, then you want Map:
models1 <- list(lm(y ~ a), lm(y ~ b), lm(y ~ c))
models2 <- list(lm(y ~ a + b), lm(y ~ a + c), lm(y ~ b + c))
Map(anova, models1, models2)
This is basically equivalent to the following for loop:
out <- vector("list", length(models1))
for (i in seq_along(out)) {
out[[i]] <- anova(models1[[i]], models2[[i]])
}
Map is an example of a functional, and you can find out more about them at https://github.com/hadley/devtools/wiki/Functionals
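As a self-contained illustration of the Map approach, here is a minimal sketch. The built-in mtcars data, the formulas and the list names a and b are placeholders chosen only for the example; they are not from the question.

```r
# Two lists of nested models; element i of models1 is nested in element i of models2
models1 <- list(a = lm(mpg ~ wt, data = mtcars),
                b = lm(mpg ~ hp, data = mtcars))
models2 <- list(a = lm(mpg ~ wt + qsec, data = mtcars),
                b = lm(mpg ~ hp + qsec, data = mtcars))

# Map() calls anova(models1[[i]], models2[[i]]) for each i and keeps the list names
comparisons <- Map(anova, models1, models2)

# Pull the p-value of each pairwise F test (second row of each anova table)
p_values <- vapply(comparisons, function(tab) tab[["Pr(>F)"]][2], numeric(1))
print(p_values)
```

Naming the list elements is optional, but it makes the result easier to read because Map carries the names of the first list over to the output.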
You can use do.call to convert a list of any length into a call suitable for a function taking .... The only trick here is that anova expects the first model to be named--that's what the Curry handles by creating a new function which already has its first argument specified.
Put everything except the first model (call it lm1) into one list called Models.
Then:
library(functional)
do.call( Curry(anova, object=lm1), Models )
Example:
> Models <- list( lm(runif(10)~rnorm(10)),lm(runif(10)~rnorm(10)),lm(runif(10)~rnorm(10)) )
> lm1 <- lm(runif(10)~rnorm(10))
> do.call( Curry(anova, object=lm1), Models )
Analysis of Variance Table
Model 1: runif(10) ~ rnorm(10)
Model 2: runif(10) ~ rnorm(10)
Model 3: runif(10) ~ rnorm(10)
Model 4: runif(10) ~ rnorm(10)
  Res.Df     RSS Df Sum of Sq F Pr(>F)
1      8 0.46614
2      8 0.59522  0  -0.12908
3      8 1.00869  0  -0.41346
4      8 0.81686  0   0.19182
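If the functional package is not installed, a similar call can be sketched in base R by prepending the first model to the list before handing it to do.call. This is an illustrative alternative rather than the method above; it assumes anova() accepts the first model positionally, and the mtcars data and formulas are placeholders.

```r
# Nested models on the built-in mtcars data (chosen only for illustration)
lm1 <- lm(mpg ~ wt, data = mtcars)
Models <- list(
  lm(mpg ~ wt + hp, data = mtcars),
  lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# Prepend lm1 so that it is matched positionally to anova()'s first argument
tab <- do.call(anova, c(list(lm1), Models))
print(tab)
```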
x <- rnorm(100,0,1)
y <- rnorm(100,5,2)
z <- rnorm(100,8,1)
models <- list(y.x = lm(y~x), y.z = lm(y~z))
anova(models[[1]],models[[2]])
This worked for me.
I want to write a function combinations_features(y, x) which goes through all combinations of three variables and outputs the R squared, adjusted R squared, AIC and BIC for each combination.
My solution
combinations_features <- function(y, x) {
# Define empty vectors to store statistics
feature_vec_1 <- feature_vec_2 <-
feature_vec_3 <- feature_vec_4 <- c()
# Obtaining all combinations containing three variables
comb_names <- utils::combn(colnames(x), 3)
# For each combination obtain wanted statistics
for (i in 1:ncol(comb_names)) {
feature_vec_1 <- append(
feature_vec_1, summary(lm(y ~ ., data = x[, comb_names[, i]]))$adj.r.squared
)
feature_vec_2 <- append(
feature_vec_2, summary(lm(y ~ ., data = x[, comb_names[, i]]))$r.squared
)
feature_vec_3 <- append(
feature_vec_3, AIC(lm(y ~ ., data = x[, comb_names[, i]]))
)
feature_vec_4 <- append(
feature_vec_4, BIC(lm(y ~ ., data = x[, comb_names[, i]]))
)
}
# Assign everything into data frame
data.frame(
"Adj R2" = feature_vec_1, "R2" = feature_vec_2,
"AIC" = feature_vec_3, "BIC" = feature_vec_4
)
}
Let's see how it works - define some artificial data and give it to the function.
set.seed(42)
predictors <- data.frame(rnorm(100), runif(100), rexp(100), rpois(100, 1))
dependent <- rnorm(100)
> combinations_features(dependent, predictors)
Adj.R2 R2 AIC BIC
1 -0.0283756015 0.002787295 276.2726 289.2985
2 0.0000677269 0.030368705 273.4678 286.4937
3 -0.0011990695 0.029140296 273.5944 286.6203
4 0.0015404392 0.031796789 273.3204 286.3463
However, I find this code very inefficient for two reasons:
(1) Loop - I loop over the columns of the matrix comb_names; I wonder if this can be avoided somehow.
(2) Length of the code - This code is huge, because I define a feature_vec for each statistic and append to each one separately. I wonder if the assignments can be done with a single command.
Could you please give me a hand with improving my code by telling me whether (1) or (2) is possible?
How about this, which relies on bind_rows() from dplyr (part of the tidyverse)? I don't think there's a way to avoid looping over the combinations, but lapply makes everything a little neater, IMHO.
library(dplyr) # provides bind_rows() and tibble()
combinations_features1 <- function(y, x) {
comb_names <- utils::combn(colnames(x), 3)
bind_rows(
lapply(
1:ncol(comb_names),
function(z) {
m <- lm(y ~ ., data = x[, comb_names[,z]])
s <- summary(m)
tibble(Adj.R2=s$adj.r.squared, R2=s$r.squared, AIC=AIC(m), BIC=BIC(m))
}
)
)
}
combinations_features1(dependent, predictors)
# A tibble: 4 x 4
Adj.R2 R2 AIC BIC
<dbl> <dbl> <dbl> <dbl>
1 -0.0284 0.00279 276. 289.
2 0.0000677 0.0304 273. 286.
3 -0.00120 0.0291 274. 287.
4 0.00154 0.0318 273. 286.
bind_rows(), if given a list, binds the elements of the list into a single data.frame.
Same idea as above, just directly applying lapply to the list of combinations would also work:
combinations_features <- function(y,x){
do.call(rbind, lapply(utils::combn(colnames(x), 3, simplify=FALSE),
function(i){
f1 <- lm(y ~ ., data=x[, i])
data.frame(Adj.R2=summary(f1)$adj.r.squared,
R2=summary(f1)$r.squared,
AIC=AIC(f1), BIC=BIC(f1))
}))
}
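For completeness, here is a base-R sketch of the same idea that fits each model once and collects the four statistics with vapply. The function name combinations_stats, the argument k and the column names a to d are made up for this example; the data mirror the artificial data from the question.

```r
combinations_stats <- function(y, x, k = 3) {
  # All k-variable combinations, as a list of character vectors
  combos <- utils::combn(colnames(x), k, simplify = FALSE)
  # Fit each model once and return the four statistics as a named vector
  stats <- vapply(combos, function(vars) {
    m <- lm(y ~ ., data = x[, vars, drop = FALSE])
    s <- summary(m)
    c(Adj.R2 = s$adj.r.squared, R2 = s$r.squared, AIC = AIC(m), BIC = BIC(m))
  }, numeric(4))
  # vapply returns one column per combination, so transpose to one row each
  out <- as.data.frame(t(stats))
  rownames(out) <- vapply(combos, paste, character(1), collapse = " + ")
  out
}

set.seed(42)
predictors <- data.frame(a = rnorm(100), b = runif(100),
                         c = rexp(100), d = rpois(100, 1))
dependent <- rnorm(100)
result <- combinations_stats(dependent, predictors)
print(result)
```

Labelling the rows with the variable names makes it possible to tell which combination each row of statistics belongs to.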
I have a linear model with lots of explanatory (independent) variables
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the machine to find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula, and then rerun lm with the new formula
model <- lm(y ~ x1 + x3 + ... + x100)
I have already learned how to retrieve the name of the explanatory variable with the highest VIF:
max_vif <- function(x) {
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
But I still don't understand how to find the relevant explanatory variable, delete it from the formula, and run the function again.
We can use the update function and paste in the column that needs to be removed. We first can fit a model, and then use update to change that model's formula. The model formula can be expressed as a character string, which allows you to concatenate the general formula .~. and whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x){
# find variable(s) needing to be removed
rm_var <- max_vif(x)
# concatenate with "-" to remove variable(s) from formula
rm_var <- paste(paste0("-", rm_var), collapse = " ")
# update model
update(x, paste0(".~.", rm_var))
}
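To repeat the removal until every VIF falls below a threshold, the update step can be placed in a loop. The sketch below is only an illustration under some assumptions: vif_base is a hand-rolled helper (not car::vif) that computes each VIF as 1 / (1 - R^2) from regressing one predictor on the others, all predictors are assumed numeric so that model-matrix column names match the formula terms, and the mtcars model and the threshold of 5 are placeholders.

```r
# Hypothetical helper: VIF of each predictor, computed without extra packages
vif_base <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  vapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Hypothetical wrapper: drop the worst predictor until all VIFs are below threshold
drop_high_vif <- function(fit, threshold = 5) {
  repeat {
    v <- vif_base(fit)
    if (length(v) < 2 || max(v) < threshold) return(fit)
    worst <- names(v)[which.max(v)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
}

fit <- lm(mpg ~ disp + hp + wt + cyl + drat, data = mtcars)
reduced <- drop_high_vif(fit)
print(formula(reduced))
print(vif_base(reduced))
```

Removing one variable per iteration matters because the VIFs of the remaining predictors change after every refit.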
Problem solved!
I created a list containing all the variables for the lm model:
Price <- list(y,x1,...,x100)
Then I used a different way of specifying the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from the Price list.
With the function I already came up with, the code will be:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # Function for finding name of variable with the highest VIF
vifac <- data.frame(vif(x))
nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
return(nameofmax)
}
n <- max(data.frame(vif(model)))
while(n >= 5) { # Drop the variable with the highest VIF from `Price` one at a time, until no VIF is equal to or higher than 5
m <- max_vif(model) # name of the variable to drop; assumes the elements of `Price` are named
Price[[m]] <- NULL
model <- lm(y ~ ., data = Price) # refit so the next VIF check uses the reduced model
n <- max(data.frame(vif(model)))
}
Suppose there is a data.frame foo_data_frame and one wants to regress the target column Y on some of the other columns. Usually a formula and a model are used for that purpose. For example:
linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)
That does the job well if the formula is coded statically. If one wants to iterate over several models with a fixed number of explanatory variables (say, 2), it can be done like this:
for (i in seq_len(factor_number)) {
for (j in seq(i + 1, factor_number)) {
linear_model <- lm(Y ~ F1 + F2, list(Y=foo_data_frame$Y,
F1=foo_data_frame[[i]],
F2=foo_data_frame[[j]]))
# linear_model further analyzing...
}
}
My question is how to achieve the same effect when the number of variables changes dynamically while the program is running?
for (number_of_factors in seq_len(5)) {
# Then iterate over subsets of cardinality number_of_factors.
for (factors_subset in all_subsets_with_fixed_cardinality) {
# Here I want to fit model with factors from factors_subset.
linear_model <- lm(Does R provide smth to write here?)
}
}
See ?as.formula, e.g.:
factors <- c("factor1", "factor2")
as.formula(paste("y~", paste(factors, collapse="+")))
# y ~ factor1 + factor2
where factors is a character vector containing the names of the factors you want to use in the model. This you can paste into an lm model, e.g.:
set.seed(0)
y <- rnorm(100)
factor1 <- rep(1:2, each=50)
factor2 <- rep(3:4, 50)
lm(as.formula(paste("y~", paste(factors, collapse="+"))))
# Call:
# lm(formula = as.formula(paste("y~", paste(factors, collapse = "+"))))
# Coefficients:
# (Intercept) factor1 factor2
# 0.542471 -0.002525 -0.147433
An oft forgotten function is reformulate. From ?reformulate:
reformulate creates a formula from a character vector.
A simple example:
listoffactors <- c("factor1","factor2")
reformulate(termlabels = listoffactors, response = 'y')
will yield this formula:
y ~ factor1 + factor2
Although not explicitly documented, you can also add interaction terms:
listofintfactors <- c("(factor3","factor4)^2")
reformulate(termlabels = c(listoffactors, listofintfactors),
response = 'y')
will yield:
y ~ factor1 + factor2 + (factor3 + factor4)^2
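Putting this together with the original question, a loop over all subsets of every size can be sketched with combn and reformulate. The mtcars data, the response mpg and the candidate predictors below are placeholders for illustration, not part of the question.

```r
# Candidate predictors and response (placeholders for illustration)
predictors <- c("wt", "hp", "qsec")
fits <- list()

for (k in seq_along(predictors)) {
  # All subsets of size k, each a character vector of column names
  for (vars in combn(predictors, k, simplify = FALSE)) {
    f <- reformulate(vars, response = "mpg")  # e.g. mpg ~ wt + hp
    fits[[paste(vars, collapse = "+")]] <- lm(f, data = mtcars)
  }
}

print(names(fits))
```

Storing the fitted models in a named list keeps each model next to the subset that produced it, which is convenient for later comparison.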
Another option could be to use a matrix in the formula:
Y = rnorm(10)
foo = matrix(rnorm(100),10,10)
factors=c(1,5,8)
lm(Y ~ foo[,factors])
You don't actually need a formula. This works:
lm(data_frame[c("Y", "factor1", "factor2")])
as does this:
v <- c("Y", "factor1", "factor2")
do.call("lm", list(bquote(data_frame[.(v)])))
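A quick way to check what the formula-free call does is to compare it with the explicit formula. In this sketch the mtcars data and column names are placeholders; the point illustrated is that the first selected column is treated as the response and the remaining columns as predictors.

```r
# Select the response first, then the predictors (mtcars is only an example)
v <- c("mpg", "wt", "hp")
fit_df <- lm(mtcars[v])                         # data frame passed in place of a formula
fit_formula <- lm(mpg ~ wt + hp, data = mtcars) # the explicit equivalent

print(coef(fit_df))
print(all.equal(unname(coef(fit_df)), unname(coef(fit_formula))))
```

Because the column order determines the response, the selection vector has to list the response column first.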
I generally solve this by changing the name of my response column. It is easier to do dynamically, and possibly cleaner.
model_response <- "response_field_name"
setnames(model_data_train, c(model_response), "response") #if using data.table
model_gbm <- gbm(response ~ ., data=model_data_train, ...)
I am trying to refit a full model of class merMod with just the intercept (the null model). However, refitting using update.merMod gives a different answer than fitting the null model by hand, e.g.:
# Generate random data
set.seed(9)
dat <- data.frame(
x = do.call(c, lapply(1:5, function(x) rnorm(100, x))),
random = letters[1:5]
)
dat$y = rnbinom(500, mu = exp(dat$x), size = 1)
library(lme4)
# Get full model
full <- glmer.nb(y ~ x + (1 | random), dat)
# Write out intercept-only model by hand
null <- glmer.nb(y ~ 1 + (1 | random), dat)
# Update
null2 <- update(full, . ~ 1 -. + (1 | random))
VarCorr(null)
VarCorr(null2)
Any idea why this is, and how I can use update to get the same vcov matrix?