Rank a list of models based on AIC values - r

After fitting a model between one response variable and each of several explanatory variables in a data frame, I would like to rank the models by AIC score.
I found a very similar question, "Using lapply on a list of models", that does exactly what I want, but its solution does not seem to work for me and I'm not sure why. Here's an example using the mtcars dataset:
lm_multiple <- lapply(mtcars[,-1], function(x) summary(lm(mtcars$mpg ~ x)))
An approved answer from the link above suggested:
sapply(X = lm_multiple, FUN = AIC)
But this does not work for me; I get this error message:
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "summary.lm"
Here is an answer from the original question...
x <- seq(1:10)
y <- sin(x)^2
model.list <- list(model1 = lm(y ~ x),
model2 = lm(y ~ x + I(x^2) + I(x^3)))
sapply(X = model.list, FUN = AIC)

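AIC() needs the fitted lm object, not its summary: a "summary.lm" object has no logLik method, which is what the error says. Remove the summary() call, like this: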
lm_multiple <- lapply(mtcars[,-1], function(x) lm(mtcars$mpg ~ x))
sapply(X = lm_multiple, FUN = AIC)
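To get from the AIC values to an actual ranking, one option is to sort them. The following is a minimal sketch, not part of the original answer; the names fits, aic and ranking are illustrative, and the delta column (difference from the best AIC) is an added convenience.

```r
# One single-predictor model of mpg per remaining column of mtcars
fits <- lapply(mtcars[, -1], function(x) lm(mtcars$mpg ~ x))

# AIC for each fitted model (lm objects, not summaries)
aic <- sapply(fits, AIC)

# Rank: lowest AIC first, with the gap to the best model
ranking <- data.frame(
  predictor = names(aic),
  AIC       = unname(aic),
  delta     = unname(aic - min(aic))
)
ranking <- ranking[order(ranking$AIC), ]
rownames(ranking) <- NULL
print(ranking)
```

Because every model here has the same response and the same number of parameters, the AIC ordering matches the ordering by residual sum of squares.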

Related

How to transpose result of colmeans in R to pass into linear regression function?

I have a model in R:
lm(formula = Y ~ rowMeans(df[, c(10:14)]), data=df)
I want to fit this against the means of columns 10:14 so that my output has 5 rows to predict the Y values.
I'm calculating the column means as follows:
t(as.data.frame.list(colMeans(df[, c(10:14)], na.rm=TRUE)))
This is the correct output. However, when passing this into my linear regression predict() function, I'm not receiving 5 rows as expected.
I'm trying:
fit <- lm(formula = Y ~ rowMeans(df[, c(10:14)]), data=df)
predict(fit, newdata = list(t(as.data.frame.list(colMeans(df[, c(10:14)],
na.rm = TRUE)))))
This produces an output of 300+ values...
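Do not put complicated expressions in a model formula. predict() matches the columns of newdata to the variables in the formula by name, so give the predictor a name of its own:
df$x <- rowMeans(df[, 10:14]) ## create variable 'x'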
fit <- lm(Y ~ x, data = df)
## provide variable 'x'
newdf <- data.frame(x = colMeans(df[, 10:14], na.rm = TRUE))
predict(fit, newdf)
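The difference between the two approaches can be checked on a built-in dataset. This is a sketch under assumptions: mtcars stands in for df, mpg for Y, and columns 3:5 for columns 10:14, since the original data are not shown.

```r
df   <- mtcars   # stand-in data (assumption)
cols <- 3:5      # stand-in for columns 10:14 (assumption)

# Row means computed inside the formula: newdata has no variable that
# matches the expression, so it is re-evaluated on the full df
fit_inline  <- lm(mpg ~ rowMeans(df[, cols]), data = df)
pred_inline <- suppressWarnings(
  predict(fit_inline, newdata = data.frame(x = colMeans(df[, cols])))
)
length(pred_inline)  # one value per row of df, not per row of newdata

# Named predictor: newdata is matched by column name
df$x  <- rowMeans(df[, cols])
fit   <- lm(mpg ~ x, data = df)
newdf <- data.frame(x = colMeans(df[, cols], na.rm = TRUE))
pred  <- predict(fit, newdf)
length(pred)         # one value per column mean
```

Without suppressWarnings(), the first predict() call warns that newdata and the variables found have different numbers of rows, which is the symptom described in the question.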

Iteratively adding variables to an lm() function in R?

Simple question, but I'm finding myself boggled.
I'm looking to write a loop that keeps adding variables to the IV (right-hand) side of an lm() call. I would test the results of each fit until a condition is met. I'm just having trouble finding a way to dynamically add variables to the IV part of the regression, one at a time.
The 1st iteration would look like:
lm(Y ~ X, data = data)
The second iteration like:
lm(Y ~ X + X2, data = data)
The third iteration like:
lm(Y ~ X + X2 + X3, data = data)
And so on...
If any of you could point me in the right direction, I'd appreciate it very much.
Thanks!
An alternative way is to use Y ~ . as the formula and provide the subset of data as required. Here, . means "all columns not otherwise in the formula" (see ?formula). Using mtcars as an example:
Y <- 'mpg'
Xs <- names(mtcars)[-1]
fits <- lapply(seq_along(Xs), function(x){
lm(paste(Y, '~ .'), data = mtcars[, c(Y, Xs[1:x])])
})
We can use reformulate to build the formula, passing the sets of independent variables as a list:
out <- lapply(list("X", c("X", "X2"), c("X", "X2", "X3")),
function(x) lm(reformulate(x, response = "Y"), data = data))
Or automate it, building the cumulative sets of predictors from the column names:
Xs <- setdiff(names(data), "Y")
ind <- sequence(seq_along(Xs))
lapply(split(Xs[ind], cumsum(ind == 1)), function(x)
lm(reformulate(x, response = "Y"), data = data))
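The question also asks to stop once a condition is met, which the list-based answers above leave to the reader. A minimal sketch of such a loop follows; the stopping rule (adjusted R-squared of at least 0.80) and the use of mtcars are assumptions chosen only for illustration.

```r
response   <- "mpg"
predictors <- setdiff(names(mtcars), response)
threshold  <- 0.80   # hypothetical stopping rule on adjusted R-squared

fit <- NULL
for (k in seq_along(predictors)) {
  # Formula with the first k predictors: mpg ~ cyl, mpg ~ cyl + disp, ...
  fml <- reformulate(predictors[seq_len(k)], response = response)
  fit <- lm(fml, data = mtcars)
  if (summary(fit)$adj.r.squared >= threshold) break
}

# Predictors in the model at the point the loop stopped
used <- attr(terms(fit), "term.labels")
used
```

If the condition is never met, the loop simply ends with the model that contains every predictor.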

Potential bug in R's `polr` function when run from a function environment?

I may have found some sort of bug in the polr function (ordinal / polytomous regression) of the MASS library in R. The problem seems to be related to the use of coef() on the summary object, but maybe it is not.
The problem occurs in a function of type:
pol_me <- function(d){
some_x <- d[,1]
mod <- polr(some_x ~ d[,2])
pol_sum <- summary(mod)
return(pol_sum)
}
To illustrate, I simulate some data for an ordinal regression model.
set.seed(2016)
n=1000
x1 <- rnorm(n)
x2 <- 2*x1 +rnorm(n)
make_ord <- function(y){
y_ord <- y
y_ord[y < (-1)] <- 1
y_ord[y >= (-1) & y <= (1)] <- 2
y_ord[y >= 1] <- 3
y_ord <- as.factor(y_ord)
}
x1 <- make_ord(x1)
dat <- data.frame(x1,x2)
When we now call the function:
library(MASS)
pol_me(d = dat)
We get the error:
Error in eval(expr, envir, enclos) : object 'some_x' not found
I do not think this should logically happen at this point. In fact, when we define an alternative function in which the model command is replaced by a linear model lm on a numerical dependent variable, i.e.
mod <- lm(as.numeric(some_x) ~ d[,2])
The resulting function works fine.
Is this really a bug or a programming problem in my code and how can I get pol_me to run?
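Calling summary(polr(dat[,1] ~ dat[,2])) prints the message "Re-fitting to get Hessian", and that refit is the cause of the error: summary() has to fit the model again, and the refit apparently re-evaluates the original call outside pol_me, where some_x does not exist. Setting polr's argument Hess = TRUE stores the Hessian in the fit, so no refit is needed and the problem goes away. (?polr says Hess: logical for whether the Hessian (the observed information matrix) should be returned. Use this if you intend to call summary or vcov on the fit.)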
pol_me <- function(d){
some_x <- d[,1]
mod <- polr(some_x ~ d[,2], Hess = TRUE) # store the Hessian so summary() need not refit
pol_sum <- summary(mod)
return(pol_sum)
}
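For an end-to-end check, the fix can be run on simulated data similar to the question's. This is a sketch, assuming the MASS package is available; it swaps the question's make_ord helper for cut(), which bins at the same cut points, and the name latent is illustrative.

```r
library(MASS)

set.seed(2016)
n      <- 1000
latent <- rnorm(n)
x2     <- 2 * latent + rnorm(n)

# Three ordered categories with cut points at -1 and 1
# (cut() replaces the question's make_ord helper; an assumption)
x1  <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
           labels = c("1", "2", "3"), ordered_result = TRUE)
dat <- data.frame(x1, x2)

pol_me <- function(d) {
  some_x <- d[, 1]
  mod <- polr(some_x ~ d[, 2], Hess = TRUE)  # Hessian stored at fit time
  summary(mod)                               # no refit needed
}

res <- pol_me(dat)
coef(res)   # one slope row plus two intercept (cut point) rows
```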

predict method for felm from lfe package

Does anyone have a nice clean way to get predict behavior for felm models?
library(lfe)
model1 <- lm(data = iris, Sepal.Length ~ Sepal.Width + Species)
predict(model1, newdata = data.frame(Sepal.Width = 3, Species = "virginica"))
# Works
model2 <- felm(data = iris, Sepal.Length ~ Sepal.Width | Species)
predict(model2, newdata = data.frame(Sepal.Width = 3, Species = "virginica"))
# Does not work
UPDATE (2020-04-02): The answer from Grant below using the new package fixest provides a more parsimonious solution.
As a workaround, you could combine felm, getfe, and demeanlist as follows:
library(lfe)
lm.model <- lm(data=demeanlist(iris[, 1:2], list(iris$Species)), Sepal.Length ~ Sepal.Width)
fe <- getfe(felm(data = iris, Sepal.Length ~ Sepal.Width | Species))
predict(lm.model, newdata = data.frame(Sepal.Width = 3)) + fe$effect[fe$idx=="virginica"]
The idea is that you use demeanlist to center the variables, then lm to estimate the coefficient on Sepal.Width using the centered variables, giving you an lm object over which you can run predict. Then run felm+getfe to get the conditional mean for the fixed effect, and add that to the output of predict.
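The same idea can be illustrated in base R alone. The sketch below is not lfe's implementation: it demeans with ave() instead of demeanlist, recovers the group intercepts by hand instead of calling getfe, and then checks the result against an ordinary lm with Species dummies. All object names are illustrative.

```r
# Within transformation: subtract the Species mean from each variable
y_dm <- iris$Sepal.Length - ave(iris$Sepal.Length, iris$Species)
x_dm <- iris$Sepal.Width  - ave(iris$Sepal.Width,  iris$Species)

# Slope on Sepal.Width from the demeaned data (no intercept needed)
slope <- coef(lm(y_dm ~ x_dm - 1))[["x_dm"]]

# Group intercepts: mean of what the slope leaves unexplained, per Species
alpha <- tapply(iris$Sepal.Length - slope * iris$Sepal.Width,
                iris$Species, mean)

# Prediction = group intercept + slope * new x
pred_within <- unname(alpha["virginica"] + slope * 3)

# Reference: the dummy-variable regression gives the same prediction
fit_dummy  <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
pred_dummy <- unname(predict(
  fit_dummy,
  newdata = data.frame(Sepal.Width = 3, Species = "virginica")
))

all.equal(pred_within, pred_dummy)
```

The two predictions agree because demeaning within groups and including group dummies yield the same slope estimate.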
Late to the party, but the new fixest package has a predict method. It supports high-dimensional fixed effects (and clustering, etc.) using a very similar syntax to lfe. Somewhat remarkably, it is also considerably faster than lfe for the benchmark cases that I've tested.
library(fixest)
model_feols <- feols(data = iris, Sepal.Length ~ Sepal.Width | Species)
predict(model_feols, newdata = data.frame(Sepal.Width = 3, Species = "virginica"))
# Works
This might not be the answer that you are looking for, but it seems that the author did not add any functionality to the lfe package in order to make predictions on external data by using the fitted felm model. The primary focus seems to be on the analysis of the group fixed effects. However, it's interesting to note that in the documentation of the package the following is mentioned:
The object has some resemblance to an 'lm' object, and some
postprocessing methods designed for lm may happen to work. It may
however be necessary to coerce the object to succeed with this.
Hence, it might be possible to coerce the felm object to an lm object in order to obtain some additional lm functionality (if all the required info is present in the object to perform the necessary computations).
The lfe package is intended to be run on very large datasets and effort was made to conserve memory: As a direct result of this, the felm object does not use/contain a qr decomposition, as opposed to the lm object. Unfortunately, the lm predict procedure relies on this information in order to compute the predictions. Hence, coercing the felm object and executing the predict method will fail:
> model2 <- felm(data = iris, Sepal.Length ~ Sepal.Width | Species)
> class(model2) <- c("lm","felm") # coerce to lm object
> predict(model2, newdata = data.frame(Sepal.Width = 3, Species = "virginica"))
Error in qr.lm(object) : lm object does not have a proper 'qr' component.
Rank zero or should not have used lm(.., qr=FALSE).
If you really must use this package to perform the predictions then you could maybe write your own simplified version of this functionality by using the information that you have available in the felm object. For example, the OLS regression coefficients are available via model2$coefficients.
This should work for cases where you wish to ignore the group effects in the prediction, are predicting for new X's, and only want confidence intervals. It first looks for a clustervcv attribute, then robustvcv, then vcv.
predict.felm <- function(object, newdata, se.fit = FALSE,
interval = "none",
level = 0.95){
if(missing(newdata)){
stop("predict.felm requires newdata and predicts for all group effects = 0.")
}
tt <- terms(object)
Terms <- delete.response(tt)
attr(Terms, "intercept") <- 0
m.mat <- model.matrix(Terms, data = newdata)
m.coef <- as.numeric(object$coef)
fit <- as.vector(m.mat %*% object$coef)
fit <- data.frame(fit = fit)
if(se.fit || interval != "none"){
if(!is.null(object$clustervcv)){
vcov_mat <- object$clustervcv
} else if (!is.null(object$robustvcv)) {
vcov_mat <- object$robustvcv
} else if (!is.null(object$vcv)){
vcov_mat <- object$vcv
} else {
stop("No vcv attached to felm object.")
}
se.fit_mat <- sqrt(diag(m.mat %*% vcov_mat %*% t(m.mat)))
}
if(interval == "confidence"){
t_val <- qt((1 - level) / 2 + level, df = object$df.residual)
fit$lwr <- fit$fit - t_val * se.fit_mat
fit$upr <- fit$fit + t_val * se.fit_mat
} else if (interval == "prediction"){
stop("interval = \"prediction\" not yet implemented")
}
if(se.fit){
return(list(fit=fit, se.fit=se.fit_mat))
} else {
return(fit)
}
}
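The interval arithmetic used in the function (standard error as sqrt(diag(X V X')), then a t quantile) can be sanity-checked against predict.lm on an ordinary lm fit, where a built-in answer exists. A sketch in base R follows; the model and the newdata values are arbitrary choices for illustration, and unlike the function above it keeps the intercept because lm estimates one.

```r
fit     <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
newdata <- data.frame(Sepal.Width = c(3, 3.5), Petal.Length = c(4, 5))

# Design matrix for the new observations (response removed from the terms)
X <- model.matrix(delete.response(terms(fit)), data = newdata)

# Point predictions and their standard errors
est <- as.vector(X %*% coef(fit))
se  <- sqrt(diag(X %*% vcov(fit) %*% t(X)))

# 95% confidence interval; (1 - level) / 2 + level is 0.975 for level = 0.95
level  <- 0.95
t_val  <- qt((1 - level) / 2 + level, df = fit$df.residual)
manual <- cbind(fit = est, lwr = est - t_val * se, upr = est + t_val * se)

builtin <- predict(fit, newdata, interval = "confidence", level = level)
all.equal(unname(manual), unname(builtin))
```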
To extend the answer from pbaylis, I created a slightly longwinded function that extends nicely to allow for more than one fixed effect. Note that you have to manually enter the original dataset used in the felm model. The function returns a list with two items: the vector of predictions, and a dataframe based on the new_data that includes the predictions and fixed effects as columns.
predict_felm <- function(model, data, new_data) {
require(dplyr)
# Get the names of all the variables
y <- model$lhs
x <- rownames(model$beta)
fe <- names(model$fe)
# Demean according to fixed effects
data_demeaned <- demeanlist(data[c(y, x)],
as.list(data[fe]),
na.rm = TRUE)
# Create formula for LM and run prediction
lm_formula <- as.formula(
paste(y, "~", paste(x, collapse = "+"))
)
lm_model <- lm(lm_formula, data = data_demeaned)
lm_predict <- predict(lm_model,
newdata = new_data)
# Collect coefficients for fe
fe_coeffs <- getfe(model) %>%
select(fixed_effect = effect, fe_type = fe, idx)
# For each fixed effect, merge estimated fixed effect back into new_data
new_data_merge <- new_data
for (i in fe) {
fe_i <- fe_coeffs %>% filter(fe_type == i)
by_cols <- c("idx")
names(by_cols) <- i
new_data_merge <- left_join(new_data_merge, fe_i, by = by_cols) %>%
select(-matches("^idx"))
}
if (length(lm_predict) != nrow(new_data_merge)) stop("unmatching number of rows")
# Sum all the fixed effects
all_fixed_effects <- base::rowSums(select(new_data_merge, matches("^fixed_effect")))
# Create dataframe with predictions
new_data_predict <- new_data_merge %>%
mutate(lm_predict = lm_predict,
felm_predict = all_fixed_effects + lm_predict)
return(list(predict = new_data_predict$felm_predict,
data = new_data_predict))
}
model2 <- felm(data = iris, Sepal.Length ~ Sepal.Width | Species)
predict_felm(model = model2, data = iris, new_data = data.frame(Sepal.Width = 3, Species = "virginica"))
# Returns prediction and data frame
I think what you're looking for might be the lme4 package. I was able to get a predict to work using this:
library(lme4)
data(iris)
model2 <- lmer(data = iris, Sepal.Length ~ (Sepal.Width | Species))
predict(model2, newdata = data.frame(Sepal.Width = 3, Species = "virginica"))
1
6.610102
You may have to play around a little to specify the particular effects you're looking for, but the package is well-documented so it shouldn't be a problem.

R: Regression of each variable depending on all the others

In R, I have the following data.frame:
df <- data.frame(var1,var2,var3)
I would like to fit a regression function, like multinom, for each variable with respect to the others, without using the variable names explicitly. In other words, I would like to obtain this result:
fit1 <- multinom(var1 ~ ., data=df)
fit2 <- multinom(var2 ~ ., data=df)
fit3 <- multinom(var3 ~ ., data=df)
But in a for loop, without using the variable names (so that I can use the same code for any data.frame). Something similar to this:
for (i in colnames(df))
{
fit[i] <- lm(i ~ ., data=df)
}
(This code does not work.)
Maybe my question is trivial, but I have no idea on how to proceed.
Thanks!
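You need an extra step that builds the formula object from a string. Name the list elements up front so that fit[[i]] fills the existing slots instead of appending new ones:
fit <- setNames(vector(mode = "list", length = ncol(df)), colnames(df))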
for (i in colnames(df)) {
fm <- as.formula(paste0(i, " ~ ."))
fit[[i]] <- lm(fm, data = df)
}
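The same result can be had without an explicit loop. The sketch below uses lapply over a named vector of column names and reformulate() to build each formula; the three mtcars columns are only a stand-in for the question's var1, var2 and var3, and lm stands in for multinom.

```r
df <- mtcars[, c("mpg", "wt", "hp")]   # stand-in data (assumption)

# One model per column: that column as response, all others as predictors
fits <- lapply(
  setNames(names(df), names(df)),      # named input gives a named result
  function(v) lm(reformulate(".", response = v), data = df)
)

names(fits)       # same as the column names of df
formula(fits$wt)  # the formula that was built for wt
```

Using reformulate() rather than pasting strings sidesteps manual string assembly, and the returned list is already named after the response variables.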
