Loop over the AIC criterion for optimal model selection in R

I have an xlsx file (data.xlsx) with 20 columns. The first column contains the dependent variable (Y) and all the others the independent ones (X1, X2, X3, ..., X19). I want to create a loop that calculates the AIC for all possible combinations of the Xi's with Y and prints the results. I have the following code, which does not include the loop.
install.packages("readxl")
library(readxl)
data <- read_excel("data.xlsx")
data
lm1 <- lm(Y ~ ., data = data)   # full model: Y against all 19 predictors
AIC(lm1)                        # calculate the AIC value for one model
Y <- data[, 1]    # dependent variable
Y
X <- data[, -1]   # independent variables X1, ..., X19
X
How can I create a loop to calculate the AIC of all combinations? Can someone help me?

There would be a lot of possible combinations (2^19 - 1 = 524,287 non-empty subsets of the 19 predictors). If your goal is to find the best model, you should try the stepAIC function from the MASS package with direction = "both".
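For example, a minimal sketch of that approach, assuming the sheet has already been read into `data` with the dependent variable named Y as in the code above:
library(MASS)
full_model <- lm(Y ~ ., data = data)   # start from the model with all 19 predictors
best_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(best_model)
AIC(best_model)   # AIC of the model stepAIC settled on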

Related

Why is R removing some residuals and how to avoid it?

I am creating linear models in R and testing them for model assumptions.
I noticed that when I create my models, R removes some residuals, giving this:
(2 observations deleted due to missingness)
This prevents me from checking the relationship between the independent variable and the residuals, and from doing any further analysis, because x and y now have different lengths.
edit:
Do you have any ideas on how to fix this?
R isn't removing residuals when you run lm(). Rather, it cannot create residuals for samples that have any missing data in the model (nor actually use them in the analysis). Therefore, the summary(model_5) output notifies you that some samples (observations) cannot be used (i.e., are deleted).
To run a correlation between the residuals and the independent variable when their lengths differ, and when for some reason we cannot simply locate and drop the incomplete rows (e.g., if dataset[!complete.cases(dataset), ] isn't working), we first need another way to determine which observations were kept or removed by the model. We can rely on an observation ID or the dataset's row names for this.
Example
# sample data
set.seed(12345)
dataset <- data.frame(indep_var = c(NA, rnorm(9)), dep_var = c(rnorm(9), NA))
dataset$index <- rownames(dataset)
# model residuals
resid <- lm(dep_var ~ indep_var, data = dataset)$residuals
dataset.resid <- data.frame(index = names(resid), resid)
# join or match the residuals and the variables by their observation identifier
cor.data <- dplyr::inner_join(dataset.resid, dataset, by = "index")
# correlation analysis
cor.test(~ resid + indep_var, cor.data)
Note that names(resid) are the row names of the observations used in the model from dataset. Any unused rownames(dataset) or observations/rows from dataset (due to missingness) will not be included in names(resid).
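As a small illustration (not part of the original answer), the dropped observations can be listed directly by comparing the two sets of names:
setdiff(rownames(dataset), names(resid))   # rows lm() could not use due to missingness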

Effect of an interaction term in a linear model

I am using the effects package to find the effect of variables in my linear model.
library(effects)
data(iris)
lm1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length*Petal.Width,data=iris)
For a simple term in the model, I can get the effects for each data point using
effect("Sepal.Width", lm1, xlevels=iris['Sepal.Width'])
How can I get a similar 1-dimensional vector of values for my interaction term at each point? Does this even make sense? Everything I've tried returns a 2-d matrix, e.g.
effect("Petal.Length:Petal.Width", lm1 ,xlevels=iris['Petal.Length']*iris['Petal.Width'])
I'm not sure what should be used for the xlevels argument in this case to give me more than just the default 5 equally spaced points.
I think I've figured out something that gives me what I want.
# Create dataframe with all possible combinations
eff_df <- data.frame(effect("Petal.Length:Petal.Width", lm1,
                            xlevels = list(Petal.Length = iris$Petal.Length,
                                           Petal.Width  = iris$Petal.Width)))
# Create column to merge on in eff_df
eff_df$merge_col <- paste0(eff_df$Petal.Length,eff_df$Petal.Width)
# Create column it will share in iris
iris$merge_col <- paste0(iris$Petal.Length,iris$Petal.Width)
# Only eff_df$fit values whose key appears in iris will be merged
iris <- merge(iris, eff_df[, c("merge_col", "fit")], by = "merge_col", all.x = TRUE)
Then the effects vector is stored in iris$fit.

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust GLMs in R, but I can't figure out why I am unable to get glmrob to predict values from my regression models when some columns are dropped due to co-linearity. Specifically, when I use the predict function on a glmrob fit, it always gives NA for all values. I don't observe this when predicting values from the same data and model using glm. It doesn't seem to matter what data I use -- as long as there is an NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
# Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather in a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x is the rank of the model matrix). Instead it merely picks the first x coefficients without checking whether they are NA. This explains why the problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    # piv <- seq_len(p)                  # old code
    piv <- which(!is.na(coef(object)))   # new code
}
else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
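For reference, a hedged sketch (my own, not from the original answer; predict_lmrob_patched is a made-up name) of how such a patched copy can be wired up so it still finds robustbase's internal helpers:
# start from the unexported method, then edit its body as shown above
predict_lmrob_patched <- robustbase:::predict.lmrob
# ... apply the one-line change (piv <- which(!is.na(coef(object)))) to the body ...
environment(predict_lmrob_patched) <- asNamespace("robustbase")   # keep internal helpers visible
predict_lmrob_patched(mod, newdata = df)   # call the patched copy directly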

Random predictions from linear model in R

I have some data with some missing values for one variable, and I want to be able to create (random) predictions for what these could be. Here's my first thought:
# miss indicates where the observations with missing response are
library(MASS)
model <- glm.nb(data[-miss,4] ~ ., data=data[-miss,-4])
predict(model, newdata=data[miss,-4])
However, if I repeat the last line it gives the same answers over and over: it appears to give the predicted mean response given the data and the model. I want a random prediction that incorporates variance, i.e. a random draw from the response distribution of an observation with those predictor values under the given model.
It could have something to do with the pred.var argument, but I'm unsure how to use that.
Suppose we have data like this:
set.seed(101)
dd <- data.frame(x=(1:20)*0.1)
dd$y <- rnbinom(20,mu=exp(dd$x),size=1)
## make some missing values
miss <- c(2,3,5)
dd$y[miss] <- NA
Now fit a model:
m1 <- MASS::glm.nb(y ~ x, data = dd, na.action = na.exclude)
Now use predictions from that model to get the expected mean value and rnbinom to generate the random values ...
p <- predict(m1,newdata=dd,type="response")
randvals <- rnbinom(length(p),mu=p,size=m1$theta)
(This gives random values for every element, not just the missing ones, but obviously you can pick out just the ones you want ...) It would be nice if the simulate method did this, but it's not quite flexible enough ...
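For instance (a small addition, not part of the original answer; y_filled is a made-up name), the simulated draws can be slotted into just the missing positions:
dd$y_filled <- dd$y
dd$y_filled[miss] <- randvals[miss]   # replace only the missing responses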

Finding a certain variable name from a glm model summary in R

I have a glm model and I want to select the name of the variable whose coefficient has the highest p-value. I know how to find the highest p-value, and I know how to get the position of the variable (in the order in which it appears in the model), but I don't know how to get the variable name itself. The reason I want this is that I would like a loop that, on each iteration, removes the variable with the least significant coefficient and reruns the model. I would do it manually, but I just have way too many variables.
The following sample code could help. It outputs the name of the coefficient with the largest p-value, ignoring the intercept.
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
coefficients <- coef(summary(mylogit))
# name of the coefficient with the largest p-value, ignoring the intercept row
maxPColumn <- rownames(coefficients)[-1][which.max(coefficients[-1, 4])]
maxPColumn
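Building on that snippet, here is a hedged sketch (my own, not from the answer; backward_by_p and alpha are made-up names) of the elimination loop the question describes, assuming a binomial glm with main effects only; mapping coefficient names back to factor columns is done only crudely:
backward_by_p <- function(response, data, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- glm(reformulate(predictors, response), data = data, family = "binomial")
    coefs <- coef(summary(fit))[-1, , drop = FALSE]   # ignore the intercept row
    worst <- which.max(coefs[, 4])                    # coefficient with the largest p-value
    if (coefs[worst, 4] <= alpha || length(predictors) == 1) return(fit)
    # crude mapping from coefficient name (e.g. "rank2") back to a column ("rank")
    drop_var <- predictors[startsWith(rownames(coefs)[worst], predictors)][1]
    if (is.na(drop_var)) return(fit)                  # bail out if the name cannot be mapped
    predictors <- setdiff(predictors, drop_var)
  }
}
final_model <- backward_by_p("admit", mydata)
summary(final_model)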
