Get Index of variables from stepAIC - r

I am regressing a gene on another gene subset. Then I use stepAIC to reduce the number of explanatory genes. How do I get the index of the NON-omitted variables, so that I could analyse them?
gene_subset <- data.frame(y = genes[, i], genes[, other_genes])
reduced_model <- stepAIC(lm(y ~ ., data = gene_subset), trace = FALSE)

Here is one solution that I got from the r-help mailing list; any more efficient approaches would be welcome.
# create example data frame
y <- rnorm(30)
gene_subset <- data.frame(y, x1=rnorm(30), x2=rnorm(30), x3=100*y+rnorm(30))
library(MASS)  # provides stepAIC()
# fit a full linear model
fit <- lm(y ~ ., data = gene_subset)
# reduce the model
reduced_model <- stepAIC(fit, trace=FALSE)
# NON-omitted variables (excluding the response)
keepx <- names(reduced_model$model)[-1]
index <- match(keepx, names(gene_subset))
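Another way to get the retained predictors is to read the term labels off the reduced model. This is only a sketch: it assumes every predictor enters the formula as a plain column (no transformations or interactions), and the set.seed() call and simulated data are purely illustrative.

```r
library(MASS)  # provides stepAIC()

set.seed(42)   # illustrative seed, only so the example is reproducible
y <- rnorm(30)
gene_subset <- data.frame(y, x1 = rnorm(30), x2 = rnorm(30), x3 = 100 * y + rnorm(30))

# fit the full model, then let stepAIC() drop terms
reduced_model <- stepAIC(lm(y ~ ., data = gene_subset), trace = FALSE)

# names of the predictors that survived the selection
keepx <- attr(terms(reduced_model), "term.labels")

# their column positions in the original data frame
index <- match(keepx, names(gene_subset))
```

gene_subset[, index, drop = FALSE] then gives the retained columns for further analysis.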

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would then like to extract the regression coefficient (not the intercept) and its sign, and store them in two vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should pick the first column, then the second, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use an explicit loop, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
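The original error arises because FAA_unique_with_duration_filtered[x] returns a one-column data frame (a list), which lm() cannot use as a variable. If you do prefer a loop, one option is to build each formula from the column name with reformulate(). A minimal sketch on anscombe, where the predictor names are chosen only for illustration:

```r
a <- anscombe
predictors <- paste0("x", 1:4)   # illustrative choice of predictor columns

# preallocate a named vector, one slot per predictor
coefs <- setNames(numeric(length(predictors)), predictors)

for (p in predictors) {
  # reformulate("x1", response = "y1") builds the formula y1 ~ x1
  fit <- lm(reformulate(p, response = "y1"), data = a)
  coefs[p] <- coef(fit)[[2]]     # slope only, not the intercept
}

signs <- sign(coefs)
```

Assigning into preallocated vectors also sidesteps the problem that append() returns a new vector rather than modifying its argument.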

Regarding multiple linear models simultaneously

I am trying to run many linear regression models simultaneously. Please help me write code for this.
I am working with two data frames. The first data frame has 100 dependent variables and the second has 100 independent variables. I want simple linear models like
lm1 <- lm(data_frame_1[[1]] ~ data_frame_2[[1]])
lm2 <- lm(data_frame_1[[2]] ~ data_frame_2[[2]])
and so on. That means I have to run 100 regression models, and I want to do this simultaneously. Please help me write code to run all of these models.
It is not entirely clear what you mean by simultaneously, but maybe a loop is fine in your case?
model.list <- vector("list", 100)
for (i in 1:100) {
  model.list[[i]] <- lm(data_frame_1[[i]] ~ data_frame_2[[i]])
}
Using dataframe_1 and dataframe_2 defined in the Note at the end, we define a function LM that takes an x name and a y name and regresses y on x using the columns from those data frames. Applying it with Map gives a list of lm objects. Note that the Call: line in the output of each list component correctly identifies which columns were used.
LM <- function(xname, yname) {
fo <- formula(paste(yname, "~", xname))
do.call("lm", list(fo, quote(cbind(dataframe_1, dataframe_2))))
}
Map(LM, names(dataframe_1), names(dataframe_2))
giving:
$x1
Call:
lm(formula = y1 ~ x1, data = cbind(dataframe_1, dataframe_2))
Coefficients:
(Intercept) x1
3.0001 0.5001
... etc ...
Note
Using the built-in anscombe data frame, define dataframe_1 as the x columns and dataframe_2 as the y columns.
dataframe_1 <- anscombe[grep("x", names(anscombe))]
dataframe_2 <- anscombe[grep("y", names(anscombe))]
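Once the models are in a list, the same coefficient extraction can be applied to every element. A small sketch, assuming the dataframe_1 and dataframe_2 definitions from the Note and using a plain anonymous function in place of LM (so the Call: lines will not show the column names):

```r
dataframe_1 <- anscombe[grep("x", names(anscombe))]  # x columns
dataframe_2 <- anscombe[grep("y", names(anscombe))]  # y columns

# pair up the columns position by position and fit one model per pair
models <- Map(function(x, y) lm(y ~ x), dataframe_1, dataframe_2)

# pull the slope out of each fitted model
slopes <- sapply(models, function(m) coef(m)[[2]])
```

The result is a named numeric vector with one slope per x column.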

test significance between models with emmeans

Let's say I have these two models
dat1 <- data.frame(x=factor(c(1,2,1,1,2,2)),y=c(2,5,2,1,7,9))
dat2 <- data.frame(x=factor(c(1,2,1,1,2,2)),y=c(3,3,4,3,4,2))
mod1 <- lm(y~x,data=dat1)
mod2 <- lm(y~x, data=dat2)
and calculate a t test between the levels of x in each model
library(emmeans)
t1 <- pairs(emmeans(mod1, ~x))
t2 <- pairs(emmeans(mod2, ~x))
How can I assess whether the two models are significantly different for this contrast using emmeans?
dat1$dataset <- "dat1"
dat2$dataset <- "dat2"
alldat <- rbind(dat1, dat2)
modsame <- lm(y ~ x, data = alldat)
moddiff <- lm(y ~ x * dataset, data = alldat)
anova(modsame, moddiff)
Don't try to use emmeans() to do this; that isn't its purpose. The anova() call above compares the two models: modsame presumes that the x effects are the same in each dataset; moddiff adds two terms, dataset which accounts for the change in overall mean, and x:dataset which accounts for the change in x effects.
The comparison between the two models comprises a joint test of both the dataset and the x:dataset effects -- it is an F test with 2 numerator d.f. -- not a t test.
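If the question is specifically whether the x effect differs between the two datasets (rather than the joint test above), one possibility is to look at the x:dataset coefficient in moddiff. With two levels of x and two datasets, that coefficient equals, up to sign, the difference between the two pairwise contrasts. This is a sketch that assumes a common residual variance across the datasets; the name interaction_row is illustrative.

```r
dat1 <- data.frame(x = factor(c(1, 2, 1, 1, 2, 2)), y = c(2, 5, 2, 1, 7, 9))
dat2 <- data.frame(x = factor(c(1, 2, 1, 1, 2, 2)), y = c(3, 3, 4, 3, 4, 2))
dat1$dataset <- "dat1"
dat2$dataset <- "dat2"
alldat <- rbind(dat1, dat2)

moddiff <- lm(y ~ x * dataset, data = alldat)

# row for the interaction term: estimate, std. error, t value, p-value
interaction_row <- summary(moddiff)$coefficients["x2:datasetdat2", ]
interaction_row
```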

Get list of R-squared values for linear regression model as we incrementally add predictors

I have a regression that predicts y based on 14 x-values (x1 through x14). I want to write a loop that does a regression where each iteration of the loop adds one more predictor to the regression, then tells me what the r-squared is. Here is my code:
rsqvals <- rep(NA, 15)
for (i in 1:15){
simtemp2 <- simdata[, 1:i]
modeL <- lm(y ~ ., data=simtemp2)
rsqvals[i] <- summary(modeL)$r.squared
}
where simdata is my data frame and simtemp2 is the columns I want. I suspect the problem has something to do with the fact that I can't type simdata[, 1:i], but I'm not sure why not. Any help appreciated!
It looks like you are subsetting the data frame too far on the first iteration. In your first iteration you get simtemp2 <- simdata[, 1:1], which selects a single column, so R drops the result to a plain vector rather than a data frame. Even if you convert simtemp2 back into a data.frame, lm() will not like it as a parameter. Try starting at 2 and see if this works:
rsqvals <- rep(NA, 15)
interceptonly <- lm(y~1,data=simdata) ### no features, only the intercept
### this isn't statistically meaningful, but I put it here for completeness
rsqvals[1] <- summary(interceptonly)$r.squared
for (i in 2:15){
simtemp2 <- simdata[, 1:i]
modeL <- lm(y ~ ., data=simtemp2)
rsqvals[i] <- summary(modeL)$r.squared
}
print(rsqvals)
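A variant that avoids column-position subsetting is to build each formula from the predictor names. The sketch below uses simulated data as a stand-in for simdata and assumes the response column is named y and the predictors x1 through x14; both the data and the seed are made up for illustration.

```r
set.seed(1)  # illustrative seed
simdata <- data.frame(y = rnorm(50), matrix(rnorm(50 * 14), ncol = 14))
names(simdata)[-1] <- paste0("x", 1:14)

preds <- names(simdata)[-1]

# model k uses the first k predictors: y ~ x1, y ~ x1 + x2, ...
rsqvals <- sapply(seq_along(preds), function(k) {
  fit <- lm(reformulate(preds[1:k], response = "y"), data = simdata)
  summary(fit)$r.squared
})

print(rsqvals)
```

Because each model nests the previous one, the R-squared values never decrease as predictors are added.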

lm(): loop through multiple linear models exporting p-value of F-statistic

I have a large data set for which I need to run a linear model comparing groups.
I need to find the p-values for group comparisons using a linear model. There are four groups (so I need 1~2, 1~3, 1~4, 2~3, 2~4, 3~4) and there are 130 columns for which the data from these groups need to be compared. Any help would be greatly appreciated!!
I have this, which gives me exactly what I need.
fit<-lm(variable~group, data=data)
summary(fit)
However, with all of the groups and columns, I have nearly 800 comparisons to make, so I want to avoid doing this manually. I tried writing a for loop, but it isn't working.
k<-data.frame()
for (i in 1:130){
[i,1]<-colnames(data)
fit<- lm(i~group, data=data)
[i,2] <- fit$p.value
}
But this has given me a variety of different errors. I really just need the p-values. Help would be greatly greatly appreciated!! Thank you!
(2016-06-18) Your question is not completely answerable at this stage. In the following, I shall point out several problems.
How to get p-value properly
I assume you want the p-value of the F-statistic for the model, as an indication of goodness of fit. Suppose your fitted model is fit; then you can compute it this way:
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
As an example, I will use the built-in dataset trees as a demonstration.
fit <- lm(Height ~ Girth, trees)
## truncated output of summary(fit)
# > summary(fit)
# Residual standard error: 5.538 on 29 degrees of freedom
# Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
# F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
## > p_value
# [1] 0.002757815
So, p_value agrees with the printed summary.
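An equivalent way to get the upper tail is to pass lower.tail = FALSE to pf(), which avoids the subtraction and so does not round to zero when the p-value is extremely small. A short sketch on the same trees fit:

```r
fit <- lm(Height ~ Girth, data = trees)
fstatistic <- summary(fit)$fstatistic

# upper-tail probability of the F distribution, computed directly
p_value <- unname(pf(fstatistic[1], fstatistic[2], fstatistic[3],
                     lower.tail = FALSE))
p_value
```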
Your loop
I suggest you use vectors rather than a data frame during computation/update.
variable <- character(130)
p.value <- numeric(130)
You can combine the results at the end to a data frame via:
k <- data.frame(var = variable, p.value = p.value)
Why? Because this is memory efficient! Now, after those corrections, we arrive at:
variable <- character(130)
p.value <- numeric(130)
for (i in 1:130) {
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
Further problems
I still don't think the code above will work, because I am not sure whether the following lines do the right thing:
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
During the loop, data is not changed, so colnames(data) returns the whole vector of column names. Hence variable[i] <- colnames(data) triggers a warning (the replacement has more than one element) and stores only the first column name.
i~group looks odd. Do you have i in your data?
I can't help you solve these issues without knowing what your data looks like. If you could post a subset of your data, that would help.
Follow-up (2016-06-19)
Thank you. This has been extremely helpful. I don't have "i" in my data, but I was hoping that I could use that to represent the different column names, so that it goes through all of them. Is there a way to assign column names numbers so that this would work?
Yes, but I need to know what you have for each column.
Column 1 has a group number. The following columns have data for different factors I am looking at.
OK, so I think ncol(data) = 131, where the first column is group, and the remaining 130 columns are what you will test. Then this should work:
variable <- colnames(data)[-1]
p.value <- numeric(130)
for (i in 1:130) {
fit <- lm(paste(variable[i], "group", sep = "~"), data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
It is possible to use sapply() instead of the above for loop, but I think there is no performance difference, as loop overhead is tiny compared with lm() and summary().
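For reference, an sapply() version might look like the sketch below. It uses the built-in iris data as a stand-in, with Species playing the role of group; the helper name f_pvalue is made up for this example.

```r
# hypothetical helper: p-value of the overall F-statistic of an lm fit
f_pvalue <- function(fit) {
  fs <- summary(fit)$fstatistic
  unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE))
}

# every column except the grouping column is a response to test
variable <- setdiff(names(iris), "Species")

p.value <- sapply(variable, function(v) {
  f_pvalue(lm(reformulate("Species", response = v), data = iris))
})

k <- data.frame(var = variable, p.value = unname(p.value))
k
```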
I think this can get you started at least. It uses the dplyr and broom packages. The basic idea is to define all the formulas you want as character strings, then use lapply() to run them through lm().
library(dplyr)
library(broom)
# Generate a vector of wanted formulas
forms <- c("mpg ~ cyl", "mpg ~ wt")
# Function to apply formula
lmit <- function(form){
tidy(lm(as.formula(form), mtcars)) %>%
mutate(formula = form)
}
# Apply it and bind into a dataframe
results <- bind_rows(lapply(forms, lmit))

Resources