I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.
Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.
I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.
Any help would be appreciated!
You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.
set.seed(1)
# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5
# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))
# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))
# extract just coefficients
sapply(my_lms, coef)
# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))
The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
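Since the question asks for the coefficients in a table, one possible sketch is below. It rebuilds the same dummy data so it runs on its own; the column labels and the output file name (lung_blood_coefs.csv) are assumptions made for illustration and do not come from the original post.

```r
set.seed(1)
n <- 5      # number of gene columns (22,000 in the real data)
obs <- 50   # number of patients
Lung   <- data.frame(matrix(rnorm(obs * n), ncol = n))
Blood  <- data.frame(matrix(rnorm(obs * n), ncol = n))
Age    <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))

# sapply() returns one column per gene, so transpose to get one row per gene
coef_table <- t(sapply(seq_len(n), function(x) {
  coef(lm(Lung[, x] ~ Blood[, x] + Age + Gender))
}))

# assumed labels; the default names would be "(Intercept)", "Blood[, x]", ...
colnames(coef_table) <- c("Intercept", "Blood", "Age", "Gender1")
coef_table <- data.frame(gene = names(Lung), coef_table)

# hypothetical file name, uncomment to save the table
# write.csv(coef_table, "lung_blood_coefs.csv", row.names = FALSE)
```

Each row of coef_table then holds the four coefficients for one gene, which is usually easier to filter or sort than a list of model objects.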
The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it in R (using the diamonds dataset from ggplot2):
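diamonds <- ggplot2::diamonds  # avoid attach(); pass the data explicitly instead
strCols <- names(diamonds)
formula <- list(); model <- list()
for (i in 1:1) {  # widen the range to loop over more predictor columns
  # column 7 is price; column 7+i is the predictor for this iteration
  formula[[i]] <- paste0(strCols[7], " ~ ", strCols[7+i])
  model[[i]] <- glm(as.formula(formula[[i]]), data = diamonds)
  # then you can plot or do anything else with the result ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))
  plot(model[[i]])
  dev.off()
}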
Sensible or not, to make the loop at least somehow work you need:
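y <- c(1,5,6,2,5,10)    # response
x1 <- c(2,12,8,1,16,17) # predictor
x2 <- c(2,14,5,1,17,17)
df <- data.frame(y, x1, x2)  # the data frame that lm() is pointed at
predictorlist <- list("x1","x2")
for (i in predictorlist){
  # build "y ~ x1", "y ~ x2", ... and convert the string to a formula
  model <- lm(as.formula(paste("y ~", i)), data=df)
  print(summary(model))
}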
The paste function will solve the problem.
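As an alternative to pasting strings together, base R's reformulate() builds the formula object directly from variable names. The sketch below uses the same toy numbers as above; the object name slopes is just an illustrative choice.

```r
df <- data.frame(
  y  = c(1, 5, 6, 2, 5, 10),
  x1 = c(2, 12, 8, 1, 16, 17),
  x2 = c(2, 14, 5, 1, 17, 17)
)
predictorlist <- c("x1", "x2")

# one simple regression per predictor; keep only the slope
slopes <- sapply(predictorlist, function(p) {
  f <- reformulate(p, response = "y")   # same as as.formula(paste("y ~", p))
  coef(lm(f, data = df))[[2]]
})
slopes
```

The result is a named numeric vector, one slope per predictor, so there is no need to print and read each summary by hand.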
A tidyverse addition - with map()
Another way - using map2() from the purrr package:
library(purrr)
xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]
map2_df(ys, xs,
function(i,j){
m <- lm(i ~j + x4 , data = anscombe)
coef(m)
})
The output is a dataframe (tibble) of all coefficients:
`(Intercept)` j x4
1 4.33 0.451 -0.0987
2 6.42 0.373 -0.253
3 2.30 0.526 0.0518
If more than two variables change between models, the same approach works with the pmap() family of functions.
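A minimal sketch of the pmap() variant follows. The choice of mtcars columns is an assumption made purely for illustration (anscombe is less convenient here because x1, x2 and x3 are identical), and the results are bound with base rbind rather than a tibble.

```r
library(purrr)

# assumed example: three sets of columns that change together
ys <- mtcars[c("mpg", "qsec")]    # responses
xs <- mtcars[c("wt", "hp")]       # first predictor
zs <- mtcars[c("disp", "drat")]   # second predictor

# pmap() walks the three lists in parallel and passes one column of each
fits <- pmap(list(ys, xs, zs), function(y, x, z) coef(lm(y ~ x + z)))

# one row per model, with columns (Intercept), x and z
coef_mat <- do.call(rbind, fits)
coef_mat
```

Here the first model is mpg ~ wt + disp and the second is qsec ~ hp + drat; the coefficient names are x and z because those are the argument names used inside the function.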
I'm trying to compute AIC scores for several different models in a for loop.
I have created a for loop that computes the log-likelihood for each model. However, I am stuck on how to write the lm call so that it fits one model for each combination of my column LOGABUNDANCE with columns 4 to 11 of my data frame.
This is the code I have used so far, but it gives me the same AIC score for every model.
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
for(i in 1:10){
mod <- lm(LOGABUNDANCE ~ . , data = butterfly)
sigma = as.numeric(summary(mod)[6])
LL[i] <- sum(log(dnorm(butterfly$LOGABUNDANCE, predict(mod), sigma)))
AIC[i] <- -2*LL[i] + 2*(2)
}
You get the same AIC for every model, because you create 10 equal models.
To make the code work, you need some way of changing the model in each iteration.
I can see two options:
1. Either subset the data in the start of each iteration so it only contains LOGABUNDANCE and one other variable (as suggested by #yacine-hajji in the comments), or
2. Create a vector of the variables you want to create models with, and use as.formula() together with paste0() to create a new formula for each iteration.
I think solution 2 is easier. Here is a working example of solution 2, using mtcars:
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
# Say I want to model all variables against `mpg`:
# Create a vector of all variable names except mpg
variables <- names(mtcars)[-1]
for(i in 1:10){
# Note how the formula is different in each iteration
mod <- lm(
as.formula(paste0("mpg ~ ", variables[i])),
data = mtcars
)
sigma = as.numeric(summary(mod)[6])
LL[i] <- sum(log(dnorm(mtcars$mpg, predict(mod), sigma)))
AIC[i] <- -2*LL[i] + 2*(2)
}
Output:
AIC
#> [1] 167.3716 168.2746 179.3039 188.8652 164.0947 202.6534 190.2124 194.5496
#> [9] 200.4291 197.2459
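For comparison, R also ships logLik() and AIC(), so the hand-rolled log-likelihood can be cross-checked against them. The sketch below is not the code from the answer above. Its values need not match the output shown, because AIC() uses the maximum-likelihood estimate of sigma and counts it as a parameter (k = 3 for a one-predictor model), whereas the loop above uses the residual standard error and k = 2. The names aic_builtin and best are illustrative.

```r
variables <- names(mtcars)[-1]

# built-in AIC for mpg against each single predictor
aic_builtin <- sapply(variables, function(v) {
  mod <- lm(reformulate(v, response = "mpg"), data = mtcars)
  stats::AIC(mod)   # stats:: prefix because the loop above reuses the name AIC
})

# predictor with the lowest AIC
best <- names(which.min(aic_builtin))
aic_builtin
best
```

Because every model here has the same number of parameters, the ranking of the predictors should be the same under either calculation; only the absolute values shift.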
I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I then want to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should pick the first column, then the second, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use a for loop, where reg is either of the reg definitions above:
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
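If you do want to keep the loop from the question, two things need to change: a single-bracket subset such as data[x] is a data frame (a list), which lm() cannot use as a variable, and append() returns a new vector rather than modifying its argument in place. A minimal sketch, assuming anscombe as stand-in data with y1 playing the role of distance:

```r
a <- anscombe   # stand-in for FAA_unique_with_duration_filtered
coeff_value_vector <- numeric(0)
RCs_sign_vector <- numeric(0)

for (x in 1:4) {
  # build the formula from the column name instead of passing a[x]
  fit <- lm(reformulate(names(a)[x], response = "y1"), data = a)
  slope <- unname(coef(fit)[2])   # second coefficient, i.e. not the intercept

  # append() returns the extended vector, so the result must be assigned back
  coeff_value_vector <- append(coeff_value_vector, slope)
  RCs_sign_vector <- append(RCs_sign_vector, sign(slope))
}
```

For a known number of predictors, preallocating with numeric(4) and assigning by index (as in the loop above) avoids growing the vectors on every iteration.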
I am running lots of linear models on data from different experiments where we want to exclude a lag time from the start of the experiment. This lag time may vary between runs and is very obvious in the example plot below. Is there a robust way to automatically exclude the lag time? In my example below it would be where x < 1.
I thought the way to do it would be to fit linear models while gradually removing points from the start of the data and then compare the models, but I don't know the best way to compare models fitted to subsetted data.
df <- data.frame (x1 = c(0.7,1.7,2.8,3.7,4.9,6.0,6.7,7.7,8.7,9.7,10.7,12.0,13.1),
y1 = c(22.7,50.7,103.2,143.4,175.2,216.8,234.1,246.6,256.0,266.2,276.0,287.6,295.5))
plot(1/df$x1, log(df$y1), type = "l")
summary(lm(log(y1) ~ I(1/x1), data = df))
summary(lm(log(y1) ~ I(1/x1), data = df[df$x1 > 1,]))
summary(lm(log(y1) ~ I(1/x1), data = df[df$x1 > 2,]))
i. generate a list of 14 data frames, one per cutoff v, where v runs from 0 up to the largest integer not exceeding max(df$x1), each keeping only the rows with x1 > v
library(dplyr)
all.dat <- lapply(max(df$x1) %>% seq(from =0, to=.), function(v) df[df$x1 > v,])
ii. generate list of lm models using 14 data.frames
lm.form <- as.formula("log(y1) ~ I(1/x1)")
all.lm <- lapply(all.dat, function(x)lm(lm.form, data=x))
iii. view summary of all 14 lm models
lapply(all.lm, summary)
sapply(all.lm, function(x)summary(x)$r.sq) #extract r.sq value for all models
[1] 0.9074019 0.9960153 0.9957543 0.9903552 0.9783031 0.9937000 0.9899247 0.9915223 0.9982270 0.9997441 0.9998207 1.0000000 0.0000000 0.0000000
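Note that the last few values come from subsets with only one or two points, where R-squared is trivially 1 or 0. One possible refinement, sketched below, is to skip subsets that are too small and to report the number of points next to each fit. The threshold min_points = 5 is an arbitrary assumption, and this is only a heuristic summary, not a formal way to compare models fitted to different subsets.

```r
df <- data.frame(
  x1 = c(0.7, 1.7, 2.8, 3.7, 4.9, 6.0, 6.7, 7.7, 8.7, 9.7, 10.7, 12.0, 13.1),
  y1 = c(22.7, 50.7, 103.2, 143.4, 175.2, 216.8, 234.1, 246.6, 256.0,
         266.2, 276.0, 287.6, 295.5)
)

min_points <- 5                          # assumed minimum subset size
cutoffs <- seq(0, floor(max(df$x1)))

# keep only cutoffs that leave enough points to fit a meaningful line
n_left <- sapply(cutoffs, function(v) sum(df$x1 > v))
cutoffs <- cutoffs[n_left >= min_points]

# one row per cutoff: points used, adjusted R-squared, residual standard error
res <- do.call(rbind, lapply(cutoffs, function(v) {
  sub <- df[df$x1 > v, ]
  s <- summary(lm(log(y1) ~ I(1/x1), data = sub))
  data.frame(cutoff = v, n = nrow(sub),
             adj_r_sq = s$adj.r.squared, sigma = s$sigma)
}))
res
```

Looking for the cutoff after which adj_r_sq and sigma stop improving noticeably is one way to choose the lag time, though the right criterion depends on the experiment.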
I have a large data set for which I need to run a linear model comparing groups.
I need to find the p-values for group comparisons using a linear model. There are four groups (so I need 1~2, 1~3, 1~4, 2~3, 2~4, 3~4) and there are 130 columns for which the data from these groups needs to be compared. Any help would be greatly appreciated!!
I have this, which gives me exactly what I need.
fit<-lm(variable~group, data=data)
summary(fit)
However, with all of the groups and columns, I have nearly 800 comparisons to make, so I want to avoid doing this manually. I tried writing a for loop, but it isn't working.
k<-data.frame()
for (i in 1:130){
[i,1]<-colnames(data)
fit<- lm(i~group, data=data)
[i,2] <- fit$p.value
}
But this has given me a variety of different errors. I really just need the p-values. Help would be greatly greatly appreciated!! Thank you!
(2016-06-18) Your question is not completely answerable at this stage. In the following, I shall point out several problems.
How to get p-value properly
I assume you want the p-value of the F-statistic for the model, as an indication of goodness of fit. If your fitted model is called fit, it can be computed like this:
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
As an example, I will use the built-in dataset trees as a demonstration.
fit <- lm(Height ~ Girth, trees)
## truncated output of summary(fit)
# > summary(fit)
# Residual standard error: 5.538 on 29 degrees of freedom
# Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
# F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
## > p_value
# [1] 0.002757815
So, p_value agrees with the printed summary.
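The same calculation can be wrapped in a small helper so it is easy to reuse inside a loop. This is a sketch: the function name model_p_value is hypothetical, and it uses pf(..., lower.tail = FALSE) instead of 1 - pf(...), which gives the same number here but avoids rounding to zero for very small p-values.

```r
# hypothetical helper: overall F-test p-value of an lm fit
model_p_value <- function(fit) {
  f <- summary(fit)$fstatistic   # c(value, numdf, dendf)
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
}

fit <- lm(Height ~ Girth, trees)
p_value <- model_p_value(fit)
p_value
```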
Your loop
I suggest you use vectors rather than a data frame during computation/update.
variable <- character(130)
p.value <- numeric(130)
You can combine the results at the end to a data frame via:
k <- data.frame(var = variable, p.value = p.value)
Why? Because this is more memory efficient than updating a data frame inside the loop. After these corrections, we arrive at:
variable <- character(130)
p.value <- numeric(130)
for (i in 1:130) {
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
Further problems
I still don't think the code above will work, because I am not sure the following two lines do what you intend:
variable[i] <- colnames(data)
fit <- lm(i~group, data=data)
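During the loop, data is not changed, so colnames(data) returns the whole vector of column names every time; hence variable[i] <- colnames(data) tries to store many names in a single element and triggers a warning (only the first name is kept).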
i~group looks odd. Do you have i in your data?
I can't help you solve these issues. I have no idea of what your data looks like. But if you could put in a subset of your data, it would be OK.
Follow-up (2016-06-19)
Thank you. This has been extremely helpful. I don't have "i" in my data, but I was hoping that I could use that to represent the different column names, so that it goes through all of them. Is there a way to assign column names numbers so that this would work?
Yes, but I need to know what you have for each column.
Column 1 has a group number. The following columns have data for different factors I am looking at.
OK, so I think ncol(data) = 131, where the first column is group, and the remaining 130 columns are what you will test. Then this should work:
variable <- colnames(data)[-1]
p.value <- numeric(130)
for (i in 1:130) {
fit <- lm(paste(variable[i], "group", sep = "~"), data=data)
fstatistic <- summary(fit)$fstatistic
p_value <- unname(1 - pf(fstatistic[1], fstatistic[2], fstatistic[3]))
p.value[i] <- p_value
}
k <- data.frame(var = variable, p.value = p.value)
It is possible to use sapply() instead of the above for loop, but I expect no real performance difference, as the loop overhead is tiny compared with lm() and summary().
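The loop above gives one overall F-test p-value per column. The question also asked for the pairwise comparisons between the four groups (1~2, 1~3, and so on). One possible sketch uses pairwise.t.test() from base R with its default pooled standard deviation, which for comparisons against the first group should match the coefficient p-values from lm(). The simulated data and column names v1 to v3 are assumptions; group is converted to a factor so it is not treated as a continuous number.

```r
set.seed(1)
# simulated stand-in: first column is the group, the rest are measurements
data <- data.frame(
  group = rep(1:4, each = 10),
  v1 = rnorm(40), v2 = rnorm(40), v3 = rnorm(40)
)
data$group <- factor(data$group)
variable <- colnames(data)[-1]

pairwise_p <- lapply(variable, function(v) {
  # lower-triangular matrix of unadjusted p-values, NA elsewhere
  pm <- pairwise.t.test(data[[v]], data$group,
                        p.adjust.method = "none")$p.value
  idx <- which(!is.na(pm), arr.ind = TRUE)
  data.frame(var = v,
             group_a = colnames(pm)[idx[, "col"]],
             group_b = rownames(pm)[idx[, "row"]],
             p.value = pm[idx])
})
k_pairs <- do.call(rbind, pairwise_p)   # 6 comparisons per variable
```

With roughly 800 comparisons, it may be worth setting p.adjust.method to something other than "none" (for example "holm"), depending on how the p-values will be used.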
I think this can get you started at least. It uses the dplyr and broom packages. The basic idea is to define all the formulas you want as characters then use lapply() to run them through lm().
library(dplyr)
library(broom)
# Generate a vector of wanted formulas
forms <- c("mpg ~ cyl", "mpg ~ wt")
# Function to apply formula
lmit <- function(form){
tidy(lm(as.formula(form), mtcars)) %>%
mutate(formula = form)
}
# Apply it and bind into a dataframe
results <- bind_rows(lapply(forms, lmit))
I need to apply a linear regression formula to all the columns in my data frame called mydf. There are thousands of columns in mydf, so listing each of them in the formula is not possible. Two columns, weight and age, stay the same in the formula for every other column to which I want to apply it.
The formula for first column (column bmd) is
fit1 <- lm(bmd ~ weight + age, data=mydf)
mydf[,"bmd"] <-fit1$fitted.values
I want to apply this formula to other columns as well (except weight and age)
fit1 <- lm(bp ~ weight + age, data=mydf)
mydf[,"bp"] <-fit1$fitted.values
and
fit1 <- lm(choles ~ weight + age, data=mydf)
mydf[,"choles"] <-fit1$fitted.values
What would be the best, most time-efficient way (it currently takes a really long time) to expand this formula and store the fitted values in the right column for all the wanted columns (bmd, bp, choles)?
mydf
bmd bp choles weight age
1 2 3 22.3 12
2 1 2 33.2 13
3 2 5 44.5 16
Try this:
apply(mydf[ ,-c(4:5)], 2, function(x) lm(x ~ mydf$weight + mydf$age)$fitted.values)
EDIT: added missing comma
One method:
# get names of all dependent variables
dependents <- setdiff(names(mydf), c("weight", "age"))
# build a little function
myFit <- function(depName) {
fit1 <- lm(as.formula(paste(depName, "~ weight + age")), data=mydf)
return(fit1$fitted.values)
}
# sapply them
fittedValues <- sapply(dependents, myFit)
Then cbind to your dataset.
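Because the right-hand side (weight + age) is the same for every column, another option is to hand lm() a matrix of responses, so that all columns are fitted in a single call. This is a sketch on simulated data; the values, the seed and the name mydf_fitted are assumptions, and only the column names follow the question.

```r
set.seed(42)
# simulated stand-in for mydf
mydf <- data.frame(
  bmd = rnorm(20), bp = rnorm(20), choles = rnorm(20),
  weight = runif(20, 20, 50), age = sample(10:30, 20)
)

dependents <- setdiff(names(mydf), c("weight", "age"))

# a matrix on the left-hand side fits one regression per column in one call
fit_all <- lm(as.matrix(mydf[dependents]) ~ weight + age, data = mydf)
fitted_all <- fitted(fit_all)   # matrix with one column per dependent variable

# write the fitted values into a copy, keeping the original data untouched
mydf_fitted <- mydf
mydf_fitted[dependents] <- fitted_all
```

Each column of fitted_all should equal the fitted values from the corresponding single-response model, and with thousands of columns this avoids repeating the same design-matrix decomposition for every fit.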