using lm() in R for a series of independent fits

I want to use lm() in R to fit a series of (actually 93) separate linear regressions. According to the R help page for lm():
"If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix."
This works fine as long as there are no missing data points in the Y response matrix. When there are missing points, instead of fitting each regression with the available data, every row that has a missing data point in any column is discarded. Is there any way to specify that lm() should fit all of the columns in Y independently and not discard rows where an individual column has a missing data point?
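For concreteness, here is a minimal sketch of the behaviour (illustrative data only, not my actual 93-column response matrix):
# illustrative sketch only: one NA in one response column drops that row from every fit
set.seed(1)
X <- runif(10)
Y <- matrix(rnorm(30), ncol = 3)
Y[2, 1] <- NA               # a single missing value, in column 1 only
fit <- lm(Y ~ X)            # matrix response: one least-squares fit per column
nobs(fit)                   # 9 -- row 2 is dropped from all three fits, not just the first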

If you are looking to do n regressions between Y1, Y2, ..., Yn and X, you don't specify that through lm(); rather, you should use R's apply functions:
# create the response matrix and set some random values to NA
values <- runif(50)
values[sample(1:length(values), 10)] <- NA
Y <- data.frame(matrix(values, ncol=5))
colnames(Y) <- paste0("Y", 1:5)
# single regression term
X <- runif(10)
# create regression between each column in Y and X
lms <- lapply(colnames(Y), function(y) {
  form <- paste0(y, " ~ X")
  lm(form, data = Y)
})
# lms is a list of lm objects, can access them via [[]] operator
# or work with it using apply functions once again
sapply(lms, function(x) {
  summary(x)$adj.r.squared
})
#[1] -0.06350560 -0.14319796 0.36319518 -0.16393125 0.04843368
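As a further sketch in the same vein (using the lms and Y objects from above), the slope on X can be collected from every fit the same way:
# collect the slope on X from every fit into a named vector
slopes <- sapply(lms, function(m) coef(m)[["X"]])
names(slopes) <- colnames(Y)
slopes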

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would further want to extract the regression coefficient (not the intercept) and its sign and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x], data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[, 1]
  coeff_value <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here, x will index the first column, then the 2nd, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~ ., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following, where reg is either of the reg definitions above:
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
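If you do prefer an explicit loop in the spirit of the original code, the error can be avoided by building the formula from the column name (e.g. with reformulate()) rather than putting the data-frame subset inside it. A sketch on the same anscombe data, since the FAA_unique_with_duration_filtered data frame is not available here:
a <- anscombe
coefs <- numeric(4)
for (i in 1:4) {
  form <- reformulate(names(a)[i], response = "y1")  # e.g. y1 ~ x1
  coefs[i] <- coef(lm(form, data = a))[2]            # slope, not the intercept
}
signs <- sign(coefs)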

regarding residual storing in bootstrap regression in R

I am trying to do bootstrap regression by re-sampling X and Y from the original sample.
I followed a more manual approach (without using any package).
This is my work so far:
library(dplyr)  # needed for %>% and sample_n()
set.seed(326581)
X1=rnorm(10,0,1)
Y1=rnorm(10,0,2)
data=data.frame(X1,Y1)
lst <- replicate(
  100,
  df.smpl <- data %>% sample_n(10, replace = T),
  simplify = FALSE)
The list contains 100 samples, where each sample has 2 columns (X, Y) and a sample size of 10. These are the bootstrap samples.
To get the bootstrap residuals, I separated the X and Y columns into two separate data frames as follows:
new1=data.frame(lapply(lst, `[`, 'X1'))
new2=data.frame(lapply(lst, `[`, 'Y1'))
After that, I tried to store the residuals from each fitted model using the following code:
res=c()
for(i in 1:100)
{
  res[i]=residuals(lm(new2[,i]~new1[,i]))
}
But it seems like something is wrong. Can anyone help me figure that out?
By the way, is there any easier approach than this?
You're making this unnecessarily complicated. The whole advantage of storing objects in a list is that you can easily loop through them with e.g. lapply or sapply.
So for example, to store the residuals of a linear model fit you can do
res <- lapply(lst, function(df) residuals(lm(Y1 ~ X1, data = df)))
This fits a linear model of the form lm(Y1 ~ X1) to all data.frames in lst, and stores the residuals in a list of 100 vectors:
length(res)
#[1] 100
You could also store the residuals of the lm fits to all 100 sampled data.frames in a 10x100 matrix by using sapply instead of lapply:
res <- sapply(lst, function(df)
  residuals(lm(Y1 ~ X1, data = df)))
dim(res)
#[1] 10 100
Update
In response to your comment, you can do the following.
First, calculate and store the residuals and the residual-derived weights in every data.frame in the list.
# Add residuals and weights to lst
lst <- lapply(lst, function(df) {
  df$res <- residuals(lm(Y1 ~ X1, data = df))
  df$weights <- 1 / fitted(lm(abs(res) ~ X1, data = df))^2
  df
})
Then run a weighted linear regression and return the second (slope) coefficient:
# Return the 2nd coefficient of the weighted regression
coeff <- lapply(lst, function(df)
  coefficients(lm(Y1 ~ X1, data = df, weights = weights))[2])
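If you would rather have those 100 weighted slopes as a plain numeric vector than as a list, you can flatten the result:
slopes <- unlist(coeff)   # numeric vector of the 100 weighted slope estimates
length(slopes)
#[1] 100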

Predict() + assign() loop across variables

I have estimated several models (a, b) and I want to calculate predicted probabilities for each model using a single data frame (df) and store the predicted probabilities of each model as new variables in that data frame. For example:
a <- lm(y ~ z, df) # estimate model a
b <- glm(w ~ x, df) # estimate model b
models <- c("a","b") # create vector of model objects
for (i in models) {
  assign(
    paste("df$", i, sep = ""),
    predict(i, df)
  )
}
I have tried the above but receive the error "no applicable method for 'predict' applied to an object of class "character"", with the last word changing as I change the class of the predicted object, e.g. predict(as.numeric(i), df).
Any ideas? Ideally I could vectorize this as well.
You should rarely have to use assign(), and $ should not be used with computed variable names. The [[]] operator is better for dynamic subsetting than $. And it would be easier if you just made a list of the models rather than just their names. Here's an example:
df <- data.frame(x = runif(30), y = runif(30), w = runif(30), z = runif(30))
a <- lm(y ~ z, df) # estimate model a
b <- lm(w ~ x, df) # estimate model b
models <- list(a = a, b = b) # create a named list of model objects
# 1) for loop
for (m in names(models)) {
df[[m]] <- predict(models[[m]], df)
}
Or rather than a for loop, you could generate all the values with Map and then append them with cbind afterwards:
# 2) Map/cbind
df <- cbind(df, Map(function(m) predict(m,df), models))
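After either approach (run against the original df), the predictions end up as ordinary columns, for example:
names(df)
#[1] "x" "y" "w" "z" "a" "b"
head(df$a)  # predicted values from model a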

R: fit a dynamic number of explanatory variables in a polynomial regression

Suppose I am given a data frame df at runtime. How do I fit a polynomial regression model in which each predictor is a column from df, with a constant degree k >= 2?
The difficulty is that df is read at runtime, so the number and names of its columns are unknown when the script is written (but I do know the response variable is the 1st column). So when I call lm I do not know how to write the formula.
In the case of k = 1, I can simply write a generic linear formula:
names(df)[1] <- "y"
lm(y ~ ., data = df)
Is there something similar I can do for a polynomial formula?
One rather convoluted way is to create a formula for the lm regression call by pasting the terms together.
# some data
dat <- data.frame(replicate(10, rnorm(20)))
# Create formula - apply f function to all columns names excluding the first
form <- formula(paste(names(dat)[1], " ~ ",
                      paste0("poly(", names(dat)[-1], ", 2)", collapse="+")))
# run regression
lm(form, data = dat)
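As an alternative sketch, not part of the original answer, reformulate() builds the same kind of formula and makes it easy to take the degree from a variable only known at run time:
k <- 2                                   # example degree, assumed known at run time
form2 <- reformulate(paste0("poly(", names(dat)[-1], ", ", k, ")"),
                     response = names(dat)[1])
lm(form2, data = dat)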

Get Index of variables from stepAIC

I am regressing a gene on a subset of other genes. Then I use stepAIC to reduce the number of explanatory genes. How do I get the index of the NON-omitted variables, so that I can analyse them?
gene_subset=c(y=genes[,i], genes[,other_genes]);
reduced_model=stepAIC(y~.,data=gene_subset,trace=false);
Here is one solution that I got from the r-help mailing list; any more efficient approaches would be welcome.
# create example data frame
library(MASS)  # for stepAIC
y <- rnorm(30)
gene_subset <- data.frame(y, x1=rnorm(30), x2=rnorm(30), x3=100*y+rnorm(30))
# fit a full linear model
fit <- lm(y ~ ., data = gene_subset)
# reduce the model
reduced_model <- stepAIC(fit, trace=FALSE)
# NON-omitted variables (excluding the response)
keepx <- names(reduced_model$model)[-1]
index <- match(keepx, names(gene_subset))
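An equivalent way to pull the retained predictors, not from the r-help solution but using the terms of the reduced model, is:
# same information via the terms object of the reduced model
keepx2 <- attr(terms(reduced_model), "term.labels")
index2 <- match(keepx2, names(gene_subset))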
