Regression loops [duplicate] - r

Hi, I have several columns that represent scores. I want to estimate one simple model per SCORE column, each with that SCORE as a function of STUDYTIME. Then I want to store the STUDYTIME coefficients in a new column whose row names are the SCORE column names. Finally, I am not sure how to cluster the linear models, because each STUDENT appears in the data twice.
Here is my replicating example. This is the data I have now:
df <- data.frame(replicate(5, rnorm(10)))
df[1]<-c(1,1,2,2,3,3,4,4,5,5)
colnames(df) <- c('student','studytime', 'score1','score2','score3')
This is my attempt at the coding:
for (i in 1:nrow(df)) {
dfx <- df[,i]
lm <- lm(dfx[,3:5] ~ study_time)
resdat[,i] = summary(lm)$coefficients[2]
}

You can do this with just lapply and sapply.
Here is the R code:
Generating Data
df <- data.frame(replicate(5, rnorm(10)))
df[1]<-c(1,1,2,2,3,3,4,4,5,5)
colnames(df) <- c('student','studytime', 'score1','score2','score3')
Storing Results
Results <- lapply(df[, -c(1,2)], FUN = function(x) lm(x ~ df$studytime))
Coef <- sapply(Results, FUN = coefficients)
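Coef above is a matrix with one column per score and two rows (intercept and slope). If you only want the studytime slopes, stored in one column with the score names as row names, something like the following sketch may help. The names score_cols, slopes, slope_tab and mfit are made up for this illustration, and the set.seed call is only there to make the example reproducible. The last lines cross-check against a single lm call with a multi-column left-hand side.

```r
# Same toy data as above, with a seed so the numbers are reproducible
set.seed(1)
df <- data.frame(replicate(5, rnorm(10)))
df[1] <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
colnames(df) <- c('student', 'studytime', 'score1', 'score2', 'score3')

# Pick the response columns by name (assumes they all start with "score")
score_cols <- grep("^score", names(df), value = TRUE)

# One simple regression per score; keep only the studytime slope
slopes <- sapply(score_cols, function(v) {
  fit <- lm(reformulate("studytime", response = v), data = df)
  coef(fit)[["studytime"]]
})

# One column of slopes; row names are the score column names
slope_tab <- data.frame(studytime_coef = slopes)

# Cross-check: lm() also accepts a matrix of responses on the left-hand side
mfit <- lm(cbind(score1, score2, score3) ~ studytime, data = df)
all.equal(slopes, coef(mfit)["studytime", ])
```

Building the formula with reformulate and passing data = df keeps the coefficient named studytime, which makes it easier to pull out by name than by position.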

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks one predictor variable from the data frame and fits a model. I also want to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should select the first column, then the 2nd, and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use an explicit loop, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
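If you do want to keep a loop close to the one in the question, two things need to change. Indexing a data frame with single brackets, as in df[x], returns a one-column data frame (a list), which is what the "invalid type (list)" error is complaining about; df[[x]] returns the column itself. Also, append returns a new vector rather than modifying its argument, so its result has to be assigned back. A minimal sketch on anscombe, assuming the same setup as above (y1 regressed on each of the first 4 columns):

```r
a <- anscombe

coefs <- numeric(0)
signs <- numeric(0)

for (x in 1:4) {
  # a[[x]] is a numeric vector; a[x] would be a one-column data frame
  fit <- lm(a$y1 ~ a[[x]])
  slope <- coef(fit)[[2]]          # second coefficient, i.e. not the intercept

  # append() does not modify in place, so assign the result back
  coefs <- append(coefs, slope)
  signs <- append(signs, sign(slope))
}
```

Preallocating with numeric(4) and filling by index, as in the loop above, avoids growing the vectors and is usually the tidier choice when the number of predictors is known.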

regarding residual storing in bootstrap regression in R

I am trying to do bootstrap regression by resampling X and Y from the original sample.
I followed a more manual approach (without using a dedicated bootstrap package).
This is my work so far:
library(dplyr)  # provides %>% and sample_n()
set.seed(326581)
X1=rnorm(10,0,1)
Y1=rnorm(10,0,2)
data=data.frame(X1,Y1)
lst <- replicate(
100,
data %>% sample_n(10, replace = TRUE),
simplify = FALSE)
The list contains 100 samples, where each sample has 2 columns (X1, Y1) and a sample size of 10. These are the bootstrap samples.
To get the bootstrap residuals, I separated the X and Y columns into two separate data frames as follows:
new1=data.frame(lapply(lst, `[`, 'X1'))
new2=data.frame(lapply(lst, `[`, 'Y1'))
After that I tried to store the residuals from each fitted model using the following code:
res=c()
for(i in 1:100)
{
res[i]=residuals(lm(new2[,i]~new1[,i]))
}
But it seems like something is wrong. Can anyone help me figure out what?
By the way, is there an easier approach than this?
You're making this more complicated than it needs to be. The main advantage of storing objects in a list is that you can easily loop over them with, e.g., lapply or sapply.
So for example, to store the residuals of a linear model fit you can do
res <- lapply(lst, function(df) residuals(lm(Y1 ~ X1, data = df)))
This fits a linear model of the form lm(Y1 ~ X1) to all data.frames in lst, and stores the residuals in a list of 100 vectors
length(res)
#[1] 100
You could also store residuals based on an lm fit to all 100 sampled data.frames in a 10x100 matrix by using sapply instead of lapply
res <- sapply(lst, function(df)
residuals(lm(Y1 ~ X1, data = df)))
dim(res)
#[1] 10 100
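As for what goes wrong in the original loop: residuals returns a vector of length 10, and res[i] can only hold a single element, so R keeps just the first value and warns about the rest. Storing each vector in a list element with res[[i]] fixes that. Here is a self-contained sketch in base R, assuming you are happy to draw the bootstrap rows with sample instead of dplyr's sample_n; the name res_mat is made up for this example.

```r
set.seed(326581)
data <- data.frame(X1 = rnorm(10, 0, 1), Y1 = rnorm(10, 0, 2))

# 100 bootstrap samples: resample the rows of data with replacement
lst <- replicate(100,
                 data[sample(nrow(data), replace = TRUE), ],
                 simplify = FALSE)

# Loop version: one list element per bootstrap sample
res <- vector("list", length(lst))
for (i in seq_along(lst)) {
  res[[i]] <- residuals(lm(Y1 ~ X1, data = lst[[i]]))
}

# Same residuals as a 10 x 100 matrix, one column per bootstrap sample
res_mat <- sapply(lst, function(d) residuals(lm(Y1 ~ X1, data = d)))
dim(res_mat)
```

Keeping X1 and Y1 together in each sampled data frame also means there is no need to split them into new1 and new2 first.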
Update
In response to your comment you can do the following
First calculate and store residuals and residual-derived weights in every data.frame in the list.
# Add residuals and weights to lst
lst <- lapply(lst, function(df) {
df$res <- residuals(lm(Y1 ~ X1, data = df));
df$weights <- 1 / fitted(lm(abs(res) ~ X1, data = df))^2;
df;
})
Then run a weighted linear regression and return the second (slope) coefficient
# Return 2nd coefficient of weighted regression
coeff <- lapply(lst, function(df)
coefficients(lm(Y1 ~ X1, data = df , weights = weights))[2])

Running multiple GAMM models using for loop or lapply

Can someone please help me run multiple GAMM models in a for loop or with lapply? I have a set of 10 response and 20 predictor variables, arranged in columns in a large data frame.
I'd like to fit a GAMM for each predictor-response combination, and summarize their coefficients and significance tests in a table.
models<-gamm(AnimalCount ~ s(temperature), data=dat,family=poisson(link=log) , random=list(Province=~1) )
I think one way to do this is to create a "matrix" list where the number of rows and columns corresponds to the number of responses (i) and predictors (j), respectively. Then you can store each model result in the cell[i, j]. Let me illustrate:
## make up some data
library(mgcv)
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
Now the made-up data has 2 responses (y, y1) and 8 predictors (x0 to x7). I think you can simplify the process by storing the responses and predictors in separate data frames:
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
## create a "matrix" list of dimensions i x j
results_m <- vector("list", length=ncol(d_resp)*ncol(d_pred))
dim(results_m) <- c(ncol(d_resp), ncol(d_pred))
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(d_pred)){
results_m[i, j][[1]] <- gamm(d_resp[, i] ~ s(d_pred[, j]))
}
}
# flatten the "matrix" list
results_l <- do.call("list", results_m)
You can use sapply/lapply to create a data frame that summarizes coefficients, etc. Say you want to extract the fixed-effect intercepts and slopes and store them in a data frame.
data.frame(t(sapply(results_l, function(l) l$lme$coef$fixed)))
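One drawback of the matrix-list approach is that the fitted models lose track of which response and predictor they belong to. A possible variation, sketched below under some assumptions, builds each formula from the column names so that every row of the summary table is labelled. To keep it quick it uses only two predictors, and the second response y1 is simply y plus extra noise, invented for the example. The names combos, fits and tab are also made up. The table pulls the effective degrees of freedom and the p-value of the smooth term from summary of the gam part of each fit; no random effect is specified here, as in the loop above.

```r
library(mgcv)

set.seed(0)
dat <- gamSim(1, n = 200, scale = 2, verbose = FALSE)
d <- dat[, c("y", "x0", "x1")]
d$y1 <- d$y + rnorm(200)   # hypothetical second response, for illustration only

# Every response-predictor pair, kept as character names
combos <- expand.grid(resp = c("y", "y1"), pred = c("x0", "x1"),
                      stringsAsFactors = FALSE)

# Fit one model per pair, building the formula from the names
fits <- Map(function(resp, pred) {
  gamm(as.formula(paste0(resp, " ~ s(", pred, ")")), data = d)
}, combos$resp, combos$pred)
names(fits) <- paste(combos$resp, combos$pred, sep = "~")

# One row per model: edf and p-value of the (single) smooth term
tab <- do.call(rbind, lapply(fits, function(m) {
  s <- summary(m$gam)$s.table
  data.frame(edf = s[1, "edf"], p_value = s[1, "p-value"])
}))
tab <- cbind(combos, tab)
tab
```

For the model in the question you would add family = poisson(link = log) and random = list(Province = ~1) to the gamm call; I have not tested that combination on your data.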

Get Index of variables from stepAIC

I am regressing a gene on another gene subset. Then I use stepAIC to reduce the number of explanatory genes. How do I get the index of the NON-omitted variables, so that I could analyse them?
gene_subset=c(y=genes[,i], genes[,other_genes]);
reduced_model=stepAIC(y~.,data=gene_subset,trace=false);
Here is one solution that I got from the r-help mailing list. Any other, more efficient ways would be welcome.
library(MASS)  # provides stepAIC()
# create example data frame
y <- rnorm(30)
gene_subset <- data.frame(y, x1=rnorm(30), x2=rnorm(30), x3=100*y+rnorm(30))
# fit a full linear model
fit <- lm(y ~ ., gene_subset)
# reduce the model
reduced_model <- stepAIC(fit, trace=FALSE)
# NON-omitted variables (excluding the response)
keepx <- names(reduced_model$model)[-1]
index <- match(keepx, names(gene_subset))
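Another way to get the retained variables, sketched here as an alternative rather than as the r-help solution, is to read the term labels from the reduced model's terms object instead of the model frame. This assumes the formula contains only plain column names (no interactions or transformations), so that each term label matches a column of gene_subset. The seed is only there to make the example reproducible.

```r
library(MASS)  # provides stepAIC()

set.seed(42)
y <- rnorm(30)
gene_subset <- data.frame(y, x1 = rnorm(30), x2 = rnorm(30),
                          x3 = 100 * y + rnorm(30))

# Full model, then stepwise reduction by AIC
fit <- lm(y ~ ., data = gene_subset)
reduced_model <- stepAIC(fit, trace = FALSE)

# Names of the predictors that survived the reduction
keepx <- attr(terms(reduced_model), "term.labels")

# Their column positions in the original data frame
index <- match(keepx, names(gene_subset))
index
```

Since x3 is almost a rescaled copy of y in this toy data, it should always be among the retained variables.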

Linear Regression in R with variable number of explanatory variables [duplicate]

I have a vector of Y values and a matrix of X values that I want to perform a multiple regression on (i.e. Y = X[column 1] + X[column 2] + ... X[column N])
The problem is that the number of columns in my matrix (N) is not prespecified. I know in R, to perform a linear regression you have to specify the equation:
fit = lm(Y~X[,1]+X[,2]+X[,3])
But how do I do this if I don't know how many columns are in my X matrix?
Thanks!
Three ways, in increasing level of flexibility.
Method 1
Run your regression using the matrix notation; lm accepts a matrix on the right-hand side and uses every column:
fit <- lm( Y ~ X )
Method 2
Put all your data in one data.frame, not two:
dat <- cbind(data.frame(Y=Y),as.data.frame(X))
Then run your regression using the formula notation:
fit <- lm( Y~. , data=dat )
Method 3
Another way is to build the formula yourself:
model1.form.text <- paste("Y ~",paste(xvars,collapse=" + "),collapse=" ")
model1.form <- as.formula( model1.form.text )
model1 <- lm( model1.form, data=dat )
In this example, xvars is a character vector containing the names of the variables you want to use.
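To see that the three approaches agree, here is a small self-contained sketch. The data are simulated, N and the column names V1, V2, ... are made up, and Method 3 uses reformulate, which builds the same kind of formula as the paste/as.formula version above.

```r
set.seed(1)
N <- 4                                   # number of columns, not known in advance
X <- matrix(rnorm(20 * N), ncol = N,
            dimnames = list(NULL, paste0("V", seq_len(N))))
Y <- rnorm(20)

# Method 1: pass the matrix directly
fit1 <- lm(Y ~ X)

# Method 2: one data frame, then the dot shorthand
dat <- cbind(data.frame(Y = Y), as.data.frame(X))
fit2 <- lm(Y ~ ., data = dat)

# Method 3: build the formula from a character vector of names
xvars <- setdiff(names(dat), "Y")
fit3 <- lm(reformulate(xvars, response = "Y"), data = dat)

# Same estimates either way (coefficient names differ for Method 1)
all.equal(unname(coef(fit1)), unname(coef(fit2)))
all.equal(coef(fit2), coef(fit3))
```

Method 3 is the one to reach for when you only want a subset of the columns, since xvars can be any selection of names.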
