I am trying to do a bootstrap regression by resampling X and Y from the original sample.
I followed a more manual approach (without using any package).
This is my work so far:
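library(dplyr)  # provides %>% and sample_n()
set.seed(326581)
X1 <- rnorm(10, 0, 1)
Y1 <- rnorm(10, 0, 2)
data <- data.frame(X1, Y1)
# draw 100 bootstrap samples of 10 rows each, with replacement
lst <- replicate(
  100,
  data %>% sample_n(10, replace = TRUE),
  simplify = FALSE)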
The list contains 100 samples, where each sample has 2 columns (X1, Y1) and a sample size of 10. These are the bootstrap samples.
To get the bootstrap residuals, I separated the X and Y columns into two separate data frames as follows:
new1=data.frame(lapply(lst, `[`, 'X1'))
new2=data.frame(lapply(lst, `[`, 'Y1'))
After that I tried to store the residuals from each fitted model using the following code:
res=c()
for(i in 1:100)
{
res[i]=residuals(lm(new2[,i]~new1[,i]))
}
But it seems like something is wrong. Can anyone help me figure out what?
By the way, is there an easier approach than this?
You're making this unnecessarily complicated. The whole advantage of storing objects in a list is that you can easily loop through them with e.g. lapply or sapply.
So, for example, to store the residuals of a linear model fit you can do
res <- lapply(lst, function(df) residuals(lm(Y1 ~ X1, data = df)))
This fits a linear model of the form lm(Y1 ~ X1) to every data.frame in lst, and stores the residuals in a list of 100 vectors (one vector of 10 residuals per bootstrap sample).
length(res)
#[1] 100
You could also store the residuals from the lm fits to all 100 sampled data.frames in a 10x100 matrix (one column per sample) by using sapply instead of lapply
res <- sapply(lst, function(df)
residuals(lm(Y1 ~ X1, data = df)))
dim(res)
#[1] 10 100
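As for why the loop in the question misbehaves: `res[i] <- residuals(...)` tries to put a vector of 10 residuals into a single element of an atomic vector, so R keeps only the first value and issues a warning. If you prefer an explicit loop, a minimal sketch is below. It rebuilds the bootstrap samples with base R's sample() instead of dplyr (my substitution, to keep the example self-contained) and stores each residual vector in a list element with `[[`.

```r
# Made-up data mirroring the question (base R only)
set.seed(326581)
data <- data.frame(X1 = rnorm(10, 0, 1), Y1 = rnorm(10, 0, 2))

# 100 bootstrap samples: resample row indices with replacement
lst <- replicate(
  100,
  data[sample(nrow(data), replace = TRUE), ],
  simplify = FALSE
)

# Preallocate a list; each element will hold a vector of 10 residuals
res <- vector("list", length(lst))
for (i in seq_along(lst)) {
  res[[i]] <- residuals(lm(Y1 ~ X1, data = lst[[i]]))
}
```

The result has the same shape as the lapply version: a list of 100 vectors of length 10.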
Update
In response to your comment, you can do the following.
First, calculate and store the residuals and the residual-derived weights in every data.frame in the list.
# Add residuals and weights to lst
lst <- lapply(lst, function(df) {
  df$res <- residuals(lm(Y1 ~ X1, data = df))
  df$weights <- 1 / fitted(lm(abs(res) ~ X1, data = df))^2
  df
})
Then run a weighted linear regression and return the second (slope) coefficient
# Return 2nd coefficient of weighted regression
coeff <- lapply(lst, function(df)
coefficients(lm(Y1 ~ X1, data = df , weights = weights))[2])
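If you would rather have the slopes as a plain numeric vector than a list, something along these lines should work. This is only a sketch: `wls_slope` is a hypothetical helper name of mine, and the samples are rebuilt with base R so the example runs on its own.

```r
set.seed(326581)
data <- data.frame(X1 = rnorm(10, 0, 1), Y1 = rnorm(10, 0, 2))
lst <- replicate(100, data[sample(nrow(data), replace = TRUE), ],
                 simplify = FALSE)

# Same two steps as above, wrapped in one (hypothetical) helper
wls_slope <- function(df) {
  df$res <- residuals(lm(Y1 ~ X1, data = df))
  df$weights <- 1 / fitted(lm(abs(res) ~ X1, data = df))^2
  coef(lm(Y1 ~ X1, data = df, weights = weights))[["X1"]]
}

# vapply() returns a numeric vector with one slope per bootstrap sample
slopes <- vapply(lst, wls_slope, numeric(1))
summary(slopes)
```

Selecting the coefficient by name ("X1") rather than by position avoids picking up the wrong term if the formula changes later.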
Related
I am going to run dozens of regressions of different Ys on the same X. I want to store the coefficients and standard errors of each regression in a single data frame.
The dataframe is like
Y1, Y2, Y3, ... , Y50, X
1, 2, 3, ..., 50, 1
...
I can do it like this for each Y
model1 <- lm (Y1~X, data = data)
summary1 <- summary(model1)
list1 <- list(coef=summary1$coefficients[2,1],se=summary1$coefficients[2,2])
# only coef of X is of interest
And then generate the dataframe I want by
df <- as.data.frame(list1,list2,...,list50)
I am a rookie. Is there a neater way to do this in R? I tried to write a function that takes the variable name as input, but it fails if I define it as function(variable) and use variable directly inside the function.
Thank you so much for your inspiration.
You can try using lapply to loop over the Y variables.
# value = TRUE returns the column names rather than their positions
cols <- grep('Y\\d+', names(data), value = TRUE)
df <- do.call(rbind, lapply(cols, function(y) {
  model <- lm(reformulate('X', response = y), data = data)
  s <- summary(model)
  data.frame(coef = s$coefficients[2, 1],
             se = s$coefficients[2, 2])
}))
df
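To try the idea end to end, here is a small self-contained sketch on made-up data (three responses instead of fifty; all column values are invented). It builds each formula from the response name with reformulate() and also keeps the response name in the output so the rows can be told apart.

```r
set.seed(1)
# Hypothetical data: three responses and one shared predictor
data <- data.frame(X = rnorm(30))
data$Y1 <- 1 + 2 * data$X + rnorm(30)
data$Y2 <- -1 + 0.5 * data$X + rnorm(30)
data$Y3 <- rnorm(30)

# Names of the response columns
resp <- grep("^Y\\d+$", names(data), value = TRUE)

rows <- lapply(resp, function(y) {
  model <- lm(reformulate("X", response = y), data = data)
  # Pick the X row of the coefficient table by name
  est <- coef(summary(model))["X", c("Estimate", "Std. Error")]
  data.frame(response = y,
             coef = est[["Estimate"]],
             se = est[["Std. Error"]])
})
df <- do.call(rbind, rows)
df
```

Indexing the coefficient table by row and column name is a little more robust than `[2, 1]` and `[2, 2]` if the model ever gains extra terms.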
I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would then like to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should select the first column, then the 2nd, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use a for loop as follows, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
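For what it's worth, the error in the question arises because `FAA_unique_with_duration_filtered[x]` with single brackets returns a one-column data frame (a list), not a vector, and lm() cannot use that as a variable. One way around it, sketched here on anscombe, is to build each formula from the column name with reformulate(); the `predictors` vector is just my choice for this example.

```r
a <- anscombe
predictors <- c("x1", "x2", "x3", "x4")

# Build y1 ~ x1, y1 ~ x2, ... from the column names and keep the slope
coefs <- vapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "y1"), data = a)
  coef(fit)[[p]]
}, numeric(1))

signs <- sign(coefs)
```

Because vapply() keeps the names of a character input, `coefs` and `signs` end up labelled by predictor.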
I'm new to R but am slowly learning it to analyse a data set.
Let's say I have a data frame which contains 8 variables and 20 observations. Of the 8 variables, V1 - V3 are predictors and V4 - V8 are outcomes.
B <- matrix(1:160, nrow = 20, ncol = 8)
df <- as.data.frame(B)
Using the car package, the code to fit a simple linear regression and display the summary and confidence intervals is:
fit <- lm(V4 ~ V1, data = df)
summary(fit)
confint(fit)
How can I write code (loop or apply) so that R regresses each predictor on each outcome individually and extracts the coefficients and confidence intervals? I realise I'm probably trying to run before I can walk but any help would be really appreciated.
You could wrap your lines in a lapply call and train a linear model for each of your predictors (excluding the target, of course).
my.target <- 4
my.predictors <- (1:8)[-my.target]  # parentheses needed: `[` binds tighter than `:`
lapply(my.predictors, (function(i){
fit <- lm(df[,my.target] ~ df[,i])
list(summary= summary(fit), confint = confint(fit))
}))
You obtain a list of lists.
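If you want every predictor paired with every outcome (V1 to V3 against V4 to V8, as described in the question) and the results in one table, a sketch along these lines may help. It uses random made-up data, since the 1:160 matrix gives perfectly collinear columns, and the names `pairs`, `fits`, and `results` are mine.

```r
set.seed(42)
# Hypothetical data: V1-V3 predictors, V4-V8 outcomes, 20 observations
df <- as.data.frame(matrix(rnorm(160), nrow = 20, ncol = 8))
predictors <- paste0("V", 1:3)
outcomes <- paste0("V", 4:8)

# One row per predictor/outcome combination
pairs <- expand.grid(predictor = predictors, outcome = outcomes,
                     stringsAsFactors = FALSE)

fits <- Map(function(p, o) {
  fit <- lm(reformulate(p, response = o), data = df)
  ci <- confint(fit)[p, ]  # 95% interval for the slope
  data.frame(outcome = o, predictor = p,
             estimate = coef(fit)[[p]],
             lower = ci[[1]], upper = ci[[2]])
}, pairs$predictor, pairs$outcome)

results <- do.call(rbind, unname(fits))
rownames(results) <- NULL
results
```

This gives 15 rows (3 predictors times 5 outcomes), each with the slope estimate and its confidence limits.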
So, the code in my own data that returns the error is:
my.target <- metabdata[c(34)]
my.predictors <- metabdata[c(18 : 23)]
lapply(my.predictors, (function(i){
fit <- lm(metabdata[, my.target] ~ metabdata[, i])
list(summary = summary(fit), confint = confint(fit))
}))
Returns:
Error: Unsupported index type: tbl_df
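A likely cause, assuming metabdata is a tibble: `metabdata[c(34)]` is itself a one-column tibble, not a column position, so using it inside `metabdata[, my.target]` passes a table as an index. Likewise, lapply over `metabdata[c(18 : 23)]` hands the function whole columns rather than positions. A sketch of one fix is to work with column names and `[[`. The data below are a made-up stand-in, since metabdata is not available here; in your data the names would presumably come from `names(metabdata)[34]` and `names(metabdata)[18:23]`.

```r
set.seed(7)
# Hypothetical stand-in for metabdata: 6 predictor columns and one target
metab <- as.data.frame(matrix(rnorm(40 * 7), nrow = 40))
names(metab) <- c(paste0("p", 1:6), "target")

target <- "target"              # a column name, not a sub-table
predictors <- paste0("p", 1:6)  # likewise names (integer positions also work)

out <- lapply(predictors, function(p) {
  # [[ extracts the column as a plain vector for data frames and tibbles alike
  fit <- lm(metab[[target]] ~ metab[[p]])
  list(summary = summary(fit), confint = confint(fit))
})
names(out) <- predictors
```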
Can someone please help me run multiple GAMM models in a for loop or with lapply? I have a set of 10 response and 20 predictor variables in a large data frame, arranged in columns.
I'd like to fit a GAMM for each predictor-response combination, and summarize the coefficients and significance tests in a table.
models<-gamm(AnimalCount ~ s(temperature), data=dat,family=poisson(link=log) , random=list(Province=~1) )
I think one way to do this is to create a "matrix" list where the number of rows and columns corresponds to the number of responses (i) and predictors (j), respectively. Then you can store each model result in the cell[i, j]. Let me illustrate:
## make up some data
library(mgcv)
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
Now the made-up data has 2 responses (y, y1) and 8 predictors (x0 ~ x7). I think you can simplify the process by storing the responses and predictors in separate data frames:
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
## create a "matrix" list of dimensions i x j
results_m <- vector("list", length=ncol(d_resp)*ncol(d_pred))
dim(results_m) <- c(ncol(d_resp), ncol(d_pred))
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(d_pred)){
results_m[i, j][[1]] <- gamm(d_resp[, i] ~ s(d_pred[, j]))
}
}
# flatten the "matrix" list
results_l <- do.call("list", results_m)
You can use sapply/lapply to create a data frame that summarizes coefficients, etc. Say you want to extract the fixed-effect intercepts and slopes and store them in a data frame:
data.frame(t(sapply(results_l, function(l) l$lme$coef$fixed)))
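The "matrix" list idea is independent of the model type. Here is a minimal sketch with lm() standing in for gamm() (my substitution, purely to keep the example free of package dependencies) and made-up data. It also sets dimnames, so a cell can be looked up by response and predictor name.

```r
set.seed(0)
# Hypothetical data: 2 responses, 3 predictors
n <- 50
d_pred <- data.frame(x0 = runif(n), x1 = runif(n), x2 = runif(n))
d_resp <- data.frame(y = 2 * d_pred$x0 + rnorm(n), y1 = rnorm(n))

# List with a dim attribute: one cell per response/predictor pair
results_m <- vector("list", ncol(d_resp) * ncol(d_pred))
dim(results_m) <- c(ncol(d_resp), ncol(d_pred))
dimnames(results_m) <- list(names(d_resp), names(d_pred))

for (i in seq_len(ncol(d_resp))) {
  for (j in seq_len(ncol(d_pred))) {
    # lm() stands in for gamm() here
    results_m[[i, j]] <- lm(d_resp[[i]] ~ d_pred[[j]])
  }
}

# Cells can be addressed by name
slope <- coef(results_m[["y", "x0"]])[[2]]

# One row of coefficients per model, in column-major order of the cells
coef_tab <- t(sapply(results_m, coef))
```

With gamm() the loop body and the extraction step would change, but the storage pattern stays the same.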
I want to use lm() in R to fit a series of (actually 93) separate linear regressions. According to the R lm() help manual:
"If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix."
This works fine as long as there are no missing data points in the Y response matrix. When there are missing points, instead of fitting each regression with the available data, every row that has a missing data point in any column is discarded. Is there any way to specify that lm() should fit all of the columns in Y independently and not discard rows where an individual column has a missing data point?
If you want to run n separate regressions of Y1, Y2, ..., Yn on X, don't pass a matrix response to lm(); use R's apply functions instead, so that each fit drops only the rows missing for its own column:
# create the response matrix and set some random values to NA
values <- runif(50)
values[sample(1:length(values), 10)] <- NA
Y <- data.frame(matrix(values, ncol=5))
colnames(Y) <- paste0("Y", 1:5)
# single regression term
X <- runif(10)
# create regression between each column in Y and X
lms <- lapply(colnames(Y), function(y) {
form <- paste0(y, " ~ X")
lm(form, data=Y)
})
# lms is a list of lm objects, can access them via [[]] operator
# or work with it using apply functions once again
sapply(lms, function(x) {
summary(x)$adj.r.squared
})
#[1] -0.06350560 -0.14319796 0.36319518 -0.16393125 0.04843368
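To see the difference in how missing values are handled, here is a small sketch on made-up data that compares the number of observations used by a matrix-response fit with the number used by separate per-column fits.

```r
set.seed(123)
X <- runif(10)
Y <- matrix(rnorm(30), ncol = 3,
            dimnames = list(NULL, paste0("Y", 1:3)))
Y[1, 1] <- NA  # missing value in Y1 only
Y[2, 2] <- NA  # missing value in Y2 only

# Matrix response: rows 1 and 2 are dropped for every column
fit_all <- lm(Y ~ X)
nobs(fit_all)
# 8

# Separate fits: each column keeps every row where it is observed
fits <- lapply(colnames(Y), function(y) lm(Y[, y] ~ X))
sapply(fits, nobs)
# 9 9 10
```

So Y3, which has no missing values, loses two rows in the matrix-response fit but none when fitted on its own.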