I'm new to R but am slowly learning it to analyse a data set.
Let's say I have a data frame which contains 8 variables and 20 observations. Of the 8 variables, V1 - V3 are predictors and V4 - V8 are outcomes.
B <- matrix(1:160,
nrow = 20,
ncol = 8)
df <- as.data.frame(B)
With the car package loaded, the code to fit a simple linear regression and display its summary and confidence intervals is:
fit <- lm(V4 ~ V1, data = df)
summary(fit)
confint(fit)
How can I write code (a loop or apply) so that R regresses each outcome on each predictor individually and extracts the coefficients and confidence intervals? I realise I'm probably trying to run before I can walk, but any help would be really appreciated.
You could wrap your lines in a lapply call and train a linear model for each of your predictors (excluding the target, of course).
my.target <- 4
my.predictors <- (1:8)[-my.target]  # parentheses needed: 1:8[-my.target] parses as 1:(8[-my.target])
lapply(my.predictors, (function(i){
fit <- lm(df[,my.target] ~ df[,i])
list(summary= summary(fit), confint = confint(fit))
}))
You obtain a list of lists.
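If the goal is one row per predictor/outcome pair rather than a nested list, something along these lines might work. It is only a sketch: the data are simulated as a stand-in for the question's data frame, and the names predictors, outcomes, fit_one and results are made up for illustration.

```r
set.seed(1)
# Simulated stand-in for the question's data: V1-V3 predictors, V4-V8 outcomes
df <- as.data.frame(matrix(rnorm(160), nrow = 20, ncol = 8))

predictors <- paste0("V", 1:3)
outcomes   <- paste0("V", 4:8)

# Every predictor/outcome combination (15 pairs)
pairs <- expand.grid(predictor = predictors, outcome = outcomes,
                     stringsAsFactors = FALSE)

# Fit one simple regression and keep the slope with its 95% interval
fit_one <- function(predictor, outcome) {
  fit <- lm(reformulate(predictor, response = outcome), data = df)
  ci  <- confint(fit)[predictor, ]
  data.frame(outcome = outcome, predictor = predictor,
             estimate = unname(coef(fit)[predictor]),
             lower = ci[[1]], upper = ci[[2]])
}

results <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
  fit_one(pairs$predictor[k], pairs$outcome[k])
}))
results
```

Building the formula with reformulate() keeps the variable names in the model output, which df[, i] inside the formula does not.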
The code on my own data that returns the error is:
my.target <- metabdata[c(34)]
my.predictors <- metabdata[c(18 : 23)]
lapply(my.predictors, (function(i){
fit <- lm(metabdata[, my.target] ~ metabdata[, i])
list(summary = summary(fit), confint = confint(fit))
}))
Returns:
Error: Unsupported index type: tbl_df
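The error most likely comes from my.target and my.predictors holding tibbles (metabdata[c(34)] is a one-column data frame) rather than column positions, so metabdata[, my.target] is indexed by a tibble. In addition, lapply() over a data frame passes each column's values as i, not its position. A possible fix is to store column names instead. The sketch below uses mtcars as a stand-in for metabdata, and the column positions are placeholders.

```r
metab <- mtcars                          # stand-in for metabdata (assumption)
target_name     <- names(metab)[1]       # in the real data: names(metabdata)[34]
predictor_names <- names(metab)[3:6]     # in the real data: names(metabdata)[18:23]

# One model per predictor, built from column names rather than data frames
fits <- lapply(predictor_names, function(p) {
  fit <- lm(reformulate(p, response = target_name), data = metab)
  list(summary = summary(fit), confint = confint(fit))
})
names(fits) <- predictor_names

fits$hp$confint
```

If the real column names are not syntactically valid R names, they may need to be wrapped in backticks before being passed to reformulate().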
I am trying to use the apply family of functions to create multiple linear models. My real data calls for 32 linear models built from 30 predictor variables. Here I am using the dummy data mtcars with only 5 predictors and 5 response variables, as seen below (my actual data has about 30 of each). I want to use lapply() to create the models, then feed each individual model separately into a stepAIC function, and then feed those models separately into a stepVIF function. In the end I want to output a list of all of the model results after the stepVIF function.
The beginnings of my code are below. I believe my major issue currently is the multiple NAs in my data. I only wish to remove NAs from the predictor columns when that model runs, not from the entire data frame at the beginning, because that removes some data that can still be used. I can get this to work if I input one response variable at a time and copy and paste the code over and over.
n=6
full.model <- lapply(1:n, function(p) lm((p)~., cars_data[, c(p, 7:11)])) #Fit full lm model to the last 5 variables in mt cars and create a model for the first 6 variables in mtcars
step.model <- stepAIC(full.model, direction = "both", trace = FALSE) #Fit stepwise regression model
stepmod3 <- stepVIF(model = step.model, threshold = 10, verbose = TRUE)
I am currently running into a "variable lengths differ" error.
This should do it:
n=6
#Fit full lm model to the last 5 variables in mt cars and create a model for the first 6 variables in mtcars
full.model <- lapply(1:n, function(p) lm(reformulate(names(mtcars)[7:11], response=names(mtcars)[p]), mtcars))
step.model <- lapply(full.model, function(x)MASS::stepAIC(x, direction = "both", trace = FALSE)) #Fit stepwise regression model
stepmod3 <- lapply(step.model, function(x)pedometrics::stepVIF(model = x, threshold = 10, verbose = TRUE))
When missing data patterns are different.
When missing data patterns differ and you only want to delete the observations that have some missing data for that particular model, the easiest thing is to embed the data in the call. For example, here's what it would look like with the mtcars data:
library(dplyr)
n=6
#Fit full lm model to the last 5 variables in mt cars and create a model for the first 6 variables in mtcars
full.model <- lapply(1:n, function(p){
args <- list(
formula = reformulate(names(mtcars)[7:11], response=names(mtcars)[p]),
data = mtcars %>% select(all_of(c(p, 7:11))) %>% na.omit())  # all_of() makes the external positions explicit
do.call("lm", args)
})
step.model <- lapply(full.model, function(x)MASS::stepAIC(x, direction = "both", trace = FALSE)) #Fit stepwise regression model
stepmod3 <- lapply(step.model, function(x)pedometrics::stepVIF(model = x, threshold = 10, verbose = TRUE))
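A base-R variant of the same idea, in case dplyr is not wanted. This is only a sketch: cars_na is a copy of mtcars with a few artificial NAs added for illustration, and fit_complete is a made-up helper name.

```r
cars_na <- mtcars
cars_na$mpg[1:3] <- NA    # artificial missingness, for illustration only
cars_na$qsec[10] <- NA

# Keep only the columns this model uses, then drop incomplete rows
fit_complete <- function(p) {
  cols <- c(p, 7:11)
  dat  <- cars_na[complete.cases(cars_na[cols]), cols]
  args <- list(
    formula = reformulate(names(dat)[-1], response = names(dat)[1]),
    data = dat)
  do.call("lm", args)   # embeds the filtered data in the call
}

full.model <- lapply(1:6, fit_complete)

# Number of rows each model actually used
sapply(full.model, nobs)
```

Each model loses only the rows that are incomplete in its own columns, and the resulting list can be passed on to the stepAIC and stepVIF steps as above.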
I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and p-values) so that I can construct a figure summarizing the effect size of each predictor. I am aware of the debate about p-value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p-values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set a loop to repeat the model across the N predictors in my dataset and store the coefficients and P values in a dataframe.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a Vectorized function.
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
I'm trying to compute AIC scores for several different models in a for loop.
I have written a for loop that computes the log-likelihood for each model. However, I am stuck on writing the lm call so that it fits a model for each combination of my column LOGABUNDANCE with columns 4 to 11 of my data frame.
This is the code I have used so far, but it gives me the same AIC score for every model.
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
for(i in 1:10){
mod <- lm(LOGABUNDANCE ~ . , data = butterfly)
sigma = as.numeric(summary(mod)[6])
LL[i] <- sum(log(dnorm(butterfly$LOGABUNDANCE, predict(mod), sigma)))
AIC[i] <- -2*LL[i] + 2*(2)
}
You get the same AIC for every model because you create 10 identical models: the formula LOGABUNDANCE ~ . does not depend on i.
To make the code work, you need some way of changing the model in each iteration.
I can see two options:
1. Either subset the data at the start of each iteration so it only contains LOGABUNDANCE and one other variable (as suggested by @yacine-hajji in the comments), or
2. Create a vector of the variables you want to create models with, and use as.formula() together with paste0() to create a new formula for each iteration.
I think solution 2 is easier. Here is a working example of solution 2, using mtcars:
# AIC score for every model
LL <- rep(NA, 10)
AIC <- rep(NA, 10)
# Say I want to model all variables against `mpg`:
# Create a vector of all variable names except mpg
variables <- names(mtcars)[-1]
for(i in 1:10){
# Note how the formula is different in each iteration
mod <- lm(
as.formula(paste0("mpg ~ ", variables[i])),
data = mtcars
)
sigma = as.numeric(summary(mod)[6])
LL[i] <- sum(log(dnorm(mtcars$mpg, predict(mod), sigma)))
AIC[i] <- -2*LL[i] + 2*(2)
}
Output:
AIC
#> [1] 167.3716 168.2746 179.3039 188.8652 164.0947 202.6534 190.2124 194.5496
#> [9] 200.4291 197.2459
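As a cross-check, the built-in AIC() function can be applied to each fitted model. This is a sketch rather than a replacement for the loop above, and the names fits and aic are illustrative. The values will not match the hand-computed ones exactly, because AIC() uses the maximum-likelihood estimate of the residual variance and counts it as a third parameter, but the ordering of these single-predictor models should be the same.

```r
variables <- names(mtcars)[-1]

# One single-predictor model per variable
fits <- lapply(variables, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})

# Built-in AIC for each model, labelled by predictor
aic <- setNames(sapply(fits, AIC), variables)
sort(aic)
```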
I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would then like to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here, x should select the first column, then the second, and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use a for loop as follows, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
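As for the original error: single brackets on a data frame, as in FAA_unique_with_duration_filtered[x], return a one-column data frame (a list), which lm() cannot use as a variable. Double brackets return the column as a vector. Also, append() returns a new vector instead of modifying its argument, so the result has to be assigned back. A sketch of the loop with both points fixed, again using anscombe as a stand-in for the real data:

```r
a <- anscombe
coeff_value_vector <- numeric(0)
RCs_sign_vector    <- numeric(0)

for (x in 1:4) {
  # [[x]] extracts column x as a plain vector
  fit <- lm(a$y1 ~ a[[x]])
  slope <- unname(coef(fit)[2])   # second coefficient is the slope
  # append() does not modify in place, so assign the result back
  coeff_value_vector <- append(coeff_value_vector, slope)
  RCs_sign_vector    <- append(RCs_sign_vector, sign(slope))
}

coeff_value_vector
RCs_sign_vector
```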
I wish to create a model matrix of the independent variables/specific levels of categorical variables selected by LASSO so that I can plug said model matrix into a glm() function to run a logistic regression.
I have included an example of what I'm trying to do. Any help would be greatly appreciated.
library(glmnet) # provides cv.glmnet() and glmnet()
data("iris")
iris$Petal.Width <- factor(iris$Petal.Width)
iris$Sepal.Length2 <- ifelse(iris$Sepal.Length>=5.8,1,0)
f <- as.formula(Sepal.Length2~Sepal.Width+Petal.Length+Petal.Width+Species)
X <- model.matrix(f,iris)[,-1]
Y <- iris$Sepal.Length2
cvfit <- cv.glmnet(X,Y,alpha=1,family="binomial")
fit <- glmnet(X,Y,alpha=1,family = "binomial")
b <- coef(cvfit,s="lambda.1se")
print(b)
## This is the part I am unsure of: I want to create a model matrix of the non-zero coefficients contained within 'b'
## e.g.
lasso_x <- model.matrix(b,iris)
logistic_model <- glm.fit(lasso_x,Y,family = "binomial")
Edit:
I also tried the following:
model.matrix(~X)[which(b!=0)-1]
but it just gives me a single column of 1s, whose length is the number of variables selected by LASSO (minus the intercept).
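One possible approach is to subset the columns of the model matrix X that was already built, using the names of the non-zero coefficients. The attempt above indexes the matrix with a single vector and no comma, which picks individual elements from the first column (the intercept column of 1s) instead of whole columns. The sketch below runs in base R only: b_hat is a hypothetical stand-in for as.matrix(coef(cvfit, s = "lambda.1se"))[, 1], with made-up values, and Petal.Width is kept numeric to keep the example small.

```r
d <- iris
d$Sepal.Length2 <- as.integer(d$Sepal.Length >= 5.8)
f <- Sepal.Length2 ~ Sepal.Width + Petal.Length + Petal.Width + Species
X <- model.matrix(f, d)[, -1]
Y <- d$Sepal.Length2

# Hypothetical LASSO coefficients (made-up values): intercept first,
# exact zeros for the terms the penalty dropped
b_hat <- c("(Intercept)" = -8, Sepal.Width = 0, Petal.Length = 1.9,
           Petal.Width = 0, Speciesversicolor = 0.4, Speciesvirginica = 0)

# Names of the selected (non-zero) terms, excluding the intercept
selected <- setdiff(names(b_hat)[b_hat != 0], "(Intercept)")

# Subset columns, not elements: note the comma and drop = FALSE
lasso_x <- X[, selected, drop = FALSE]

# Refit an unpenalised logistic regression on the selected columns
refit <- glm(Y ~ ., data = data.frame(Y = Y, lasso_x), family = binomial)
coef(refit)
```

glm() is used here instead of glm.fit() because it adds the intercept itself. Selecting by name also means individual dummy columns of a factor can be kept or dropped independently, which matches what LASSO selects.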