So I am trying to run a univariate logistic regression analysis on some data I have.
Basically I have a data frame with 1 response variable and 50 predictors.
In order to analyse it I just use the glm function as:
glm(response_var~predictor_var1, data = mydata, family = binomial(link=logit))
However, I don't want to do that manually for all 50 predictors, and looping doesn't seem to work here. I have tried something like this:
predictors <- colnames(mydata)[-c(1)]
glm_list <- list()
i <- 1
for (predictor in predictors) {
model <- glm(response_var~predictor, data = mydata, family = binomial(link=logit))
glm_list[[i]] <- model
i <- i + 1
}
So here I just create a character vector with the names of the predictors in the data frame through colnames.
But when doing this I just get the error:
variable lengths differ (found for 'predictors')
What am I doing wrong here ?
Try with lapply and as.formula():
"%+%" <- function(x,y) paste(x, y, sep = "")
lapply(predictors, function(x){
glm(as.formula("response_var ~ " %+% x), data = mydata, family = binomial(link = logit))
})
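In your loop, predictor holds a character string, and glm() uses that string itself as the variable rather than the column it names, hence the length mismatch. You must first coerce the string to a formula.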
Hope it helps.
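The same idea can also be kept as a for loop. This is a minimal sketch, assuming simulated data in place of the asker's mydata (the column values are made up) and using reformulate() from base R instead of the %+% helper:

```r
# Simulated stand-in for the asker's data (hypothetical values)
set.seed(1)
mydata <- data.frame(
  response_var   = rbinom(100, 1, 0.5),
  predictor_var1 = rnorm(100),
  predictor_var2 = rnorm(100),
  predictor_var3 = rnorm(100)
)

predictors <- colnames(mydata)[-1]
glm_list <- list()
for (predictor in predictors) {
  # Build response_var ~ <column name> from the string
  f <- reformulate(predictor, response = "response_var")
  # Index by name so each model can be looked up by its predictor
  glm_list[[predictor]] <- glm(f, data = mydata,
                               family = binomial(link = "logit"))
}

# One row per predictor: slope estimate and its p-value
slope_table <- t(sapply(glm_list, function(m) {
  coef(summary(m))[2, c("Estimate", "Pr(>|z|)")]
}))
slope_table
```

Storing the models under the predictor name also removes the need for the manual counter i.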
I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and P values) to finally construct a figure summarizing the effect sizes of each predictor. I am aware of the debate about P value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set a loop to repeat the model across the N predictors in my dataset and store the coefficients and P values in a dataframe.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a Vectorized function.
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
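For the summary figure the question asks about, here is a minimal base R sketch, assuming a matrix shaped like res. The values are copied from the printed output rather than refitted, and the name effects is hypothetical:

```r
# Values copied from the printed `res` above (not refitted here)
res <- matrix(
  c(-7.177579138, 0.8002737,
    -5.010342111, 0.5377551,
    -0.013030513, 0.7126500,
    -0.041702039, 0.2383835,
    -0.001437124, 0.9676346,
     0.005259293, 0.8818644,
    31.304496255, 0.2511275),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("predictor_", 1:7), c("Estimate", "Pr(>|t|)"))
)

# Turn the matrix into a data frame that is easier to sort and plot
effects <- data.frame(
  predictor = rownames(res),
  estimate  = res[, "Estimate"],
  p_value   = res[, "Pr(>|t|)"],
  row.names = NULL
)
effects <- effects[order(effects$p_value), ]

# One bar per predictor, smallest p-value first
barplot(effects$estimate, names.arg = effects$predictor,
        las = 2, ylab = "Estimate")
```

The predictors are on different scales here, so the bar heights are not directly comparable unless the predictors are standardized before fitting.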
I have a data frame DF with the columns TrainStation, TimeOfday, Date and NumberOfPassenger.
First you see the code, then the data for the regression, and finally the list with the regressions.
TrainStation<-c("East","North","East","North","North","Central","North",
"Central","East","North","East","North","Central","North",
"Central","North","Central","North","Central","North","Central",
"North","Central","East","North","East","North","Central","North",
"Central","East","North","East","North","Central","East")
TimeOfday<-c(12,12,8,16,10,6,0,7,1,3,23,15,12,8,16,10,1,3,5,7,9,10,12,11,17,2,4,5,
13,14,18,19,20,21,22,23)
Date<-sample(seq(as.Date('2019/01/01'), as.Date('2019/02/28'), by="day"), 36)
Date<-as.character(Date)
DF<-cbind(TrainStation,TimeOfday,Date)
DF<-as.data.frame(DF)
#Weekdays
DF$Date<-as.Date(DF$Date)
DF$Date<-weekdays(DF$Date)
#TimeOfday
library(lubridate) # provides hour()
DF$TimeOfday<-strptime(DF$TimeOfday,format = "%H")
DF$TimeOfday<-hour(DF$TimeOfday)
DF$TrainStation<-as.character(DF$TrainStation)
DF$TimeOfday<-as.factor(DF$TimeOfday)
DF$Date<-as.factor(DF$Date)
And my data for the regression is this:
library(tidyverse)
DF2<-DF%>%
group_by(TrainStation,Date,TimeOfday)%>%
summarize(NumberOfPassenger = n_distinct(TrainStation))
Then I make a list from my data, followed by a regression (glm):
#List and glm
l_DF2<-split(DF2,DF2$TrainStation)
lapply(l_DF2, function(x) glm(formula = NumberOfPassenger~TimeOfday+Date,family = poisson(link = "log"), data = x))
Question:
Now I would like to look at the coefficients and make some Plots. But how do I get the coefficients from the list?
Plot example for a different model:
mod<-glm(formula = NumberOfPassenger~TrainStation+TimeOfday+Date,family = poisson(link = "log"), data = DF2)
I could show the coefficients of TimeOfday in a plot with this:
barplot(coef(mod)[grep("TimeOfday",names(coef(mod)))])
How can I plot in my case?
What about
res <- lapply(l_DF2,
function(x) {
glm(formula = NumberOfPassenger ~ TimeOfday + Date,
family = poisson(link = "log"), data = x)
}
)
lapply(res, coef)
To extract the coefficients of a model, you can use stats::coef(). Now you only have to iterate over your list, which you can do using lapply() as you have already done to fit the models.
If you did not have stats::coef() available, you could just extract the coefficients entry from each model in the list using something like
lapply(res, `[[`, "coefficients")
# or slightly more verbose
lapply(res, function(x) x[["coefficients"]])
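To get from the list of coefficients to a plot, here is a minimal sketch, assuming the built-in warpbreaks data as a stand-in for DF2 (wool plays the role of TrainStation, tension the role of TimeOfday):

```r
# Split a built-in count data set by group and fit one Poisson glm per group
l_wb <- split(warpbreaks, warpbreaks$wool)
res <- lapply(l_wb, function(x) {
  glm(breaks ~ tension, family = poisson(link = "log"), data = x)
})

# sapply() binds the coefficient vectors into a matrix:
# one row per coefficient, one column per group
coefs <- sapply(res, coef)

# Keep only the tension coefficients and draw grouped bars
tension_coefs <- coefs[grep("tension", rownames(coefs)), ]
barplot(tension_coefs, beside = TRUE, legend.text = TRUE,
        ylab = "Coefficient")
```

This only simplifies to a matrix when every model has the same coefficients. If some groups lack certain factor levels, as can happen with TimeOfday per station, sapply() returns a list and the models are better plotted one at a time.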
I need to cross-validate several glmer models on the same data so I've made a function to do this (I'm not interested in preexisting functions for doing this). I want to pass an arbitrary glmer model to my function as the only argument. Sadly, I can't figure out how to do this, and the interwebz won't tell me.
Ideally, I would like to do something like:
model = glmer(y ~ x + (1|z), data = train_folds, family = "binomial")
model2 = glmer(y ~ x2 + (1|z), data = train_folds, family = "binomial")
And then call cross_validation_function(model) and cross_validation_function(model2). The training data within the function is called train_folds.
However, I suspect I need to pass the model formula in a different way, using reformulate.
Here is an example of my function. The project is about predicting autism (ASD) from behavioral features. The data variable is da.
library(pacman)
p_load(tidyverse, stringr, lmerTest, MuMIn, psych, corrgram, ModelMetrics,
caret, boot)
cross_validation_function <- function(model){
#creating folds
participants = unique(da$participant)
folds <- createFolds(participants, 10)
cross_val <- sapply(seq_along(folds), function(x) {
train_folds = filter(da, !(as.numeric(participant) %in% folds[[x]]))
predict_fold = filter(da, as.numeric(participant) %in% folds[[x]])
#model to be tested should be passed as an argument here
train_model <- model
predict_fold <- predict_fold %>%
mutate(predictions_perc = predict(train_model, predict_fold, allow.new.levels = T),
predictions_perc = inv.logit(predictions_perc),
predictions = ifelse(predictions_perc > 0.5, "ASD","control"))
conf_mat <- caret::confusionMatrix(data = predict_fold$predictions, reference = predict_fold$diagnosis, positive = "ASD")
accuracy <- conf_mat$overall[1]
sensitivity <- conf_mat$byClass[1]
specificity <- conf_mat$byClass[2]
fixed_ef <- fixef(train_model)
output <- c(accuracy, sensitivity, specificity, fixed_ef)
})
cross_df <- t(cross_val)
return(cross_df)
}
Solution developed from the comment: using as.formula, a string can be converted into a formula, which can then be passed as an argument to my function in the following way:
cross_validation_function <- function(model_formula){
...
train_model <- glmer(model_formula, data = da, family = "binomial")
...}
formula <- as.formula("y ~ x + (1|z)")
cross_validation_function(formula)
If your aim is to extract the model formula from a fitted model, then you can use
attributes(model)$call[[2]]. You can then use this formula when fitting the model on the CV folds.
mod_formula <- attributes(model)$call[[2]]
train_model = glmer(mod_formula , data = train_data,
family = "binomial")
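The pattern of passing a formula and refitting inside each fold can be sketched end to end. This is a minimal illustration only: it swaps in base glm() for glmer() so that it runs without lme4, uses simulated data, and the names cv_accuracy and fold_id are hypothetical:

```r
# Simulated binary outcome (hypothetical data, not the ASD data set)
set.seed(42)
da <- data.frame(x = rnorm(200), x2 = rnorm(200))
da$y <- rbinom(200, 1, plogis(0.8 * da$x))

cv_accuracy <- function(model_formula, data, k = 5) {
  # Randomly assign each row to one of k folds
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  # Name of the response variable, taken from the formula
  response <- all.vars(model_formula)[1]
  sapply(seq_len(k), function(i) {
    train_folds  <- data[fold_id != i, ]
    predict_fold <- data[fold_id == i, ]
    # Refit on the training folds only, using the passed formula
    fit  <- glm(model_formula, data = train_folds, family = binomial)
    prob <- predict(fit, newdata = predict_fold, type = "response")
    # Share of held-out rows classified correctly at a 0.5 cutoff
    mean((prob > 0.5) == (predict_fold[[response]] == 1))
  })
}

acc1 <- cv_accuracy(y ~ x, da)
acc2 <- cv_accuracy(y ~ x2, da)
```

The key point is that the model is refitted on each training subset. Passing an already fitted model object would reuse coefficients estimated on whatever data it saw originally.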
I wonder if I can use something like a for loop or an apply function to do linear regression in R. I have a data frame containing variables such as crim, rm, ad, wd. I want to do a simple linear regression of crim on each of the other variables.
Thank you!
If you really want to do this, it's pretty trivial with lapply(), where we use it to "loop" over the other columns of df. A custom function takes each variable in turn as x and fits a model for that covariate.
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
mods <- lapply(df[, -1], function(x, dat) lm(crim ~ x, data = dat), dat = df) # dat must be passed explicitly
mods is now a list of lm objects. The names of mods contain the names of the covariates used to fit the models. The main negative of this is that all the models are fitted using a variable x. More effort could probably solve this, but I doubt that effort is worth the time.
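If the generic x in the fitted models is a nuisance, one possible fix is to build each formula from the column name. This is a minimal sketch using reformulate() from base R, with the same kind of dummy data; the name covariates is hypothetical:

```r
set.seed(1)
df <- data.frame(crim = rnorm(20), rm = rnorm(20),
                 ad = rnorm(20), wd = rnorm(20))

# Loop over column names rather than column values
covariates <- setdiff(names(df), "crim")
mods <- lapply(setNames(covariates, covariates), function(v) {
  # reformulate("rm", response = "crim") gives crim ~ rm
  lm(reformulate(v, response = "crim"), data = df)
})

# Each model now reports its slope under the real covariate name
slope_names <- sapply(mods, function(m) names(coef(m))[2])
slope_names
```

Because the formula carries the real variable name, predict() with new data also works on these models without renaming columns.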
If you are just selecting models, which may be dubious, there are other ways to achieve this. For example via the leaps package and its regsubsets function:
library("leaps")
a <- regsubsets(crim ~ ., data = df, nvmax = 1, nbest = ncol(df) - 1)
summa <- summary(a)
Then plot(a) will show which of the models is "best", for example.
Original
If I understand what you want (crim is a covariate and the other variables are the responses you want to predict/model using crim), then you don't need a loop. You can do this using a matrix response in a standard lm().
Using some dummy data:
df <- data.frame(crim = rnorm(20), rm = rnorm(20), ad = rnorm(20), wd = rnorm(20))
we create a matrix or multivariate response via cbind(), passing it the three response variables we're interested in. The remaining parts of the call to lm are entirely the same as for a univariate response:
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)
mods
> mods
Call:
lm(formula = cbind(rm, ad, wd) ~ crim, data = df)
Coefficients:
rm ad wd
(Intercept) -0.12026 -0.47653 -0.26419
crim -0.26548 0.07145 0.68426
The summary() method produces a standard summary.lm output for each of the responses.
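To pull the slopes and their p-values out of the multivariate fit, here is a minimal sketch on the same kind of dummy data; the names slopes and p_values are hypothetical:

```r
set.seed(1)
df <- data.frame(crim = rnorm(20), rm = rnorm(20),
                 ad = rnorm(20), wd = rnorm(20))
mods <- lm(cbind(rm, ad, wd) ~ crim, data = df)

# coef() on a multivariate fit is a matrix: rows are terms, columns responses
slopes <- coef(mods)["crim", ]

# summary() returns one summary.lm per response; take the crim p-value of each
p_values <- sapply(summary(mods), function(s) coef(s)["crim", "Pr(>|t|)"])

slopes
p_values
```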
Suppose you want to keep the first column of your data frame fixed as the response variable and run a simple linear regression against each of the other variables individually.
h=iris[,-5]
for (j in 2:ncol(h)){
assign(paste("a", j, sep = ""),lm(h[,1]~h[,j]))
}
The code above creates one regression object per predictor and stores them in a2, a3, and so on.
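Objects created with assign() are easier to work with once they are gathered into a list. This is a minimal sketch, assuming the loop above has been run in the same environment; the names fits and r2 are hypothetical:

```r
h <- iris[, -5]
for (j in 2:ncol(h)) {
  assign(paste("a", j, sep = ""), lm(h[, 1] ~ h[, j]))
}

# Collect a2, a3, ... into one list and label by the predictor column
fits <- mget(paste0("a", 2:ncol(h)))
names(fits) <- names(h)[-1]

# Compare the fits, for example by R-squared
r2 <- sapply(fits, function(m) summary(m)$r.squared)
r2
```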
I am trying to create a function to test a logistic regression model developed on a training set.
For example
train <- filter(y, folds != i)
test <- filter(y, folds == i)
I want to be able to use the formula for different data sets.
For example, if I were to take y to be a response variable such as "low" in the birthwt data set and x to be the explanatory variables, e.g. "age" and "race", how would I pass these arguments into the glm.train formula without having to type out the call separately for each data set?
glm.train <- glm(y ~x, family = binomial, data = train)
You can use reformulate to create a formula based on strings:
x <- c("age", "race")
y <- "low"
form <- reformulate(x, response = y)
# low ~ age + race
Use this formula for glm:
glm.train <- glm(form, family = binomial, data = train)
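To reuse this across data sets, the two steps can be wrapped in a small function. This is a minimal sketch, assuming the built-in mtcars and infert data in place of birthwt; the function name fit_binomial is hypothetical:

```r
# Fit a logistic regression given a data frame and variable names as strings
fit_binomial <- function(data, y, x) {
  form <- reformulate(x, response = y)
  glm(form, family = binomial, data = data)
}

# The same function works for different data sets and variables
fit_cars   <- fit_binomial(mtcars, y = "am",   x = c("wt", "hp"))
fit_infert <- fit_binomial(infert, y = "case", x = c("age", "parity"))

formula(fit_cars)
formula(fit_infert)
```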