I have a database where I want to do several multiple regressions. They all look like this:
fit <- lm(Variable1 ~ Age + Speed + Gender + Mass, data=Data)
The only thing that changes is Variable1. I want to use a loop, or something from the apply family, to put several different variables in the place of Variable1. These variables are columns in my data file. Can someone help me solve this problem? Many thanks!
What I tried so far:
When I extract one of the column names with the names() function, I do get the name of the column:
varname = as.name(names(Data[14]))
But when I fill this in (and I used the attach() function):
fit <- lm(Varname ~ Age + Speed + Gender + Mass, data=Data)
I get the following error:
Error in model.frame.default(formula = Varname ~ Age + Speed + Gender
+ : object is not a matrix
I suppose that the lm() function does not recognize Varname as Variable1.
You can use lapply to loop over your variables.
fit <- lapply(Data[,c(...)], function(x) lm(x ~ Age + Speed + Gender + Mass, data = Data))
This gives you a list of your results.
The c(...) should contain your variable names as strings. Alternatively, you can choose the variables by their position in Data, like Data[,1:5].
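As a minimal sketch of this approach, assuming the built-in mtcars data stands in for Data (mpg and qsec play the role of the changing response, wt and hp the fixed predictors; none of these names come from the question):

```r
# Assumption: mtcars stands in for Data; the column choices are illustrative only.
responses <- c("mpg", "qsec")

# lapply() over a data frame iterates over its columns,
# so x is the response vector itself, not its name.
fits <- lapply(mtcars[, responses], function(x) lm(x ~ wt + hp, data = mtcars))

names(fits)      # the list is named after the selected columns
coef(fits$mpg)   # coefficients of the model with mpg as the response
```

One consequence of passing the column itself is that the stored call shows x ~ wt + hp for every model, so the list names are the only record of which response was used.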
The problem in your case is that lm() reads the terms of the formula literally: it looks for a column (or object) called Varname instead of using the column name stored in that variable. To use the stored name, you need to build the formula from the string, so that the value of varnames is combined with the other variables.
# generate some data
set.seed(123)
Data <- data.frame(x = rnorm(30), y = rnorm(30),
Age = sample(0:90, 30), Speed = rnorm(30, 60, 10),
Gender = sample(c("W", "M"), 30, replace = TRUE), Mass = rnorm(30))
varnames <- names(Data)[1:2]
# fit regressions for multiple dependent variables
fit <- lapply(varnames,
FUN=function(x) lm(formula(paste(x, "~Age+Speed+Gender+Mass")), data=Data))
names(fit) <- varnames
fit
$x
Call:
lm(formula = formula(paste(x, "~Age+Speed+Gender+Mass")), data = Data)
Coefficients:
(Intercept) Age Speed GenderW Mass
0.135423 0.010013 -0.010413 0.023480 0.006939
$y
Call:
lm(formula = formula(paste(x, "~Age+Speed+Gender+Mass")), data = Data)
Coefficients:
(Intercept) Age Speed GenderW Mass
2.232269 -0.008035 -0.027147 -0.044456 -0.023895
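A variation on the same idea, sketched with reformulate() instead of paste() and with the coefficients collected into one matrix. The mtcars data and the chosen column names are assumptions for illustration, not part of the original answer:

```r
# Assumption: mtcars stands in for Data; these column choices are hypothetical.
responses  <- c("mpg", "qsec")
predictors <- c("wt", "hp", "am")

# Name the input vector so the resulting list is named by response
fits <- lapply(setNames(responses, responses), function(y) {
  f <- reformulate(predictors, response = y)   # e.g. mpg ~ wt + hp + am
  lm(f, data = mtcars)
})

# One column of coefficients per response
coef_table <- sapply(fits, coef)
coef_table
```

reformulate() is part of base R (package stats) and avoids assembling the formula text by hand.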
I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and P values) to finally construct a figure summarizing the effect size of each predictor. I am aware of the debate about P value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set a loop to repeat the model across the N predictors in my dataset and store the coefficients and P values in a dataframe.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a Vectorized function.
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
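The update-in-a-loop pattern is not specific to lmer. The sketch below applies it to plain lm() on the built-in mtcars data so that it runs without lme4; the data set and column names are assumptions for illustration. It also selects the coefficient row by predictor name rather than by position, which keeps working if the base model gains or loses terms:

```r
# Assumption: lm() on mtcars stands in for the mixed model; names are illustrative.
base_fit <- lm(mpg ~ wt, data = mtcars)
preds    <- c("hp", "qsec", "drat")

# Refit with one extra predictor and return its estimate and p value
extract_coef_p <- function(p) {
  fit <- update(base_fit, paste(". ~ . +", p))
  coef(summary(fit))[p, c("Estimate", "Pr(>|t|)")]
}

# One row per predictor, ready to convert to a data frame for plotting
res <- t(sapply(preds, extract_coef_p))
res
```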
I currently have a problem in that I have to pre-specify my formulas before sending them into a regression function. For example, using the stan_gamm4 function in R, we have the following example:
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us then leave everything the same, BUT use an expression for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we did the same thing before, but with formula.rand taking the place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW: we have that:
br$call$random
> formula.rand
Instead of the original. A lot of bayesian packages rely on accessing br$call$random, so is there a way to use a variable for formula, have it pass in, AND retain the original relation when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
# build the call with the formula value inserted, then evaluate it once
stan_call <- substitute(stan_gamm4(y ~ s(x0) + x1 + s(x2), data=dat, random=.rf,
chains=1, iter=200),
list(.rf=formula.rand))
br <- eval(stan_call)
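The same idea can also be written with bquote(), which splices values into a call using .(). This is only a sketch on lm() with the built-in mtcars data; model3 is a hypothetical name, and whether it fits a particular Stan workflow is untested here:

```r
# Hypothetical helper: like model2 above, but built with bquote()
model3 <- function(formula) {
  # .(formula) is replaced by the value of the argument, not its name
  call <- bquote(lm(formula = .(formula), data = mtcars))
  eval(call)
}

f  <- mpg ~ disp
m3 <- model3(f)
m3$call$formula   # the formula itself, not the symbol f
```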
If I understand correctly, your problem is not that stan_gamm4 could be computing incorrect results (which is not the case, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So $call is not just a list with stuff in it; it holds unevaluated language objects. To access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
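The same behaviour can be reproduced with lm() on the built-in mtcars data, which may make it easier to experiment with. This is a sketch under that assumption and does not use Stan:

```r
f <- mpg ~ disp
m <- lm(f, data = mtcars)

class(m$call$formula)         # "name": the call stores the symbol f
class(eval(m$call$formula))   # "formula": evaluating the symbol recovers it

# For lm objects, formula() returns the model formula without touching $call
formula(m)
```

Note that eval() looks the symbol up in the current environment, so it only recovers the formula while f still exists and has not been reassigned.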
df <- data.frame(
disease = c(0,1,0,1),
var1 = c(0,1,2,0),
var2 =c(0,1,2,0),
var3 = c(0,1,2,0),
var40 = c(0,1,2,0),
Bi = c(0,1,0,1),
gender = c(1,0,1,0),
P1 = c(-0.040304832,0.006868288,0.002663759,0.020251087),
P2 = c(0.010566526,0.002663759,0.017480721,-0.008685749),
P3 = c(-0.008685749,0.020251087,-0.040304832,0.002663759),
P4 = c(0.017480721,0.024306667,0.002663759,0.010566526),
stringsAsFactors = FALSE)
The data frame above (df) contains categorical and numerical variables: disease, Bi and gender are coded 0/1, var1 to var40 are coded 0/1/2, and P1, P2, P3 and P4 are continuous. The code for the glm model for one variable is:
glm(disease ~ var1*Bi+ gender+P1+P2+P3+P4, family = binomial(link
= 'logit'), data = df)
I need help writing a loop that automatically fits this regression for disease against each variant, var1 to var40, with the same covariates: Bi, gender, P1, P2, P3 and P4. I tried the loop below for all 40 variants, but it does not work:
for (i in df$var1:df$var40) {glm(DepVar1 ~ i*Bi+gender+P1+P2+P3+P4, data=df,
family=binomial("logit")) }
Building formulas dynamically can be a bit tricky, but there are functions like update() and reformulate() that can help. For example
results <- Map(function(i) {
newform <- update(disease ~ Bi+gender+P1+P2+P3+P4, reformulate(c(".", i)))
glm(newform, data=df, family=binomial("logit"))
}, names(subset(df, select=var1:var40)))
Here we use Map rather than a for loop so that it is easier to save the results (with this method they are collected in a list). We use update() to add the new variable of interest to the base formula. For example
update(disease ~ Bi+gender+P1+P2+P3+P4, ~ . + var1)
# disease ~ Bi + gender + P1 + P2 + P3 + P4 + var1
This adds a variable to the right-hand side. We use reformulate() to turn the column name, given as a string, into a formula.
You can get each model out of the list with
results$var1
results$var40
# etc
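The model in the question includes an interaction with Bi (var1*Bi), which the formula above does not add. A minimal sketch of one way to include it, passing the interaction term as a string to reformulate(). The simulated data frame sim is an assumption made so that the models can actually be fitted (the four-row df from the question is too small):

```r
# Assumption: simulated data with the same column names as the question's df
set.seed(1)
n <- 200
sim <- data.frame(
  disease = rbinom(n, 1, 0.5),
  var1 = sample(0:2, n, replace = TRUE),
  var2 = sample(0:2, n, replace = TRUE),
  Bi = rbinom(n, 1, 0.5),
  gender = rbinom(n, 1, 0.5),
  P1 = rnorm(n), P2 = rnorm(n), P3 = rnorm(n), P4 = rnorm(n)
)

variants <- grep("^var", names(sim), value = TRUE)
covars   <- c("gender", "P1", "P2", "P3", "P4")

fits <- Map(function(v) {
  # e.g. disease ~ var1 * Bi + gender + P1 + P2 + P3 + P4
  f <- reformulate(c(paste0(v, "*Bi"), covars), response = "disease")
  glm(f, data = sim, family = binomial("logit"))
}, variants)

formula(fits$var1)
```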
I am trying to build a GLM formula that uses all of columns 1 to 99 of train as predictors:
model <- glm(train$100 ~ train$1:99, data = train, family = "binomial")
I can't figure out the right way to do this in R...
If you need outcome ~ var1 + var2 + ... + varN, then try this:
# Name of the outcome column
f1 <- colnames(train)[100]
# Other columns separated by "+"
f2 <- paste(colnames(train)[1:99], collapse = "+")
#glm
model <- glm(formula = as.formula(paste(f1, f2, sep = "~")),
data = train,
family = "binomial")
The simplest way, assuming that you want to use all but column 100 as predictor variables, is
model <- glm(v100 ~. , data = train, family = "binomial")
where v100 is the name of the 100th column (the name can't be 100 unless you have done something advanced/sneaky to subvert R's rules about data frame column names ...)
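Both approaches can be checked against each other on a small example. The sketch below uses a simulated data frame with five predictors instead of 99, and reformulate() in place of paste() plus as.formula(); the data and column names are assumptions for illustration:

```r
# Assumption: a small simulated stand-in for train (columns X1..X5 plus outcome)
set.seed(42)
train <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
train$outcome <- rbinom(200, 1, 0.5)

outcome    <- "outcome"
predictors <- setdiff(names(train), outcome)

# Explicit formula: outcome ~ X1 + X2 + X3 + X4 + X5
f <- reformulate(predictors, response = outcome)
model_a <- glm(f, data = train, family = "binomial")

# Dot shorthand: every column except the response is a predictor
model_b <- glm(outcome ~ ., data = train, family = "binomial")

all.equal(coef(model_a), coef(model_b))
```

The explicit version is the one to reach for when only a subset of the columns should enter the model.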