metafor coupled with glmulti: How to exclude undesirable interactions - r

I am fitting the following:
rma.glmulti.ran <- function(formula, data, random, ...) {
  rma.mv(as.formula(paste(deparse(formula))), Variance, random = ~ 1 | Experiment,
         data = msa, method = "REML", ...)
}
msa_res <- glmulti(MSA ~ MAPl + MAT_e + Duration.yrl + Fert + Naddl + Ndepl,
                   data = msa,
                   level = 2,
                   exclude = c("MAPl:MAT_e", "MAPl:Duration.yrl", "MAPl:Fert",
                               "MAPl:Naddl", "MAPl:Ndepl", "MAT_e:Duration.yrl", "MAT_e:Fert", "MAT_e:Ndepl",
                               "Duration.yrl:Fert", "Duration.yrl:Ndepl", "Duration.yrl:Naddl", "Fert:Ndepl", "Naddl:Ndepl"),
                   fitfunction = rma.glmulti.ran, crit = "aicc")
The aim of this code is to include only these two interactions: "Naddl:MAT_e" and "Naddl:Fert". Thus, I am using exclude=c() to filter out all the other undesirable pairwise interactions from the full model (level=2).
In theory, this should be equivalent to:
MSA ~ MAPl + MAT_e + ... + Naddl:MAT_e + Naddl:Fert
However, I get the following error when I add the exclude argument to the call:
Error in glmulti(MSA ~ MAPl + MAT_e + Duration.yrl + Fert + Naddl + Ndepl, :
Improper call of glmulti.
Am I missing something about exclude=c()? Is there a more elegant way to specify interaction terms in glmulti?

I couldn't find the data you were referring to, so I generated fictitious variables below.
exclude didn't work for me either, so I hard-coded the exclusion into the formula. The following code worked for me, and no undesired interactions went into glmulti.
library(utils)   # combn() lives in utils
retain_var <- c('DOW_MON','DOW_TUE','DOW_WED','DOW_THU')   # variables to keep for testing
excl_inter_effects <- c('DOW_FRI','DOW_SAT','DOW_SUN')     # variables whose interactions should be excluded
set_depth <- 2   # generate the two-way interactions to exclude explicitly
excl_inter_form <- t(combn(excl_inter_effects, set_depth))
Paste <- function(x) paste(x, collapse = ":")
excl_inter_form <- apply(excl_inter_form, 1, Paste)
rm(list = c('Paste', 'set_depth', 'excl_inter_effects'))   # clean up
# ----- Generate formula incorporating exclusion -----
glm_formula <- as.formula(paste("y ~", paste(retain_var, collapse = "+"), "-",
                                paste(excl_inter_form, collapse = "-")))
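As a quick sanity check (still with the made-up DOW_* variables, so only the structure matters; line wrapping may differ), printing the generated formula shows the retained main effects with the unwanted pairwise interactions subtracted:
glm_formula
#> y ~ DOW_MON + DOW_TUE + DOW_WED + DOW_THU - DOW_FRI:DOW_SAT -
#>     DOW_FRI:DOW_SUN - DOW_SAT:DOW_SUN
This formula object can then be passed to glmulti (or any other fitting function) in place of the hand-typed formula.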

Related

Error including correlation structure in function with gamm

I am trying to create my own function that contains (1) the mgcv gamm function and (2) a nested autocorrelation (ARMA) argument. I am getting an error when I try to run the function like this:
df <- AirPassengers
df <- as.data.frame(df)
df$month <- rep(1:12)
df$yr <- rep(1949:1960,each=12)
df$datediff <- 1:nrow(df)
try_fxn1 <- function(dfz, colz){gamm(dfz[[colz]] ~ s(month, bs="cc",k=12)+s(datediff,bs="ts",k=20), data=dfz,correlation = corARMA(form = ~ 1|yr, p=2))}
try_fxn1(df,"x")
Error in eval(predvars, data, env) : object 'dfz' not found
I know the issue is with the correlation portion of the formula, as when I run the same function without the correlation structure included (as seen below), the function behaves as expected.
try_fxn2 <- function(dfz, colz){gamm(dfz[[colz]] ~ s(month, bs="cc",k=12)+ s(datediff,bs="ts",k=20), data=dfz)}
try_fxn2(df,"x")
Any ideas on how I can modify try_fxn1 to make the function behave as expected? Thank you!
You are confusing a vector with the symbolic representation of that vector when building a formula.
You don't want dfz[[colz]] as the response in the formula; you want x, or whatever you set colz to. What you are getting is
dfz[[colz]] ~ ...
when what you really want is the variable colz:
colz ~ ...
And you don't want a literal colz but whatever colz evaluates to. To do this you can create a formula by pasting the parts together:
fml <- paste(colz, '~ s(month, bs="cc", k=12) + s(datediff,bs="ts",k=20)')
This turns colz into whatever it was storing, not the literal colz:
> fml
[1] "x ~ s(month, bs=\"cc\", k=12) + s(datediff,bs=\"ts\",k=20)"
Then convert the string into a formula object using formula() or as.formula().
The final solution then is:
fit_fun <- function(dfz, colz) {
  fml <- paste(colz, '~ s(month, bs="cc", k=12) + s(datediff, bs="ts", k=20)')
  fml <- formula(fml)
  gamm(fml, data = dfz, correlation = corARMA(form = ~ 1 | yr, p = 2))
}
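A quick check with the AirPassengers data frame built in the question (assuming mgcv is attached; it also loads nlme, which provides corARMA):
library(mgcv)
m1 <- fit_fun(df, "x")
summary(m1$gam)   # gamm() returns a list with $gam and $lme components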
This really is not an issue with the corARMA() part, other than that it triggers somewhat different evaluation code for the formula. The guiding mantra here is to always build the formula as you would type it if you were not programming with formulas. You would never (or should never) write a formula like
gamm(df[[var]] ~ x + s(z), ....)
While this might work in some settings, it will fail miserably if you ever want to use predict(), and it fails whenever you have to do something a little more complicated.

Remove linear dependent variables while using the bife package

Some modelling functions in R automatically remove linearly dependent variables from their regression output (e.g. lm()). With the bife package, this does not seem to happen. As stated in the package documentation on CRAN (page 5):
If bife does not converge this is usually a sign of linear dependence between one or more regressors
and the fixed effects. In this case, you should carefully inspect your model specification.
Now, suppose the problem at hand involves running many regressions, so that one cannot adequately inspect each output and has to rely on some rule of thumb regarding the regressors. What are some options for removing linearly dependent regressors more or less automatically and still achieving an adequate model specification?
I set a code as an example below:
#sample coding
x=10*rnorm(40)
z=100*rnorm(40)
df1=data.frame(a=rep(c(0,1),times=20), x=x, y=x, z=z, ID=c(1:40), date=1, Region=rep(c(1,2, 3, 4),10))
df2=data.frame(a=c(rep(c(1,0),times=15),rep(c(0,1),times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=c(1:40), date=2, Region=rep(c(1,2,3,4),10))
df3 = rbind(df1, df2)
results = list()
for (i in 1:4) {
  dat_i = df3[df3$Region == i, ]
  results[[i]] = bife::bife(a ~ x + y + z | ID, data = dat_i)   # errors: x and y are perfectly collinear
}
Error: Linear dependent terms detected!
Since you're only looking at linear dependencies, you could simply leverage methods that detect them, like for instance lm.
Here's an example of a solution with the package fixest:
library(bife)
library(fixest)
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a=rep(c(0,1),times=20), x=x, y=x, z=z, ID=c(1:40), date=1, Region=rep(c(1,2, 3, 4),10))
df2 = data.frame(a=c(rep(c(1,0),times=15),rep(c(0,1),times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=c(1:40), date=2, Region=rep(c(1,2,3,4),10))
df3 = rbind(df1, df2)
vars = c("x", "y", "z")
res_all = list()
for(i in 1:4) {
  x = df3[df3$Region == i, ]
  coll_vars = feols(a ~ x + y + z | ID, x, notes = FALSE)$collin.var
  new_fml = xpd(a ~ ..vars | ID, ..vars = setdiff(vars, coll_vars))
  res_all[[i]] = bife::bife(new_fml, data = x)
}
# Display all results
for(i in 1:4) {
  cat("\n#\n# Region: ", i, "\n#\n\n")
  print(summary(res_all[[i]]))
}
The functions needed here are feols and xpd, both from fixest. Some explanations:
feols, like lm, removes variables on-the-fly when they are found to be collinear. It stores the names of the collinear variables in the slot $collin.var (if none is found, it's NULL).
Unlike lm, feols also allows fixed effects, so you can include them when looking for linear dependencies: this way you can spot complex dependencies that also involve the fixed effects.
I've set notes = FALSE; otherwise feols would have printed a note about the collinearity.
feols is fast (actually faster than lm for large data sets), so it won't be a strain on your analysis.
The function xpd expands the formula and replaces any variable name starting with two dots with the associated argument that the user provides.
When the arguments of xpd are vectors, the behavior is to collapse them with pluses, so if ..vars = c("x", "y") is provided, the formula a ~ ..vars | ID becomes a ~ x + y | ID.
Here it replaces ..vars in the formula with setdiff(vars, coll_vars), which is the vector of variables that were not found to be collinear.
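As a small illustration of that expansion (a toy call with literal names, independent of the data):
xpd(a ~ ..vars | ID, ..vars = c("x", "z"))
#> a ~ x + z | ID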
So you get an algorithm with automatic variable removal before performing bife estimations.
Finally, just a side comment: in general it's better to store results in lists since it avoids copies.
Update
I forgot to mention: if you don't need the bias correction (bife::bias_corr), then you can directly use fixest::feglm, which automatically removes collinear variables:
res_bife = bife::bife(a ~ x + z | ID, data = df3)
res_feglm = fixest::feglm(a ~ x + y + z | ID, df3, family = binomial)
rbind(coef(res_bife), coef(res_feglm))
#> x z
#> [1,] -0.02221848 0.03045968
#> [2,] -0.02221871 0.03045990
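If you do need the bias correction, it is applied afterwards to the fitted bife object:
summary(bife::bias_corr(res_bife))   # analytical bias correction, then the usual summary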

Pasting object names inside functions

This is a follow-up to a previous question (see there for the data and earlier commands).
Starting with a list of models in mods, I am now able to find the model with the lowest AIC (which corresponds to the best model):
mods <- lapply(methods, function(m)
update(amod.null, correlation = getFunction(m)(1, form = ~ x + y), method="ML"))
names(mods) <- methods
list.AIC <- lapply(mods, function(x) AIC(x))
best.mod <- names(which.min(list.AIC))
Now, I need to do some testing on the model, e.g. a Tukey test between dates. The syntax is very simple, e.g. for amod.null:
library(multcomp)
res <- glht(amod.null, mcp(Date = "Tukey"))
The tricky part is: how can I tell glht to use the model stored in best.mod (note: this is all happening within a loop)? I tried
res <- glht(paste("mods$", as.factor(best.mod),sep = "") , mcp(Date = "Tukey"))
but to no avail, as glht needs an actual model object as its first argument.
Edit:
Possibly useful:
names(mods)
[1] "corExp" "corGaus" "corLin" "corRatio" "corSpher"
Since the models are stored in the list mods, you can access the "best model" by using the index returned by which.min(list.AIC):
list.AIC <- sapply(mods, AIC)
best.mod <- mods[which.min(list.AIC)]
best.mod[[1]]
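With the model extracted, the glht call from your question works directly on it (same syntax as for amod.null):
library(multcomp)
res <- glht(best.mod[[1]], mcp(Date = "Tukey"))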

Use string of independent variables within the lm function

I have a data frame with many variables. I want to fit a linear regression to explain the last one with the others. Since I had too much to write, I thought about creating a string with the independent variables, e.g. Var1 + Var2 + ... + VarK. I achieved it by pasting "+" onto all column names except the last one with this code:
ExVar <- toString(paste(names(datos)[1:11], "+ ", collapse = ''))
I also had to remove the last "+":
ExVar <- substr(ExVar, 1, nchar(ExVar)-2)
So I copied and pasted the ExVar string within the lm() function and the result looked like this:
m1 <- lm(calidad ~ Var1 + Var2 + ... + VarK)
The question is: is there any way to use ExVar within the lm() function as a string, not as a variable, to have cleaner code?
For better understanding:
If I use this code:
m1 <- lm(calidad ~ ExVar)
It interprets ExVar as a single independent variable.
The following will all produce the same results. I am providing multiple methods because there are simpler ways of doing what you are asking (see examples 2 and 3) than writing the expression as a string.
First, I will generate some example data:
n <- 100
p <- 11
dat <- array(rnorm(n*p),c(n,p))
dat <- as.data.frame(dat)
colnames(dat) <- paste0("X",1:p)
If you really want to specify the model as a string, this example code will help:
ExVar <- toString(paste(names(dat[2:11]), "+ ", collapse = ''))
ExVar <- substr(ExVar, 1, nchar(ExVar)-3)
model1 <- paste("X1 ~ ",ExVar)
fit1 <- lm(eval(parse(text = model1)),data = dat)
Otherwise, note that the 'dot' notation will specify all other variables in the model as predictors.
fit2 <- lm(X1 ~ ., data = dat)
Or, you can select the predictors and outcome variables by column, if your data is structured as a matrix.
dat <- as.matrix(dat)
fit3 <- lm(dat[,1] ~ dat[,-1])
All three of these fit objects have the same estimates:
fit1
fit2
fit3
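If you want to avoid the string surgery altogether, base R's reformulate() builds the same formula from a character vector of term labels (a small sketch using the simulated data, with X1 as the response):
dat <- as.data.frame(dat)   # back to a data frame (it was converted to a matrix above)
fit4 <- lm(reformulate(names(dat)[-1], response = "X1"), data = dat)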
If you have a data frame and you want to explain the last column using all the rest, then you can use the code below:
lm(calidad ~ ., dat)
or you can use
lm(rev(dat))   # only if the last column is your response variable
Either of the two above will give you the results needed.
To do it your way:
EXV=as.formula(paste0("calidad~",paste0(names(datos)[-12],collapse = '+')))
lm(EXV,dat)
There is no need to do it this way, though, since lm itself handles it via the dot notation in the first snippet above.

Specifying formula in R with glm without explicit declaration of each covariate

I would like to force specific variables into glm regressions without fully specifying each one. My real data set has ~200 variables. I haven't been able to find examples of this in my searching online so far.
For example (with just 3 variables):
n=200
set.seed(39)
samp = data.frame(W1 = runif(n, min = 0, max = 1), W2=runif(n, min = 0, max = 5))
samp = transform(samp, # add A
A = rbinom(n, 1, 1/(1+exp(-(W1^2-4*W1+1)))))
samp = transform(samp, # add Y
Y = rbinom(n, 1,1/(1+exp(-(A-sin(W1^2)+sin(W2^2)*A+10*log(W1)*A+15*log(W2)-1+rnorm(1,mean=0,sd=.25))))))
If I want to include all main terms, this has an easy shortcut:
glm(Y~., family=binomial, data=samp)
But say I want to include all main terms (W1, W2, and A) plus W2^2:
glm(Y~A+W1+W2+I(W2^2), family=binomial, data=samp)
Is there a shortcut for this?
[editing self before publishing:] This works! glm(formula = Y ~ . + I(W2^2), family = binomial, data = samp)
Okay, so what about this one!
I want to omit one main-effect variable and include only two main terms (A and W2), plus W2^2 and the interaction W2^2:A:
glm(Y~A+W2+A*I(W2^2), family=binomial, data=samp)
Obviously, with just a few variables no shortcut is really needed, but I work with high-dimensional data. The current data set has "only" 200 variables, but some others have thousands and thousands.
Your creative use of . to build a formula containing all or almost all variables is a good and clean approach. Another option that is sometimes useful is to build the formula programmatically as a string, and then convert it to a formula using as.formula:
vars <- paste("Var",1:10,sep="")
fla <- paste("y ~", paste(vars, collapse="+"))
as.formula(fla)
Of course, you can make the fla object way more complicated.
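For example, applied to the simulated samp data from your question, the same idea reproduces the "all main effects plus W2^2" model (just a sketch; assemble the term vector however suits your data):
vars <- c("A", "W1", "W2", "I(W2^2)")
fla <- paste("Y ~", paste(vars, collapse = " + "))
glm(as.formula(fla), family = binomial, data = samp)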
Aniko answered your question. To extend a bit:
You can also exclude variables using -:
glm(Y~.-W1+A*I(W2^2), family=binomial, data=samp)
For large groups of variables, I often make a data frame for grouping the variables, which allows you to do something like:
vars <- data.frame(
  names = names(samp),
  main = c(T, F, T, F),
  quadratic = c(F, T, T, F),
  main2 = c(T, T, F, F),
  stringsAsFactors = FALSE
)
regform <- paste(
  "Y ~",
  paste(
    paste(vars[vars$main, 1], collapse = "+"),
    paste(vars[1, 1], paste("*I(", vars[vars$quadratic, 1], "^2)"), collapse = "+"),
    sep = "+"
  )
)
> regform
[1] "Y ~ W1+A+W1 *I( W2 ^2)+W1 *I( A ^2)"
> glm(as.formula(regform),data=samp,family=binomial)
Using all kinds of conditions (on name, on structure, whatever) to fill the data frame allows me to quickly select groups of variables in large datasets.
