Subsetting with dredge function (MuMIn) - r

I'm trying to subset a series of models dredged from a global model that has both linear & non-linear terms. There are no interactions e.g.
Glblm <- Y ~ X1 + X2 + X3 + I(X3^2) + X4 + X5 + X6 + I(X6^2) + X7 + I(X7^2)
I want to specify that X3^2 should never appear without X3, but X3 could appear alone without X3^2 (and the same for X6 & X7).
I have tried the following as I understood from the documentation:
ssm <-dredge (Glblm, subset=(X3| !I(X3^2)) && (X6| !I(X6^2)) && (X7| !I(X7^2)))
I also tried making a subset first as I read https://stackoverflow.com/questions/55252019/dredge-in-mumin-r-keeps-models-with-higher-order-terms-without-their-respectiv
e.g.
hbfsubset <- expression( dc(X3, `I(X3^2)`) & dc(`X6`, `I(X6^2)`)& dc(`X7`, `I(X7^2)`))
ssm <-dredge (Glblm, subset=hbfsubset)
Neither has produced a subset of models; instead, the full list of models is returned when inspecting 'ssm' using:
model.sel(ssm)
Any help would be greatly appreciated.

A reproducible example is needed to pinpoint the issue, including the type of model you are fitting.
In simple linear models (lm, as in the examples in the MuMIn handbook), the names of the fitted terms are exactly what you typed in the global model, but this may not be the case for more complex models (e.g. glmmTMB).
Here is an example:
library(MuMIn)
library(glmmTMB)
# a simple linear model, using Cement data from MuMIn
m1 <- lm(y ~ X1 + I(X1^2) + X2 + I(X2^2), data = Cement, na.action = "na.fail")
# dredge without a subset
d1 <- dredge(m1)
# 16 models produced
# dredge with a subset
d1_sub <- dredge(m1, subset = dc(`X1`, `I(X1^2)`) & dc(`X2`, `I(X2^2)`))
# 9 models produced, works totally fine
# a glmmTMB linear model
m2 <- glmmTMB(y ~ X1 + I(X1^2) + X2 + I(X2^2), data = Cement, na.action = "na.fail")
# dredge without a subset
d2 <- dredge(m2)
# 16 models produced
# dredge with a subset
d2_sub <- dredge(m2, subset = dc(`X1`, `I(X1^2)`) & dc(`X2`, `I(X2^2)`))
# 16 models produced: the subset had no effect, and no warning or error was raised
# this is because the term names that dredge() uses for a glmmTMB object are no longer the same as in the typed global model:
names(d2_sub)
# [1] "cond((Int))" "disp((Int))" "cond(X1)" "cond(I(X1^2))" "cond(X2)" "cond(I(X2^2))" "df" "logLik" "AICc"
# [10] "delta" "weight"
# e.g., now the X1 in the typed global model is actually called cond(X1)
# what will work for glmmTMB:
d2_sub <- dredge(m2, subset = dc(`cond(X1)`, `cond(I(X1^2))`) & dc(`cond(X2)`, `cond(I(X2^2))`))
# 9 models produced
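
To see why 16 models drop to 9, here is a base-R sketch that does not call MuMIn at all. It enumerates every inclusion pattern for the four terms and applies the rule that the dc() calls above express (a squared term is allowed only when its linear term is present). The names grid and keep are illustrative, not part of any package.

```r
# All 2^4 inclusion patterns for the four terms of the global model
terms <- c("X1", "I(X1^2)", "X2", "I(X2^2)")
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(terms)))
names(grid) <- terms

# A quadratic term may be present only if its linear term is present,
# i.e. (linear | !quadratic) must hold for each pair
keep <- (grid[["X1"]] | !grid[["I(X1^2)"]]) &
        (grid[["X2"]] | !grid[["I(X2^2)"]])

nrow(grid)  # 16 candidate models without a subset
sum(keep)   # 9 candidate models that satisfy the rule
```

Each linear/quadratic pair has three admissible states (neither, linear only, both), so two pairs give 3 * 3 = 9 models.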

How to loop an ANCOVA test across multiple variables? [duplicate]

I am new to R and I want to improve the following script with an *apply function (I have read about apply, but I couldn't manage to use it). I want to use lm function on multiple independent variables (which are columns in a data frame). I used
for (i in 1:3) {
  assign(paste0('lm.', names(data[i])), lm(formula = formula(i), data = data))
}
formula(i) is defined as
formula <- function(x)
{
  as.formula(paste(names(data[x]), '~', paste0(names(data[-1:-3]), collapse = '+')), env = parent.frame())
}
Thank you.
If I understand you correctly, you are working with a dataset like this:
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
x1, x2 and x3 are covariates, and y1, y2, y3 are three independent responses. You are trying to fit three linear models:
y1 ~ x1 + x2 + x3
y2 ~ x1 + x2 + x3
y3 ~ x1 + x2 + x3
Currently you loop through y1, y2, y3, fitting one model at a time. You hope to speed the process up by replacing the for loop with lapply.
You are on the wrong track. lm() is an expensive operation. Unless your dataset is small, the cost of the for loop itself is negligible, so replacing the for loop with lapply gives no performance gain.
Since you have the same RHS (right-hand side of ~) for all three models, the model matrix is the same for all three. Therefore, the QR factorization needs to be done only once. lm allows this, and you can use:
fit <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)
#Coefficients:
#                    y1         y2         y3
#(Intercept)  -0.081155   0.042049   0.007261
#x1           -0.037556   0.181407  -0.070109
#x2           -0.334067   0.223742   0.015100
#x3            0.057861  -0.075975  -0.099762
If you check str(fit), you will see that this is not a list of three linear models; instead, it is a single linear model with a single $qr object, but with multiple LHS. So $coefficients, $residuals and $fitted.values are matrices. The resulting linear model has an additional "mlm" class besides the usual "lm" class. I created a special mlm tag collecting some questions on the theme, summarized by its tag wiki.
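
As a quick check of the description above, here is a sketch on the same simulated data. It compares the single multi-response fit with three separate lm() calls. The names fit_mlm and fit_sep are illustrative only.

```r
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                  x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))

# One fit with a matrix response
fit_mlm <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)
class(fit_mlm)       # "mlm" "lm"
dim(coef(fit_mlm))   # 4 3: one column of coefficients per response

# Three separate fits, one per response, collected into a 4 x 3 matrix
fit_sep <- sapply(c("y1", "y2", "y3"), function(y) {
  coef(lm(reformulate(c("x1", "x2", "x3"), response = y), data = dat))
})

# The two approaches should agree up to numerical tolerance
all.equal(unname(coef(fit_mlm)), unname(fit_sep))
```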
If you have a lot more covariates, you can avoid typing or pasting formula by using .:
fit <- lm(cbind(y1, y2, y3) ~ ., data = dat)
#Coefficients:
#                    y1         y2         y3
#(Intercept)  -0.081155   0.042049   0.007261
#x1           -0.037556   0.181407  -0.070109
#x2           -0.334067   0.223742   0.015100
#x3            0.057861  -0.075975  -0.099762
Caution: Do not write
y1 + y2 + y3 ~ x1 + x2 + x3
This will treat y = y1 + y2 + y3 as a single response. Use cbind().
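
A small sketch of what goes wrong, again on the simulated data from this answer (fit_sum and fit_mlm are illustrative names): the + version returns a single coefficient vector for the summed response, while cbind() returns one column per response.

```r
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                  x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))

fit_sum <- lm(y1 + y2 + y3 ~ x1 + x2 + x3, data = dat)        # one response: the sum
fit_mlm <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)   # three responses

is.matrix(coef(fit_sum))  # FALSE: a plain vector of 4 coefficients
is.matrix(coef(fit_mlm))  # TRUE: a 4 x 3 matrix

# Least squares is linear in the response, so the summed-response
# coefficients are the row sums of the per-response coefficients
all.equal(coef(fit_sum), rowSums(coef(fit_mlm)))
```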
Follow-up:
I am interested in a generalization. I have a data frame df, where the first n columns are dependent variables (y1, y2, y3, ...) and the next m columns are independent variables (x1, x2, x3, ...). For n = 3 and m = 3 it is fit <- lm(cbind(y1, y2, y3) ~ ., data = dat). But how can I do this automatically, using the structure of df? I mean something like (for i in (1:n)) fit <- lm(cbind(df[something] ~ df[something], data = dat)). I have created that "something" with paste and paste0. Thank you.
So you are programming your formula, or want to dynamically generate / construct model formulae in the loop. There are many ways to do this, and many Stack Overflow questions are about this. There are commonly two approaches:
use reformulate;
use paste / paste0 and formula / as.formula.
I prefer reformulate for its neatness; however, it does not support multiple LHS in the formula. It also needs some special treatment if you want to transform the LHS. So in what follows I use the paste solution.
For your data frame df, you may do
paste0("cbind(", paste(names(df)[1:n], collapse = ", "), ")", " ~ .")
A neater way is to use sprintf and toString to construct the LHS:
sprintf("cbind(%s) ~ .", toString(names(df)[1:n]))
Here is an example using the iris dataset:
string_formula <- sprintf("cbind(%s) ~ .", toString(names(iris)[1:2]))
# "cbind(Sepal.Length, Sepal.Width) ~ ."
You can pass this string formula to lm, as lm will automatically coerce it into formula class. Or you may do the coercion yourself using formula (or as.formula):
formula(string_formula)
# cbind(Sepal.Length, Sepal.Width) ~ .
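
To round off the iris example, here is a sketch that passes the constructed formula on to lm(). It assumes, as in the example above, that the first n columns are the responses and every remaining column is a covariate. The names n and fit_iris are illustrative.

```r
n <- 2  # number of leading response columns (assumption for this sketch)
string_formula <- sprintf("cbind(%s) ~ .", toString(names(iris)[1:n]))

# Coerce explicitly, then fit; "." expands to all columns not on the LHS
fit_iris <- lm(as.formula(string_formula), data = iris)

class(fit_iris)            # "mlm" "lm"
colnames(coef(fit_iris))   # one column per response
rownames(coef(fit_iris))   # intercept, the two Petal columns, Species dummies
```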
Remark:
This multiple LHS formula is also supported elsewhere in R core:
the formula method for function aggregate;
ANOVA analysis with aov.

fixest vs lm - different results? (difference in difference)

I'm trying to do a 'classic' difference in difference with multiple time periods. The model I want to do is:
y = a + b1*x1 + b2*treat + b3*period + b4*(treat*period) + u (eq. 1)
So basically I'm testing different setups with different packages to make sure I specify my model correctly. I want to use the fixest package, so I compared its estimates with those from base R's lm(). The results, however, differ in both coefficients and standard errors.
My question is:
Is either the lm_mod, lm_mod2 or the feols_mod regressions specified correctly (as in eq.1)?
If not, I would appreciate it if anyone can show me how to get the same results in lm() as in feols()!
# libraries
library(fixest)
library(modelsummary)
library(tidyverse)
# load data
data(base_did)
# make df for lm_mod with 5 as the reference-period
base_ref_5 <- base_did %>%
  mutate(period = as.factor(period)) %>%
  mutate(period = relevel(period, ref = 5))
# Notice that I use base_ref_5 for the lm models and base_did for feols_mod.
lm_mod <- lm(y ~ x1 + treat*period, base_ref_5)
lm_mod2 <- lm(y ~ x1 + treat + period + treat*period, base_ref_5)
feols_mod <- feols(y ~ x1 + i(period, treat, ref = 5), base_did)
# compare models
models <- list("lm" = lm_mod,
               "lm2" = lm_mod2,
               "feols" = feols_mod)
msummary(models, stars = TRUE)
**EDIT:**
the reason why I created base_ref_5 was so that both regressions would have period 5 as the reference period, if that was unclear.
**EDIT 2**:
added a third model (lm_mod2) which is much closer, but there is still a difference.
There are two issues here.
In the lm() model, the period variable is interacted, but treated as a continuous numeric variable. In contrast, calling i(period, treat) treats period as a factor (this is explained clearly in the documentation).
The i() function only includes the interactions, and not the constitutive terms.
Here are two models to illustrate the parallels:
library(fixest)
data(base_did)
lm_mod <- lm(y ~ x1 + factor(period) * factor(treat), base_did)
feols_mod <- feols(y ~ x1 + factor(period) + i(period, treat), base_did)
coef(lm_mod)["x1"]
#> x1
#> 0.9799697
coef(feols_mod)["x1"]
#> x1
#> 0.9799697
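
The first issue (numeric versus factor period) can be seen with base R alone. The sketch below uses a hypothetical toy data frame, not base_did, and the names toy, fit_num and fit_fac are illustrative.

```r
set.seed(1)
# Hypothetical toy panel: 5 periods, binary treatment, 40 rows per period
toy <- data.frame(
  period = rep(1:5, each = 40),
  treat  = rep(c(0, 1), times = 100)
)
toy$y <- rnorm(nrow(toy)) + 0.5 * toy$treat * (toy$period >= 3)

# period as numeric: one linear slope and one interaction coefficient
fit_num <- lm(y ~ treat * period, data = toy)
# period as factor: one dummy per non-reference period, each interacted with treat
fit_fac <- lm(y ~ treat * factor(period), data = toy)

names(coef(fit_num))
length(coef(fit_num))  # 4
length(coef(fit_fac))  # 10
```

With a numeric period the interaction is a single slope shift, whereas the factor version estimates a separate treatment effect for each period relative to the reference, which is the structure i(period, treat) produces.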
Please note that I only answered the part of your question about parallels between lm and feols. StackOverflow is a programming Q&A site. If you have questions about the proper specification of a statistical model, you might want to ask on CrossValidated.

Regression using plm package and twoways effect, when data has NA

So, I'd like to run a regression on a panel data, using two-ways effects, for time and stores. If the panel is perfectly balanced, it works fine, but for some reason, if it's not, the code gets stuck. (see: https://stat.ethz.ch/pipermail/r-help/2010-May/239272.html).
My data in particular is not unbalanced in nature, but it has some NAs, so I guess it's becoming unbalanced when the plm function removes rows with NA.
I wrote a sample code to exemplify the data I have.
If I run this:
set.seed(123)
library(plm)
number.of.days <- 1100
number.of.stores <- 1000
days <- sort(rep(c(1:number.of.days),number.of.stores))
stores <- rep(c(1:number.of.stores),number.of.days)
data <- cbind.data.frame(stores,days,matrix(rnorm(number.of.days*number.of.stores*7),nrow=number.of.days*number.of.stores,ncol=7))
colnames(data)[3:9] <- c('y',paste0('x',1:6))
data <- plm.data(data,c("stores","days"))
fit <- plm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = data, index=c("stores","days"), effect="twoway", model="within")
It works correctly, because the panel is balanced. However, if I create some NA values:
# note the parentheses: 1:a*b is (1:a)*b, which would sample only multiples of b
data$y[sample(1:(number.of.days*number.of.stores),150)] <- NA
data$x1[sample(1:(number.of.days*number.of.stores),150)] <- NA
data$x2[sample(1:(number.of.days*number.of.stores),150)] <- NA
data$x3[sample(1:(number.of.days*number.of.stores),150)] <- NA
data$x4[sample(1:(number.of.days*number.of.stores),150)] <- NA
data$x5[sample(1:(number.of.days*number.of.stores),150)] <- NA
data$x6[sample(1:(number.of.days*number.of.stores),150)] <- NA
And try to run the regression again:
fit <- plm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = data, index=c("stores","days"), effect="twoway", model="within")
It does not work (the code apparently never stops running)
I tried using 'individual' effect for stores and adding a matrix with dummies for time, but since there are 1100 days, it becomes just as slow.
I assume this is not a rare problem. Is there any known solution?
Thank you
The felm function from the lfe package is able to handle this (and efficiently, too).
Running
library(lfe)
fit2 <- felm(y ~ x1 + x2 + x3 + x4 + x5 + x6 | stores + days | 0 | stores , data = data)
on the data with the NAs yields a result.
Note the formula specification, whose parts are separated by |. After the ordinary covariates, the second part lists the factors to be projected out (i.e. the fixed effects), the third part is the instrumental-variable specification (0 means none), and the last stores specifies the variable for clustering standard errors. For details see the excellent felm help file and lfe package documentation.
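
The suspicion in the question, that dropping rows with NA is what unbalances the panel, can be checked on a small scale with base R. This is a sketch on a hypothetical 4-store, 5-day panel. The helper is_balanced is made up for illustration and assumes there are no duplicated store/day rows.

```r
set.seed(123)
# Hypothetical small panel: every store observed on every day
panel <- expand.grid(stores = 1:4, days = 1:5)
panel$y  <- rnorm(nrow(panel))
panel$x1 <- rnorm(nrow(panel))

# Balanced means every store/day combination is present exactly once
is_balanced <- function(d) {
  nrow(d) == length(unique(d$stores)) * length(unique(d$days))
}
is_balanced(panel)   # TRUE

# Set two responses to NA, then drop incomplete rows as a model fit would
panel$y[c(2, 7)] <- NA
used <- panel[complete.cases(panel), ]
table(used$stores)   # stores no longer have equal numbers of days
is_balanced(used)    # FALSE
```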
