Limit on the Number of Variables for bestglm Function - r

I am trying to run the bestglm function in R for subset selection, and the run fails immediately if I use more than 15 variables. Some sample code is below (I know these models have far too many variables for this dataset; I am only including them as an example):
library(bestglm)
cars.df = data.frame(mtcars)
resp.var = cars.df$mpg
ind.matrix.15 = model.matrix(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb + disp:wt + drat:wt + qsec:am + gear:hp + cyl:disp + drat:gear, data = cars.df)[, -1]
matrix.xy.15 = data.frame(ind.matrix.15, y = as.matrix(resp.var))
bestglm(Xy = matrix.xy.15, family = gaussian(link = 'log'), nvmax = 15)
ind.matrix.16 = model.matrix(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb + disp:wt + drat:wt + qsec:am + gear:hp + cyl:disp + drat:gear + disp:hp, data = cars.df)[, -1]
matrix.xy.16 = data.frame(ind.matrix.16, y = as.matrix(resp.var))
bestglm(Xy = matrix.xy.16, family = gaussian(link = 'log'), nvmax = 16)
The first bestglm function runs fine, but when I add an additional variable for a total of 16 features, the second bestglm function instantly produces this error message: p = 16. must be <= 15 for GLM.
Changing the method argument to a simpler algorithm such as backward rather than the default exhaustive does not make the error go away.
Is this just a limitation of the bestglm function, or is there an argument I can change to allow more than 15 features?

As @RomanLuštrik says, this is a hard-coded constraint in bestglm, presumably because 15 predictors means there are 2^15 = 32768 candidate models, and one has to stop somewhere ... as far as I can see there is no way around this constraint when running a GLM. (Roman's suggestion of RequireFullEnumerationQ=FALSE doesn't work, because the leaps-and-bounds algorithm is only available for linear models, not GLMs.)
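To see how quickly the model space grows, here is the arithmetic for a few values of p. This is only an illustration; the values of p other than 15 and 16 are arbitrary.

```r
# Number of candidate subsets (including the empty model) for p predictors
p <- c(15, 16, 20, 30)
n_models <- 2^p
data.frame(p = p, n_models = n_models)
```

Each extra predictor doubles the number of GLM fits an exhaustive search has to perform.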
One possible strategy (not fully explored here) would be to fit the linear model exhaustively with leaps-and-bounds, save a large number of the top models (say TopModels=1000) and then re-evaluate the top models with your preferred variance structure ... this doesn't work directly in leaps, but can be hacked as follows:
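# Use the predictor matrix only (matrix.xy.16 also contains the response y).
# nbest=1000 is an assumed completion of the truncated call.
leaps.obj <- leaps:::leaps.setup(ind.matrix.16, y=cars.df$mpg, nvmax=16,
                                 nbest=1000)
bb <- leaps:::leaps.exhaustive(leaps.obj, really.big=TRUE)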
but I don't know how to re-evaluate these models with a log-link Gaussian, and it seems like a lot of work to figure out.
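For what it's worth, the re-evaluation step itself might look something like the sketch below, once a shortlist of predictor subsets is available. The shortlist here is hypothetical (typed by hand, not extracted from the leaps object), and ranking by BIC is just one possible choice of criterion.

```r
# Hypothetical shortlist of predictor subsets; in practice these would
# come from the top models of the linear-model search.
candidates <- list(
  c("wt"),
  c("wt", "hp"),
  c("wt", "hp", "qsec"),
  c("disp", "drat")
)

# Refit one subset as a log-link Gaussian GLM and return its BIC
refit <- function(vars) {
  f   <- reformulate(vars, response = "mpg")
  fit <- glm(f, family = gaussian(link = "log"), data = mtcars)
  data.frame(model = paste(vars, collapse = " + "), BIC = BIC(fit))
}

ranking <- do.call(rbind, lapply(candidates, refit))
ranking <- ranking[order(ranking$BIC), ]  # smallest BIC first
ranking
```

This only re-ranks models that the linear search already found, so a model that is good under the log link but poor under the identity link could still be missed.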
You might be able to get the glmulti package to work (it offers both method="h" for full enumeration and method="g" for a genetic algorithm), but so far I haven't managed to overcome some Java errors ...
Unfortunately, the J Stat Software article describing glmulti shows that this method has some of the same constraints:
For performance, the Java classes encode formulas as compact bit strings. Currently two integers (32 bits each) are used for main effects, and two long integers (128 bits) are used for each category of interaction terms (factor:factor, covariate:covariate, and factor:covariate), to encode models. This means that there can be at most 32 factors and 32 covariates, and, if including interactions, at most 128 interactions of each category. The latter constraint necessitates that, if x is the number of factors and y the number of covariates: x < 16, y < 16, xy < 128


translating code from glmer to gam (general additive model)

I was using glmer for a logistic regression model with 2.5 million observations. However, after I added the multi-level component (a few hundred thousand groups), the data was too large to run in a timely manner on my computer. I want to try a generalized additive model (GAM) instead, but I am confused about how to write the code.
The glmer code is as follows:
mylogit.m1a <- glmer(outcome ~
exposure*risk + tenure.yr + CurrentAge + employment + rentership + pop.change + pop.den.k +
(1 | geo_id / house_id),
data = temp, family = "binomial", control = glmerControl(optimizer="bobyqa", calc.derivs=FALSE))
The example I found writes the gam like this:
ga_model = gam(
Reaction ~ Days + s(Subject, bs = 're') + s(Days, Subject, bs = 're'),
data = sleepstudy,
method = 'REML'
)
But I am confused about why there are two terms in parentheses, and what I should put in the parentheses to specify my model correctly.
The details are given in ?
Exactly how the random effects are implemented is best seen by
example. Consider the model term ‘s(x,z,bs="re")’. This will
result in the model matrix component corresponding to ‘~x:z-1’
being added to the model matrix for the whole model.
So s(Days, Subject, bs = "re") is equivalent to the (0 + Days|Subject) term in the lmer model: both of them encode "random variation in slope with respect to day across subjects".
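The model matrix component described in the help text can be inspected directly in base R. The sketch below uses a made-up toy data set (three subjects, three days each) rather than sleepstudy.

```r
# Toy data (hypothetical): 3 subjects observed on days 0, 1, 2
d <- data.frame(
  Subject = factor(rep(c("s1", "s2", "s3"), each = 3)),
  Days    = rep(0:2, times = 3)
)

# Model matrix block that s(Days, Subject, bs = "re") corresponds to
Z <- model.matrix(~ Days:Subject - 1, data = d)
dim(Z)       # 9 rows, one column per subject
colnames(Z)
```

Each column holds Days for one subject and zeros elsewhere, which is what a subject-specific slope needs.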
So your (1 | geo_id / house_id) would be translated to mgcv syntax as
s(geo_id, bs = "re") + s(geo_id, house_id, bs = "re")
(the nesting syntax a/b expands in general to a + a:b).
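The expansion of the nesting operator can be checked in base R without any data, since terms() works on the bare formula:

```r
# Expand the nesting operator and list the resulting model terms
nested <- terms(~ geo_id / house_id)
labels_nested <- attr(nested, "term.labels")
labels_nested
```

The two labels returned are the main effect of geo_id and the geo_id:house_id interaction, matching the two s(..., bs = "re") terms above.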
A couple of other comments:
you should probably use bam() as a drop-in replacement for gam() (much faster)
you may very well run into problems with memory usage: mgcv doesn't use sparse matrices for the random effects terms, so they can get big
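Putting the pieces together, a minimal sketch of the translated model with bam() might look like this. The data are simulated stand-ins (20 areas, 5 houses per area, one covariate), so the sizes and effect values are assumptions for illustration, not the real data.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in data: 20 areas x 5 houses x 20 rows per house
d <- expand.grid(rep = 1:20, house_id = factor(1:5), geo_id = factor(1:20))
d$exposure <- rnorm(nrow(d))

geo_eff   <- rnorm(20, sd = 0.5)    # one effect per area
house_eff <- rnorm(100, sd = 0.5)   # one effect per geo_id:house_id combination
eta <- -0.5 + 0.8 * d$exposure +
  geo_eff[as.integer(d$geo_id)] +
  house_eff[as.integer(interaction(d$geo_id, d$house_id))]
d$outcome <- rbinom(nrow(d), 1, plogis(eta))

# (1 | geo_id / house_id) written as two random-effect smooths
fit <- bam(outcome ~ exposure + s(geo_id, bs = "re") + s(geo_id, house_id, bs = "re"),
           data = d, family = binomial)
summary(fit)
```

Both grouping variables must be factors for the "re" basis to build indicator columns.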

Margins Package error using quadratic and interaction terms

I have code which uses the margins command in Stata and I am trying to replicate it in R using the "margins" package found here and on CRAN.
I keep getting the error:
Error in names(classes) <- clean_terms(names(classes)) : 'names' attribute [18] must be the same length as the vector [16]
A minimal reproducible example is shown below:
mod1 <- lm(log(mpg) ~ vs + cyl + hp + vs*hp + I(vs*hp*hp) + wt + I(hp*hp), data = mtcars)
(marg1 <- margins(mod1))
I need vs to be a dummy variable interacted with both a quadratic term and a normal interaction.
Does anyone know what I am doing wrong or if there is a way around this?
Your model specification is a bit confusing. For example, vs*hp introduces 3 terms: i) vs, ii) hp and iii) the interaction of vs and hp. As a result, hp and vs each appear twice in the formula you provided. You can simplify massively! Try this for example (I think it is what you want):
mtcars$hp2 = mtcars$hp^2
mod1 <- lm(log(mpg) ~ cyl + wt + vs*hp + vs*hp2, data = mtcars)
summary(mod1) # With this you can check that the model you specified is what you want
(marg1 <- margins(mod1)) # The error disappeared.
In general, I would recommend avoiding I() in formula specifications, as it often gives rise to such errors when not treated with enough care (though sometimes one cannot avoid it). Good luck!
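The expansion of vs*hp mentioned above can be verified with terms(), which lists the terms a formula actually generates (no data needed):

```r
# Terms generated by the * operator
simple_terms <- attr(terms(log(mpg) ~ vs * hp), "term.labels")
simple_terms

# Terms of the simplified model; hp2 stands for the precomputed hp^2 column
full_terms <- attr(terms(log(mpg) ~ cyl + wt + vs * hp + vs * hp2), "term.labels")
full_terms
```

Duplicated main effects are dropped automatically, so vs appears only once even though it is written twice.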

Creating new formula with one extra variable in R

I would like to create a new formula in R based on another formula; the only difference should be one additional variable.
For example I have:
formula = as.formula(price ~ speed + hp + mpg)
formula2 = as.formula(paste0(format(formula), "+ factor(DEPARTMENT)-1"))
However, the code is not working. The result I want is:
formula2 = price ~ speed + hp + mpg + factor(DEPARTMENT) - 1
?update.formula seems to be exactly what you are looking for!
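A minimal sketch of how that might look for the formula in the question (the object is named base_formula here only to avoid masking the formula() function):

```r
base_formula <- price ~ speed + hp + mpg

# "." on each side stands for the corresponding part of the old formula
formula2 <- update(base_formula, . ~ . + factor(DEPARTMENT) - 1)
formula2

# Inspect the result: the new term is present and the intercept is removed
new_terms     <- attr(terms(formula2), "term.labels")
has_intercept <- attr(terms(formula2), "intercept")
```

Because update() works on the formula object itself, no pasting or deparsing of strings is needed.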

PPML package gravity with time fixed effects

I'm trying to include time fixed effects (dummies for years generated with model.matrix) into a PPML regression in R.
Without time fixed effect the regression is:
my_model <- PPML(y="v", dist="dist",
vce_robust=T, data=database)
I've tried adding the argument fe=c("year") to the PPML call, but it doesn't work.
I'd appreciate any help on this.
I would comment on the previous answer but don't have enough reputation. The gravity model in your PPML command specifies v = dist × exp(land + contig + comlang_ethno + smctry + tech + exrate + TimeFE) = exp(log(dist) + land + contig + comlang_ethno + smctry + tech + exrate + TimeFE).
The formula inside of glm should have as its RHS the variables inside the exponential, because it represents the linear predictor produced by the link function (the Poisson default for which is natural log). So in sum, your command should be
glm(v ~ log(dist) + land + contig + comlang_ethno + smctry + tech + exrate + factor(year),
and in particular, you need to have distance in logs on the RHS (unlike the previous answer).
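As a rough illustration of this specification, here is a sketch on simulated data. The variable names follow the question, but the data, the coefficient values, and the choice of the quasipoisson family are assumptions made for the example.

```r
set.seed(42)
n <- 300

# Simulated stand-in data; values are made up
sim <- data.frame(
  dist = runif(n, 100, 5000),
  land = rbinom(n, 1, 0.2),
  year = sample(2001:2003, n, replace = TRUE)
)

# True model: v = dist^(-1) * exp(10 - 0.5 * land + 0.1 * (year - 2001))
mu    <- exp(10 - 1 * log(sim$dist) - 0.5 * sim$land + 0.1 * (sim$year - 2001))
sim$v <- rpois(n, mu)

# Distance enters in logs; factor(year) supplies the year dummies
fit <- glm(v ~ log(dist) + land + factor(year),
           family = quasipoisson(link = "log"), data = sim)
coef(fit)
```

The coefficient on log(dist) is then read as the distance elasticity, which is the usual interpretation in a gravity model.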
Just make sure that year is a factor; then you can use the plain glm function:
glm(y ~ dist + year, family = "quasipoisson")
which gives you the results with year as dummies/fixed effects. With the fitted model stored as model.PPML, the robust SEs are then calculated with
lmtest::coeftest(model.PPML, vcov = sandwich::vcovHC(model.PPML, type = "HC1"))
The PPML function does nothing more, it just isn't very flexible.
As an alternative to PPML and glm, you can also solve your problem using the function femlm (from package FENmlm), which deals with fixed-effect estimation for maximum likelihood models.
The two main advantages of function femlm are:
you can add as many fixed-effects as you want, and they are handled separately, which makes computing times much shorter than with glm (especially when the fixed-effects contain many categories)
standard-errors can be clustered with intuitive commands
Here's an example regarding your problem (with just two variables and the year fixed-effects):
# (default family is Poisson, 'pipe' separates variables from fixed-effects)
res = femlm(v ~ log(dist) + land | year, base)
summary(res, se = "cluster")
This code estimates the coefficients of variables log(dist) and land with year fixed-effects; then it displays the coefficients table with clustered standard-errors (w.r.t. year) for the two variables.
Going beyond your initial question, now assume you have a more complex case with three fixed-effects: country_i, country_j and year. You'd write:
res = femlm(v ~ log(dist) + land | country_i + country_j + year, base)
You can then easily play around with clustered standard-errors:
# Cluster w.r.t. country_i (default is first cluster encountered):
summary(res, se = "cluster")
summary(res, se = "cluster", cluster = "year") # cluster w.r.t. year cluster
# Two-way clustering:
summary(res, se = "twoway") # two-way clustering w.r.t. country_i & country_j
# two way clustering w.r.t. country_i & year:
summary(res, se = "twoway", cluster = c("country_i", "year"))
For more information on the package, the vignette can be found at

Robust standard error estimation for the Hausman-Taylor estimator using plm() and vcovHC()

Suppose I compute the Hausman-Taylor estimator using the plm command with the option model = "ht". Using the result, I would like to obtain a robust variance-covariance matrix to make inference fully robust. For this purpose the vcovHC() command (part of the plm package) is used. Here is a minimal example:
data("Wages", package = "plm")
ht <- plm(lwage ~ wks + south + smsa + married + exp + I(exp^2) +
bluecol + ind + union + sex + black + ed |
sex + black + bluecol + south + smsa + ind,
data = Wages, model = "ht", index = 595)
vcvHT <- vcovHC(ht,method="arellano")
Error in vcovHC.plm(ht, method = "arellano") :
Model has to be either random, within or pooling model
Technically, as the error message indicates, vcovHC() is unable to compute the VCV matrix since it does not support models of the type computed by plm(...,model="ht")
My question is this:
Why doesn't vcovHC() support the Hausman-Taylor model? Is it because standard errors based on a (cluster) robust VCV matrix shouldn't be used for theoretical reasons (inconsistency etc.), or is it simply not implemented but safe to use (if programmed by hand)?
It is not implemented yet; but as HT is a special kind of IV, it should in principle be possible to compute an HC covariance. I will get around to doing it sometime. A production version requires a lot of interface work and consideration of all possible cases; but an ad-hoc function might be relatively easy to write, based on components from the model object.
