Creating new formula with one extra variable in R

I would like to create a new formula in R based on an existing one; the only difference should be one additional variable.
For example, I have:
formula = as.formula(price ~ speed + hp + mpg)
formula2 = as.formula(paste0(format(formula), "+ factor(DEPARTMENT)-1"))
However, the code is not working. The result I want is:
formula2 = price ~ speed + hp + mpg + factor(DEPARTMENT) - 1

?update.formula seems to be exactly what you are looking for!
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.formula.html
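A minimal sketch of that suggestion, using the formula from the question. DEPARTMENT is assumed to be a column in the asker's data; no data is needed just to build the formula.

```r
# Start from the original formula
formula <- price ~ speed + hp + mpg

# "." on each side stands for the corresponding part of the old formula,
# so this appends one term and removes the intercept
formula2 <- update(formula, . ~ . + factor(DEPARTMENT) - 1)
print(formula2)
```

Because update() works on the formula object itself, there is no need to convert it to a string and back.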


Error when using regr() command: undefined columns selected

I get the following error when trying to run the regr() command from the yhat package:
Error in `[.data.frame`(new.data, , c(DV, IVx)) :
undefined columns selected
Here is the code I'm using:
DEregr_model <- lm(TotalBiomass ~ propnC + propnV + propnR + I(propnC^2) + I(propnV^2) + propnC:propnV + propnV:propnR + propnV:I(propnC^2), DE_model)
DEregrout <- regr(DEregr_model)
Why is this function returning an error?
I think I can demonstrate my suspicion expressed in the comments with this MCVE:
> lm.gas <- lm( mpg ~ hp + disp +hp:I(disp^2), data= mtcars)
> lm.gas
Call:
lm(formula = mpg ~ hp + disp + hp:I(disp^2), data = mtcars)
Coefficients:
(Intercept) hp disp hp:I(disp^2)
3.562e+01 -4.168e-02 -5.879e-02 3.151e-07
> install.packages("yhat")
also installing the dependency ‘yacca’
> library(yhat)
> regr(lm.gas)
Error in `[.data.frame`(new.data, , c(DV, IVx)) :
undefined columns selected
In addition: Warning message:
In regr(lm.gas) : NAs introduced by coercion
I suspect that the I(.) terms are not being saved in the result of the lm call in a manner that the regr function is able to handle.
The workaround would be to calculate the values of the squared variables with separate names in an augmented dataset.
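A sketch of that workaround on the same mtcars example. The column name disp2 and the data frame name cars are chosen here for illustration only, and the regr() call is left commented out since it needs the yhat package.

```r
# Augmented copy of mtcars with the squared term stored as its own column
cars <- mtcars
cars$disp2 <- cars$disp^2

# Original model with I(), and the same model using the new column
lm.gas  <- lm(mpg ~ hp + disp + hp:I(disp^2), data = mtcars)
lm.gas2 <- lm(mpg ~ hp + disp + hp:disp2, data = cars)

# Both fits give the same coefficients; only the term names differ
print(all.equal(unname(coef(lm.gas)), unname(coef(lm.gas2))))

# With yhat installed, the refit model can then be tried:
# yhat::regr(lm.gas2)
```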
Based on the comments, I figured out the issue. The I() terms (e.g., I(propnV^2)) weren't being read correctly by the function. So I added columns to my data frame holding the squared values, so that the model reads these terms as ordinary variables instead of trying to take them apart. Corrected code is below:
## make new columns for the squared seeding rate proportions
DE$propnC2 <- DE$propnC^2
DE$propnV2 <- DE$propnV^2
DE$propnR2 <- DE$propnR^2
## run lm model with adjusted terms
DEregr_model <- lm(TotalBiomass ~ propnC + propnV + propnR + propnC2 + propnV2 + propnC:propnV + propnV:propnR + propnV:propnC2, DE_model)
DEregrout <- regr(DEregr_model)
The regr() function now runs without error, thanks everyone for your input!

Limit on the Number of Variables for bestglm Function

I am trying to run the bestglm function in R for subset selection and the run fails immediately if I use more than 15 variables in the function. I attached some sample code below (I know these models have far too many variables for this dataset, I am just including these models here as an example):
library(bestglm)
cars.df = data.frame(mtcars)
cars.df
resp.var = cars.df$mpg
ind.matrix.15 = model.matrix(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb + disp:wt + drat:wt + qsec:am + gear:hp + cyl:disp + drat:gear, data = cars.df)[, -1]
matrix.xy.15 = data.frame(ind.matrix.15, y = as.matrix(resp.var))
bestglm(Xy = matrix.xy.15, family = gaussian(link = 'log'), nvmax = 15)
ind.matrix.16 = model.matrix(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb + disp:wt + drat:wt + qsec:am + gear:hp + cyl:disp + drat:gear + disp:hp, data = cars.df)[, -1]
matrix.xy.16 = data.frame(ind.matrix.16, y = as.matrix(resp.var))
bestglm(Xy = matrix.xy.16, family = gaussian(link = 'log'), nvmax = 16)
The first bestglm function runs fine, but when I add an additional variable for a total of 16 features, the second bestglm function instantly produces this error message: p = 16. must be <= 15 for GLM.
Changing the method argument to a simpler algorithm such as backward rather than the default exhaustive does not make the error go away.
Is this just a limitation of the bestglm function, or is there an argument I can change to allow more than 15 features?
As @RomanLuštrik says, this is a hard-coded constraint in bestglm, presumably because 15 predictors means there are 2^15 = 32768 candidate models, and one has to stop somewhere ... as far as I can see there is no way around this constraint when running a GLM. (Roman's suggestion of RequireFullEnumerationQ=FALSE doesn't work, because the leaps-and-bounds algorithm is only available for linear models, not GLMs.)
One possible strategy (not fully explored here) would be to fit the linear model exhaustively with leaps-and-bounds, save a large number of the top models (say TopModels=1000) and then re-evaluate the top models with your preferred variance structure ... this doesn't work directly in leaps, but can be hacked as follows:
leaps.obj <- leaps:::leaps.setup(matrix.xy.16,y=cars.df$mpg,nvmax=16,
nbest=10000)
bb <- leaps:::leaps.exhaustive(leaps.obj, really.big=TRUE)
but I don't know how to re-evaluate these models with a log-link Gaussian, and it seems like a lot of work to figure out.
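One way the re-evaluation step might look, as a rough sketch rather than a tested solution. The candidate subsets below are hypothetical placeholders; in practice they would be extracted from the top models of the linear-model search, which is not shown here.

```r
# Hypothetical candidate subsets standing in for the top linear models
candidates <- list(
  c("wt"),
  c("wt", "hp"),
  c("wt", "hp", "qsec")
)

# Refit each subset as a Gaussian GLM with a log link
fits <- lapply(candidates, function(vars) {
  glm(reformulate(vars, response = "mpg"),
      family = gaussian(link = "log"), data = mtcars)
})

# Rank the refitted models by AIC (another criterion could be swapped in)
aic <- sapply(fits, AIC)
names(aic) <- sapply(candidates, paste, collapse = " + ")
print(sort(aic))
```

This only re-ranks models that the linear search already shortlisted, so a model that is good under the log link but poor under the identity link could still be missed.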
You might be able to get the glmulti package to work (it offers both method="h" for full enumeration and method="g" for a genetic algorithm), but so far I haven't managed to overcome some Java errors ...
Unfortunately, the J Stat Software article describing glmulti shows that this method has some of the same constraints:
For performance, the Java classes encode formulas as compact bit strings. Currently two integers (32 bits each) are used for main effects, and two long integers (128 bits) are used for each category of interaction terms (factor:factor, covariate:covariate, and factor:covariate), to encode models. This means that there can be at most 32 factors and 32 covariates, and, if including interactions, at most 128 interactions of each category. The latter constraint necessitates that, if x is the number of factors and y the number of covariates: x < 16, y < 16, xy < 128.

Margins Package error using quadratic and interaction terms

I have code which uses the margins command in Stata and I am trying to replicate it in R using the "margins" package found here and on cran.
I keep getting the error:
marg1<-margins(reg2)
Error in names(classes) <- clean_terms(names(classes)) : 'names' attribute [18] must be the same length as the vector [16]
A minimal reproducible example is shown below:
install.packages("margins")
library(margins)
mod1 <- lm(log(mpg) ~ vs + cyl + hp + vs*hp + I(vs*hp*hp) + wt + I(hp*hp), data = mtcars)
(marg1 <- margins(mod1))
summary(marg1)
I need vs to be a dummy variable interacted with both a quadratic term and a normal interaction.
Does anyone know what I am doing wrong or if there is a way around this?
Your model specification is a bit confusing. For example, vs*hp introduces three terms: i) vs, ii) hp, and iii) the interaction of vs and hp. As a result, hp appears twice in the formula you provided. You can simplify massively! Try this, for example (I think it is what you want):
mtcars$hp2 = mtcars$hp^2
mod1 <- lm(log(mpg) ~ cyl + wt + vs*hp + vs*hp2, data = mtcars)
summary(mod1) # With this you can check that the model you specified is what you want
(marg1 <- margins(mod1)) # The error disappeared.
summary(marg1)
In general, I would recommend avoiding I() in formula specifications, as it often gives rise to such errors when not treated with enough care (though sometimes one cannot avoid it). Good luck!
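The expansion of vs*hp mentioned above can be checked without fitting anything, by asking terms() for the term labels. This is only an inspection aid; y is a placeholder response name.

```r
# vs*hp is shorthand for vs + hp + vs:hp
labels_short <- attr(terms(y ~ vs * hp), "term.labels")
print(labels_short)

# Term labels of the simplified model from the answer
# (hp2 is the precomputed square of hp)
labels_model <- attr(terms(log(mpg) ~ cyl + wt + vs * hp + vs * hp2),
                     "term.labels")
print(labels_model)
```

Repeated terms are listed only once, which makes it easy to see what the model will actually contain.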

Iterate over formula elements in R

Is there a way to iterate over a formula in R?
Let's say we have a formula given as: as.formula(hp ~ factor(gear) + qsec + am)
I need to iterate over the elements of the formula so that I can create 3 models (3 because we use 3 regressors, not counting dummies).
I need to create the first model as as.formula(hp ~ factor(gear)), then the second as as.formula(hp ~ factor(gear) + qsec), and lastly as.formula(hp ~ factor(gear) + qsec + am).
Can we somehow use just one regressor in the first iteration, then two, and then three?
I need to automate this inside a function, so doing it by hand is not an option.
My approach here: create a string using sprintf and paste (with collapse option), coerce it to a formula, and then loop over the elements you want to include.
elements <- c("factor(gear)", "qsec", "am")
for (i in seq_along(elements)) {
fmla <- as.formula(sprintf("hp ~ %s", paste(elements[1:i], collapse = " + ")))
print(fmla)
print(summary(lm(fmla, data = mtcars)))
}
If you need to parse the formula you were given, you could do something like this before running the loop above (might need to be modified for your specific setup):
library(stringr)
input_fmla <- "as.formula(hp ~ factor(gear) + qsec + am)"
temp <- str_remove_all(input_fmla, "(as.formula\\([^ ]* ~ |\\)$)")
elements <- trimws(str_split(temp, pattern = "\\+")[[1]])
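If the input is an actual formula object rather than a string, a possible alternative to the regular expression is to read the term labels with terms() and rebuild the nested formulas with reformulate(). This is a sketch under that assumption, not part of the answer above; the variable names are illustrative.

```r
input_fmla <- hp ~ factor(gear) + qsec + am

# Right-hand-side terms, in the order they appear in the formula
elements <- attr(terms(input_fmla), "term.labels")

# Name of the response variable on the left-hand side
response <- all.vars(input_fmla)[1]

# Build hp ~ term1, hp ~ term1 + term2, and so on
nested <- lapply(seq_along(elements), function(i) {
  reformulate(elements[1:i], response = response)
})

# Fit one model per formula
models <- lapply(nested, function(f) lm(f, data = mtcars))
for (f in nested) print(f)
```

Note that terms() orders labels by interaction degree, so the sequence matches the written order only when the formula contains main effects alone, as it does here.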

removing covariates from a linear mixed model using update

I'm newish to R. I have a linear mixed model with several predictors and I want to test the significance of each of them. I know that I could use lmerTest but my co-authors want me to do a likelihood ratio test for each predictor instead. I would like to use the update function to get a series of submodels that omit each predictor in turn. I tried the following
library(lme4)
data(mtcars)
h=lmer(mpg ~ 1 + cyl + disp + hp + drat + (1|carb), data=mtcars)
predvars=c("cyl","disp","hp","drat")
for (i in predvars){
modelform=update(as.formula(paste0("h, . ~ . -",i)))
print(summary(modelform))
}
I got the following error
Error in parse(text = x, keep.source = FALSE) :
:1:2: unexpected ','
1: h,
^
I also tried using lapply
Fits=lapply(predvars, function(x) {update(h, .~.-i, list(i=as.name(x)))})
names(Fits)=predvars
which doesn't actually update the model, it just refits the full model i times. What am I doing wrong? Thanks.
Your first attempt generates an error because you put h inside the string passed to as.formula. Pass the model to update directly and build only the formula from the string:
modelform <- update(h, as.formula(paste0(". ~ . -",i)))
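A sketch of the complete drop-one loop built on that call. To keep it runnable with base R alone, an ordinary lm() fit stands in for the lmer() model; that substitution is an assumption of this sketch, not the asker's setup. The likelihood ratio statistic is computed by hand from logLik().

```r
# Stand-in full model (the question uses lmer with a random intercept)
full <- lm(mpg ~ cyl + disp + hp + drat, data = mtcars)
predvars <- c("cyl", "disp", "hp", "drat")

# Refit the model once per predictor, dropping that predictor each time
reduced <- lapply(predvars, function(v) {
  update(full, as.formula(paste(". ~ . -", v)))
})
names(reduced) <- predvars

# Likelihood ratio test of each reduced model against the full model
# (1 degree of freedom, since each predictor is a single numeric column)
lrt <- sapply(reduced, function(m) {
  stat <- as.numeric(2 * (logLik(full) - logLik(m)))
  c(statistic = stat,
    p.value = pchisq(stat, df = 1, lower.tail = FALSE))
})
print(t(lrt))
```

The lapply attempt in the question refits the full model because a formula is not evaluated: the i in . ~ . - i is taken literally as a term named i, which is not in the model. With an lmer fit, the same update() call should apply, and anova(h, reduced_model) on two fitted models reports a likelihood ratio test.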
