Error when using regr() command: undefined columns selected - r

I get the following error when trying to run the regr() command from the yhat package:
Error in `[.data.frame`(new.data, , c(DV, IVx)) :
undefined columns selected
Here is the code I'm using:
DEregr_model <- lm(TotalBiomass ~ propnC + propnV + propnR + I(propnC^2) + I(propnV^2) + propnC:propnV + propnV:propnR + propnV:I(propnC^2), DE_model)
DEregrout <- regr(DEregr_model)
Why is this function returning an error?

I think I can demonstrate my suspicion expressed in the comments with this MCVE:
> lm.gas <- lm( mpg ~ hp + disp +hp:I(disp^2), data= mtcars)
> lm.gas
Call:
lm(formula = mpg ~ hp + disp + hp:I(disp^2), data = mtcars)
Coefficients:
(Intercept) hp disp hp:I(disp^2)
3.562e+01 -4.168e-02 -5.879e-02 3.151e-07
> install.packages("yhat")
also installing the dependency ‘yacca’
> library(yhat)
> regr(lm.gas)
Error in `[.data.frame`(new.data, , c(DV, IVx)) :
undefined columns selected
In addition: Warning message:
In regr(lm.gas) : NAs introduced by coercion
I suspect that the I(.) terms are not being saved in the result of the lm call in a manner that the regr function is able to handle.
The work around would be to calculate the values of the squared variables with separate names in an augmented dataset.
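A minimal sketch of that workaround on the mtcars MCVE. The names mtcars2, disp2 and lm.gas2 are made up for illustration. The regr() call is left commented out because it needs the yhat package, and I am assuming (not guaranteeing) that it accepts the refitted model.

```r
# Original model, with the squared term wrapped in I()
lm.gas <- lm(mpg ~ hp + disp + hp:I(disp^2), data = mtcars)

# Workaround: store the squared variable as its own named column
mtcars2 <- transform(mtcars, disp2 = disp^2)
lm.gas2 <- lm(mpg ~ hp + disp + hp:disp2, data = mtcars2)

# The term labels no longer contain an I(...) expression
term_labels <- attr(terms(lm.gas2), "term.labels")
print(term_labels)

# Same fit, only the term names differ
print(all.equal(unname(coef(lm.gas)), unname(coef(lm.gas2))))

# library(yhat)
# regr(lm.gas2)  # assumption: regr() can now find every column by name
```

Since the two fits are numerically the same model, nothing is lost by renaming the term.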

Based on the comments, I figured out the issue. The quadratic terms wrapped in I() (e.g., I(propnV^2)) weren't being read correctly by the function. So I added columns holding the squared values to my data frame, so that the model reads each of these terms as an ordinary variable instead of regr() trying to parse the I() expression. Corrected code is below:
## make new columns holding the squared seeding rate proportions
DE$propnC2 <- DE$propnC^2
DE$propnV2 <- DE$propnV^2
DE$propnR2 <- DE$propnR^2
## run lm model with adjusted terms
DEregr_model <- lm(TotalBiomass ~ propnC + propnV + propnR + propnC2 + propnV2 + propnC:propnV + propnV:propnR + propnV:propnC2, DE_model)
DEregrout <- regr(DEregr_model)
The regr() function now runs without error, thanks everyone for your input!

Related

fixest::feols and ggeffects::ggeffect not working together in R

I'm having a hard time getting a fixest object to play nicely with ggeffects in R, when fixed effects are included.
When I run the following code:
m <- feols(mpg ~ disp + gear + hp | cyl, mtcars,
cluster = c("am", "cyl"))
summary(m)
marg1 <- ggeffect(m, terms = c("disp"))
I get an error reading:
Can't compute marginal effects, 'effects::Effect()' returned an error.
Reason: non-conformable arguments
You may try 'ggpredict()' or 'ggemmeans()'.
However, there are no problems when I remove the fixed effects term / include it without using the pipe:
m <- feols(mpg ~ disp + gear + hp + cyl, mtcars,
cluster = c("am", "cyl"))
summary(m)
marg1 <- ggeffect(m, terms = c("disp"))
ggpredict also returns an error on my data (Could not compute variance-covariance matrix of predictions. No confidence intervals are returned.) but I am unable to replicate that same error using the toy data.

Iterating and looping over multiple columns in glm in r using a name from another variable

I am trying to iterate over multiple columns for a glm function in R.
view(mtcars)
names <- names(mtcars[-c(1,2)])
for(i in 1:length(names)){
print(paste0("Starting iterations for ",names[i]))
model <- glm(mpg ~ cyl + paste0(names[i]), data=mtcars, family = gaussian())
summary(model)
print(paste0("Iterations for ",names[i], " finished"))
}
however, I am getting the following error:
[1] "Starting iterations for disp"
Error in model.frame.default(formula = mpg ~ cyl + paste0(names[i]), data = mtcars, :
variable lengths differ (found for 'paste0(names[i])')
Not sure, how I can correct this.
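mpg ~ cyl + paste0(names[i]) or even mpg ~ cyl + names[i] is not valid formula syntax: the formula evaluates paste0(names[i]) to a length-one character vector instead of looking up the column with that name, which is why the variable lengths differ. Use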
reformulate(c("cyl", names[i]), "mpg")
instead, which dynamically creates a formula from variable names.
Since you need to build your model formula dynamically from a string, you need as.formula. Alternatively, consider reformulate, which takes the RHS variable names and the response:
...
fml <- reformulate(c("cyl", names[i]), "mpg")
model <- glm(fml, data=mtcars, family = gaussian())
summary(model)
...
glm takes a formula, which you can create from a string using as.formula():
predictors <- names(mtcars[-c(1,2)])
for(predictor in predictors){
print(paste0("Starting iterations for ",predictor))
model <- glm(as.formula(paste0("mpg ~ cyl + ",predictor)),
data=mtcars,
family = gaussian())
print(summary(model))
print(paste0("Iterations for ",predictor, " finished"))
}
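If you want to keep the fitted models instead of only printing them, here is a small sketch using lapply() with reformulate(). The names models and added_coef are made up for illustration, and it uses base R only.

```r
# Candidate predictors: every column except the response (mpg) and cyl
predictors <- setdiff(names(mtcars), c("mpg", "cyl"))

# Fit one model per predictor; reformulate() builds mpg ~ cyl + <predictor>
models <- lapply(predictors, function(p) {
  glm(reformulate(c("cyl", p), response = "mpg"),
      data = mtcars, family = gaussian())
})
names(models) <- predictors

# Pull out the coefficient of the added predictor from each model
added_coef <- vapply(predictors,
                     function(p) coef(models[[p]])[[p]],
                     numeric(1))
print(added_coef)
```

Afterwards, summary(models$disp) or models[["wt"]] gives you any single fit without rerunning the loop.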

Multiple Linear Regression with character as dependent variable

I'm currently trying to perform a multiple linear regression on the voter turnout per state in the 2020 Presidential Election.
To create this regression model I would like to use the following variables: State, Total_Voters and Population.
When I try to run my linear regression I get the following error:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'
The dataset I've gathered is quite large. I have created a new dataframe with the variables which I need as follows:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
After that I remove all NA values:
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
After that I filter through the dataframe once more and filter out all the states which did not report:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
In the end the dataframe looks like this:
With the following summary:
However when I now try to run my multiple linear regression I get the error I have showcased above. The command looks like this:
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
I'm quite new to linear regressions but I'm eager to learn. I have looked through StackOverflow for quite a bit now, and couldn't figure it out.
It would be greatly appreciated if someone here would be able to assist me.
The full script at once:
Turnout_Rate_2020 <- sqldf("SELECT State_Full, F1a AS Total_Voters, population.Pop AS Population FROM e_2020 INNER JOIN population ON population.State = e_2020.State_Full")
# Change all NA to 0
Turnout_Rate_2020[is.na(Turnout_Rate_2020)] <- 0
summary(Turnout_Rate_2020)
# Select all again and filter out states which did not report. (values that were NA)
Turnout_Rate_2020 <- sqldf("SELECT State_Full, Total_Voters, Population FROM Turnout_Rate_2020 WHERE Total_Voters <> 0 AND Total_Voters >= 0 GROUP BY State_Full")
# Does not work and if I turn variables around I get NaN values.
lmTurnoutRate_2020 <- lm(State_Full ~ Population + Total_Voters, data = Turnout_Rate_2020)
summary(lmTurnoutRate_2020)
# Does not work
ggplot(lmTurnoutRate_2020, aes(x=State_Full,y=Population)) + geom_point() + geom_smooth(method=lm, level=0.95) + labs(x = "State", y = "Voters")
1) The input is missing from the question, so we will use mtcars and make cyl a character column. lm cannot handle that, but we could create a 0/1 model matrix from cyl and run lm on that. This performs a separate lm for each level of cyl. It is only applicable if the dependent variable has a small number of levels, as we have here, either naturally or because it has been cut into a small number of levels.
(In this case we probably want logistic regression, as with glm and family=binomial(), or ordinal logistic regression, as with polr in MASS or the ordinal package, or multinom in the nnet package. We show it with lm just to show it can be done, although it probably shouldn't be, because with only two values the dependent variable is not sufficiently gaussian.)
mtcars2 <- transform(mtcars, cyl = as.character(cyl))
lm(model.matrix(~ cyl + 0) ~ hp, mtcars2)
giving:
Call:
lm(formula = model.matrix(~cyl + 0) ~ hp, data = mtcars2)
Coefficients:
cyl4 cyl6 cyl8
(Intercept) 1.052957 0.390688 -0.443645
hp -0.004835 -0.001172 0.006007
With polr (which assumes the levels are ordered as they are with cyl):
library(MASS)
polr(cyl ~ hp, transform(mtcars2, cyl = factor(cyl)))
giving:
Call:
polr(formula = cyl ~ hp, data = transform(mtcars2, cyl = factor(cyl)))
Coefficients:
hp
0.1156849
Intercepts:
4|6 6|8
12.32592 17.25331
Residual Deviance: 20.35585
AIC: 26.35585
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
2) The other possibility is that your dependent variable just happens to be represented as character because of how it was created, but could be numeric if one used as.numeric(...) on it. We can't tell without the input, but using our example we can do this, although again it is likely inappropriate because cyl has only 3 values and so does not approximate a gaussian closely enough. Your data may be different though.
lm(cyl ~ hp, transform(mtcars2, cyl = as.numeric(cyl)))
giving:
Call:
lm(formula = cyl ~ hp, data = transform(mtcars2, cyl = as.numeric(cyl)))
Coefficients:
(Intercept) hp
3.00680 0.02168

Margins Package error using quadratic and interaction terms

I have code which uses the margins command in Stata and I am trying to replicate it in R using the "margins" package found here and on cran.
I keep getting the error:
marg1<-margins(reg2)
Error in names(classes) <- clean_terms(names(classes)) : 'names' attribute [18] must be the same length as the vector [16]
A minimal reproducible example is shown below:
install.packages("margins")
library(margins)
mod1 <- lm(log(mpg) ~ vs + cyl + hp + vs*hp + I(vs*hp*hp) + wt + I(hp*hp), data = mtcars)
(marg1 <- margins(mod1))
summary(marg1)
I need vs to be a dummy variable interacted with both a quadratic term and a normal interaction.
Does anyone know what I am doing wrong or if there is a way around this?
Your model specification is a bit confusing. For example, vs*hp introduces 3 terms: i) vs, ii) hp and iii) the interaction of vs and hp. As a result, vs and hp each appear twice in the formula you provided. You can simplify massively! Try this for example (I think it is what you want):
mtcars$hp2 = mtcars$hp^2
mod1 <- lm(log(mpg) ~ cyl + wt + vs*hp + vs*hp2, data = mtcars)
summary(mod1) # With this you can check that the model you specified is what you want
(marg1 <- margins(mod1)) # The error disappeared.
summary(marg1)
In general, I recommend avoiding I() in formula specifications, as it often gives rise to such errors when not treated with enough care (though sometimes one cannot avoid it). Good luck!
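You can check how R expands the * operator without fitting anything by inspecting the term labels of the formula. This is only a sketch in base R; the names f and labels_f are made up for illustration.

```r
# The simplified formula from above; hp2 stands for the precomputed hp^2 column
f <- log(mpg) ~ cyl + wt + vs*hp + vs*hp2

# terms() expands a*b into a + b + a:b and drops duplicated terms
labels_f <- attr(terms(f), "term.labels")
print(labels_f)
```

Each main effect shows up once, alongside the two interactions vs:hp and vs:hp2, which is the structure the question asked for.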

removing covariates from a linear mixed model using update

I'm newish to R. I have a linear mixed model with several predictors and I want to test the significance of each of them. I know that I could use lmerTest but my co-authors want me to do a likelihood ratio test for each predictor instead. I would like to use the update function to get a series of submodels that omit each predictor in turn. I tried the following
data(mtcars)
h=lmer(mpg ~ 1 + cyl + disp + hp + drat + (1|carb), data=mtcars)
predvars=c("cyl","disp","hp","drat")
for (i in predvars){
modelform=update(as.formula(paste0("h, . ~ . -",i)))
print(summary(modelform))
}
I got the following error
Error in parse(text = x, keep.source = FALSE) :
:1:2: unexpected ','
1: h,
^
I also tried using lapply
Fits=lapply(predvars, function(x) {update(h, .~.-i, list(i=as.name(x)))})
names(Fits)=predvars
which doesn't actually update the model, it just refits the full model i times. What am I doing wrong? Thanks.
Your first attempt generates an error because you put h inside as.formula. Do:
modelform <- update(h, as.formula(paste0(". ~ . -",i)))
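Putting that together with the likelihood ratio tests you mentioned, a rough sketch, assuming lme4 is installed. Fitting with REML = FALSE is my own choice here (the question does not say how the model should be fit), because likelihood ratio tests that compare fixed effects are normally based on ML fits. The names fits and lrts are made up for illustration. With this toy data lmer may print singular-fit or convergence messages.

```r
library(lme4)

# Full model, fitted by ML so the fixed-effect comparisons are likelihood based
h <- lmer(mpg ~ 1 + cyl + disp + hp + drat + (1 | carb),
          data = mtcars, REML = FALSE)
predvars <- c("cyl", "disp", "hp", "drat")

# One reduced model per predictor: build ". ~ . - <predictor>" as a string
fits <- lapply(predvars, function(v) {
  update(h, as.formula(paste(". ~ . -", v)))
})
names(fits) <- predvars

# Likelihood ratio test of each reduced model against the full model
lrts <- lapply(fits, function(m) anova(m, h))
print(lrts$cyl)
```

Your lapply attempt refit the full model because the literal symbol i in .~.-i is not a term in the model, so removing it changes nothing; building the formula from a string avoids that.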
