How to pass a formula as a string in R?

I'd like to pass a string as the formula in the aov function.
This is my code:
library(fpp)
formula <- "score ~ single"
aov(formula, credit[c("single", "score")])
My goal is for the output to be the same as this:
aov(score ~ single,
credit[c("single", "score")])

This question seems very close to "How to pass string formula to R's lm and see the formula in the summary?", except that question involves lm.
Below, do.call ensures that formula(formula) is evaluated before it is passed to aov, so that the Call: line in the output shows the actual formula; otherwise it would literally show formula(formula). Because do.call would also evaluate credit, expanding it into a huge output showing all its values rather than the name credit, we quote it to prevent that. If you don't care what the Call: line looks like, this can be shortened to aov(formula(formula), credit).
do.call("aov", list(formula(formula), quote(credit)))
giving:
Call:
   aov(formula = score ~ single, data = credit)
Terms:
                  single Residuals
Sum of Squares    834.84  95658.64
Deg. of Freedom        1       498
Residual standard error: 13.8595
Estimated effects may be unbalanced
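To see the difference in the stored call without installing fpp, here is a minimal sketch that uses the built-in mtcars data as a stand-in for credit (the variable names mpg and cyl and the name f_string are assumptions for illustration, not part of the question):

```r
# mtcars stands in for the credit data (assumption for illustration)
f_string <- "mpg ~ factor(cyl)"

# Plain call: the formula argument is stored unevaluated in the call
fit_plain <- aov(formula(f_string), mtcars)

# do.call: the formula is evaluated first; quote() keeps the data name
fit_docall <- do.call("aov", list(formula(f_string), quote(mtcars)))

fit_plain$call   # formula argument is the expression formula(f_string)
fit_docall$call  # formula argument is the evaluated formula itself
```

Both fits are numerically the same; only the recorded call differs.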

You can use reformulate to create the formula on the fly.
response <- 'score'
pred <- 'single'
aov(reformulate(pred, response), credit[c("single", "score")])
#Call:
#   aov(formula = score ~ single, data = credit[c("single", "score")])
#Terms:
#                  single Residuals
#Sum of Squares    834.84  95658.64
#Deg. of Freedom        1       498
#Residual standard error: 13.8595
#Estimated effects may be unbalanced
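reformulate also accepts a vector of predictor names, which is handy when the right-hand side is built programmatically. A small sketch, again assuming the built-in mtcars data in place of credit (the column names are illustrative only):

```r
# Predictor and response names held as strings (illustrative names from mtcars)
preds <- c("wt", "hp")
response <- "mpg"

# reformulate() joins the predictors with "+" and attaches the response
f <- reformulate(preds, response = response)
print(f)

fit <- aov(f, data = mtcars)
fit
```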

Related

Why can't I get a confidence interval using the predict function in R?

I am trying to get a confidence interval for my response in a poisson regression model. Here is my data:
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
What I've done so far:
(i) read my data
(ii) fit a Y by poisson regression with X as a covariate
model <- glm(Y ~ X, family = "poisson", data = mydata)
(iii) use predict()
predict(model,newdata=data.frame(X=4),se.fit=TRUE,interval="confidence",level=0.95, type = "response")
I was expecting to get "fit, lwr, upr" for my response but I got the following instead:
$fit
1
30.21439
$se.fit
1
6.984273
$residual.scale
[1] 1
Could anyone offer some suggestions? I am new to R and have been struggling with this problem for a long time.
Thank you very much.
First, the function predict() that you are using is the method predict.glm(). If you look at its help file, it does not even have arguments 'interval' or 'level'. It doesn't flag them as erroneous because predict.glm() has the (in)famous ... argument, that absorbs all 'extra' arguments. You can write confidence=34.2 and interval="woohoo" and it still gives the same answer. It only produces the estimate and the standard error.
Second, one COULD then take the fit +/- 2*se to get an approximate 95 percent confidence interval. However, without getting into the weeds of confidence intervals, pivotal statistics, non-normality in the response scale, etc., this doesn't give very satisfying intervals because, for instance, they often include impossible negative values.
So, I think a better approach is to form an interval in the link scale, then transform it (this is still an approximation, but probably better):
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
model <- glm(Y ~ X, family = "poisson")
tmp <- predict(model, newdata=data.frame(X=4),se.fit=TRUE, type = "link")
exp(tmp$fit - 2*tmp$se.fit)
1
19.02976
exp(tmp$fit + 2*tmp$se.fit)
1
47.97273
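The same idea can be wrapped in a small helper that returns fit, lwr and upr in one data frame. This is only a sketch: the function name predict_ci is hypothetical, it uses the normal quantile qnorm() instead of the rough multiplier 2, and it relies on the inverse link stored in the model's family object. The interval remains an approximation.

```r
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
model <- glm(Y ~ X, family = poisson)

# Hypothetical helper: interval on the link scale, back-transformed to the response scale
predict_ci <- function(fit, newdata, level = 0.95) {
  p   <- predict(fit, newdata = newdata, se.fit = TRUE, type = "link")
  z   <- qnorm(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
  inv <- family(fit)$linkinv          # exp() for the default Poisson log link
  data.frame(fit = inv(p$fit),
             lwr = inv(p$fit - z * p$se.fit),
             upr = inv(p$fit + z * p$se.fit))
}

ci <- predict_ci(model, data.frame(X = 4))
ci
```

Because the bounds are computed before back-transforming, they cannot fall below zero.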

In R, comparing two regression coefficients from the same model

Using R, I want to statistically compare two coefficients from the same regression. In Stata, there is the command test B1 = B2. What is the equivalent in R? I checked several posts, but none of them answered this question.
https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models
SPSS: Comparing regression coefficient from multiple models
Comparing regression models in R
Here are some simulated data.
library('MASS')
mu <- c(0,0,0)
Sigma <- matrix(.5, nrow=3, ncol=3) + diag(3)*0.3
MyData <- as.data.frame(mvrnorm(n=10000, mu=mu, Sigma=Sigma)) # avoids %>%, which would need magrittr or dplyr loaded
names(MyData) = c('v1', 'v2', 'y')
MyModel = lm(y ~ v1 * v2, data = MyData)
summary(MyModel)
I want to compare the estimate for v1 with the estimate for v2, so that if v1 and v2 are manipulated, I can say something like "the influence of v1 on y is significantly higher than the influence of v2 on y".
You can try the multcomp package. If you look at the coefficients of your model:
coefficients(MyModel)
(Intercept) v1 v2 v1:v2
0.006961219 0.373547048 0.394760005 -0.012167754
You want to find the difference between the 2nd and 3rd term, so your contrast matrix is:
# yes it looks a bit weird at first
ctr = rbind("v1-v2"=c(0,1,-1,0))
And we can apply this using glht from multcomp:
library(multcomp)
summary(glht(MyModel,ctr))
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = y ~ v1 * v2, data = MyData)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
v1-v2 == 0 -0.02121 0.01640 -1.294 0.196
(Adjusted p values reported -- single-step method)
This works for most general linear models. In your summary function, you get the significance of each term based on the effect / standard error. The glht function does something similar. One exception I can think of, for logistic regression, is when you have complete separation.
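For a single contrast, the "effect / standard error" computation can also be sketched by hand in base R from coef() and vcov(). The simulated data below are an assumption for illustration (generated with rnorm rather than mvrnorm so that no extra package is needed); with only one hypothesis, the result should agree closely with what glht reports.

```r
set.seed(1)
n <- 500
d <- data.frame(v1 = rnorm(n), v2 = rnorm(n))   # illustrative data, not the question's
d$y <- 0.4 * d$v1 + 0.4 * d$v2 + rnorm(n)
fit <- lm(y ~ v1 * v2, data = d)

# Contrast picking out (coefficient of v1) - (coefficient of v2)
ctr <- c(0, 1, -1, 0)

est <- sum(ctr * coef(fit))                      # estimated difference
se  <- sqrt(drop(t(ctr) %*% vcov(fit) %*% ctr))  # its standard error
t_value <- est / se
p_value <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

c(estimate = est, se = se, t = t_value, p = p_value)
```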

How to make glm object within a function take input variable names and not parameter names?

I've created a function which fits polynomial regression models of increasing degree, up to the input degree. I also collect all of these models in a list.
After executing this function for a given set of inputs, I want to inspect the model list to calculate the MSE. However, I see that the individual models refer to the parameter names used within the function.
Question: How do I make the glm objects refer to the actual variables?
Function definition:
poly.iter = function(dep,indep,dat,deg){ #Function to iterate through polynomial fits upto input degree
  set.seed(1)
  par(mfrow=c(ceiling(sqrt(deg)),ceiling(sqrt(deg)))) #partitioning the plotting window
  MSE.CV = rep(0,deg)
  modlist = list()
  xvar = seq(from=min(indep),to=max(indep),length.out = nrow(dat))
  for (i in 1:deg){
    mod = glm(dep~poly(indep,i),data=dat)
    #MSE.CV[i] = cv.glm(dat,mod,K=10)$delta[2] #Inside of this function, cv.glm is generating warnings. Googling has not helped as it can typically happen with missing obs but we don't have any in Auto data
    modlist = c(modlist,list(mod))
    MSE.CV[i] = mean(mod$residuals^2) #GLM part is giving 5x the error i.e. delta is 5x of MSE. Not sure why
    plot(jitter(indep),jitter(dep),cex=0.5,col="darkgrey")
    preds = predict(mod,newdata=list(indep=xvar),se=T)
    lines(xvar,preds$fit,col="blue",lwd=2)
    matlines(xvar,cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit),lty=3,col="blue")
  }
  return(list("models"=modlist,"errors"=MSE.CV))
}
Function Call:
output.mpg.disp = poly.iter(mpg,displacement,Auto,9)
Inspecting 3rd degree model:
> output.mpg.disp[[1]][[3]]
Call: glm(formula = dep ~ poly(indep, i), data = dat)
Coefficients:
(Intercept) poly(indep, i)1 poly(indep, i)2 poly(indep, i)3
23.446 -124.258 31.090 -4.466
Degrees of Freedom: 391 Total (i.e. Null); 388 Residual
Null Deviance: 23820
Residual Deviance: 7392 AIC: 2274
Now I can't use this object inside cv.glm with the 'Auto' dataset, as it will not recognize indep, dep and i.
You can use the as.formula() function to turn a string containing your formula into a formula object before calling glm(). This addresses your question (how to make the glm objects refer to the actual variables), but I'm not sure whether it is enough for calling cv.glm later (I couldn't reproduce your code here without errors). To be clear, you replace the line
mod = glm(dep~poly(indep,i),data=dat)
with something like:
myexp = paste0(dep, "~ poly(", indep, ",", i, ")")
mod = glm(as.formula(myexp), data=dat)
This requires dep and indep to be character strings holding the names of the variables you want to refer to (e.g. indep="displacement").
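A minimal sketch of this idea, using the built-in mtcars data in place of Auto (the helper name fit_poly and the column names are assumptions for illustration). It goes one step further than the replacement above by splicing the already-built formula into the call with bquote(), so that the stored call holds the real formula rather than as.formula(myexp). Whether that is sufficient for cv.glm is not verified here.

```r
# Hypothetical helper: dep and indep are character strings naming columns of dat
fit_poly <- function(dep, indep, dat, deg) {
  f <- as.formula(paste0(dep, " ~ poly(", indep, ", ", deg, ")"))
  # bquote() inserts the evaluated formula into the call before glm() records it
  eval(bquote(glm(.(f), data = dat)))
}

mod <- fit_poly("mpg", "disp", mtcars, 3)  # mtcars stands in for Auto
formula(mod)      # refers to mpg and disp, not dep and indep
names(coef(mod))
```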

R: multivariate orthogonal regression without having to write the variable names explicitly

I have a dataframe train (21 predictors, 1 response, 1012 observations), and I suspect that the response is a nonlinear function of the predictors. Thus, I would like to perform a multivariate polynomial regression of the response on all the predictors, and then try to understand which are the most important terms. To avoid the collinearity problems of standard multivariate polynomial regression, I'd like to use multivariate orthogonal polynomials with polym(). However, I have quite a lot of predictors, and their names do not follow a simple rule. For example, in train I have predictors named X2,X3 and X5, but not X1 and X4. The response is X14. Is there a way to write the formula in lm without having to explicitly write the name of all predictors? Writing
OrthoModel=lm(X14~polym(.,2),data=train)
returns the error
Error in polym(., 2) : object '.' not found
EDIT: the model I wanted to fit contains about 3.5 billion terms, so it's useless. It's better to fit a model with only main effects, interactions and second degree terms -> 231 terms. I wrote the formula for a standard (non-orthogonal) second degree polynomial:
as.formula(paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", collapse=""))
where Xtrain is obtained from train by deleting the response column X14. However, when I try to express the polynomial in an orthogonal basis, I get a parse text error:
as.formula(
paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", "+",
paste( "poly(", paste0(names(Xtrain), ", degree=2)",
collapse="+"),
collapse="")
)
)
There are a couple of problems with that approach, one of which you already see. Even if the dot could be expanded within polym, you would still face an error when the 2 is evaluated: degree comes after the "dots" in polym's argument list, so it must be supplied as a named argument rather than positionally.
An approach using as.formula succeeds with the 'Orthodont' data frame in pkg:nlme (although using 'Sex' as the dependent variable is statistical nonsense). I took out the "Subject" column from the data and also took out "Sex" from the names passed to paste:
data(Orthodont, package="nlme")
lm( as.formula( paste("Sex~polym(" ,
paste(names(Orthodont[-(3:4)]), collapse=","),",degree=2)")),
data=Orthodont[-3])
Call:
lm(formula = as.formula(paste("Sex~polym(", paste(names(Orthodont[-(3:4)]),
collapse = ","), ",degree=2)")), data = Orthodont[-3])
Coefficients:
(Intercept) polym(distance, age, degree = 2)1.0
1.4433 -2.5849
polym(distance, age, degree = 2)2.0 polym(distance, age, degree = 2)0.1
0.4651 1.3353
polym(distance, age, degree = 2)1.1 polym(distance, age, degree = 2)0.2
-7.6514
Formula objects can be created from text input with as.formula. This is essentially an application of the last example in ?as.formula.
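The same pattern can be sketched with built-in data so it runs without nlme. Here a three-column subset of mtcars is assumed as a stand-in for train, with mpg playing the role of the response X14; the predictor names are collected with setdiff() rather than typed out:

```r
# Illustrative stand-in for train: one response (mpg) and two predictors
dat <- mtcars[c("mpg", "wt", "hp")]
response <- "mpg"
preds <- setdiff(names(dat), response)   # all columns except the response

# degree must be passed by name because it follows ... in polym()
f <- as.formula(paste0(response, " ~ polym(",
                       paste(preds, collapse = ", "),
                       ", degree = 2)"))
print(f)

fit <- lm(f, data = dat)
coef(fit)
```

With two predictors and degree 2 this yields five polynomial terms plus the intercept; the number of terms grows quickly as predictors are added, as the question's EDIT notes.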

Pasting object names into the glm function in R

I have the following data
data.set <- data.frame("varA"=rnorm(50),"varB"=rnorm(50),
"varC"=rnorm(50), binary.outcome=sample(c(0,1),50,replace=T) )
exp.vars <- c("varA","varB","varC")
I then wish to fit a logistic model using all of the exp.vars as explanatory variables without hard coding them (I want to put this into a function so that different combinations of exp.vars can be tried). My attempt:
results <- glm( binary.outcome ~ get(paste(exp.vars, collapse="+")), family=binomial,
data=data.set )
How can I get this to work?
The . in the formula tells R to use all variables in the data.frame data.set (except the response, binary.outcome) as predictors. This should do it:
glm( binary.outcome ~ ., family=binomial,
data=data.set )
Call: glm(formula = binary.outcome ~ ., family = binomial, data = data.set)
Coefficients:
(Intercept) varA varB varC
-0.4820 0.1878 -0.3974 -0.4566
Degrees of Freedom: 49 Total (i.e. Null); 46 Residual
Null Deviance: 66.41
Residual Deviance: 62.06 AIC: 70.06
and from ?formula
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
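The dot always picks up every remaining column. When only some of the names in exp.vars should enter the model, one option is to build the formula from the character vector with reformulate(). A sketch, assuming data shaped like the question's (the seed and the chosen subset are illustrative):

```r
set.seed(42)
data.set <- data.frame(varA = rnorm(50), varB = rnorm(50), varC = rnorm(50),
                       binary.outcome = sample(c(0, 1), 50, replace = TRUE))

# Any subset of column names can be supplied here
exp.vars <- c("varA", "varC")

# Builds binary.outcome ~ varA + varC from the strings
f <- reformulate(exp.vars, response = "binary.outcome")
results <- glm(f, family = binomial, data = data.set)
coef(results)
```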
