I have the following data:
data.set <- data.frame(varA = rnorm(50), varB = rnorm(50), varC = rnorm(50),
                       binary.outcome = sample(c(0, 1), 50, replace = TRUE))
exp.vars <- c("varA", "varB", "varC")
I then wish to fit a logistic model using all of the exp.vars as predictors, without hard-coding them (I want to put this into a function so that different combinations of exp.vars can be tried). My attempt:
results <- glm( binary.outcome ~ get(paste(exp.vars, collapse="+")), family=binomial,
data=data.set )
How can I get this to work?
The . in the formula tells R to use all variables in the data.frame data.set (except the response, binary.outcome) as predictors. This should do it:
glm(binary.outcome ~ ., family = binomial, data = data.set)
Call: glm(formula = binary.outcome ~ ., family = binomial, data = data.set)
Coefficients:
(Intercept) varA varB varC
-0.4820 0.1878 -0.3974 -0.4566
Degrees of Freedom: 49 Total (i.e. Null); 46 Residual
Null Deviance: 66.41
Residual Deviance: 62.06 AIC: 70.06
and from ?formula
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
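That said, since the goal is to try different combinations of exp.vars inside a function, you can also build the formula explicitly from the character vector. A minimal sketch using reformulate() (fit.glm is just an illustrative wrapper name):
fit.glm <- function(vars, dat) {
  # reformulate() builds binary.outcome ~ varA + varB + varC from strings;
  # as.formula(paste("binary.outcome ~", paste(vars, collapse = "+"))) is equivalent
  glm(reformulate(vars, response = "binary.outcome"), family = binomial, data = dat)
}
fit.glm(exp.vars, data.set)
fit.glm(c("varA", "varC"), data.set)  # a different combination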
When I was doing polynomial regression, I tried to put fits of different polynomial degrees into a list, so I wrapped glm in a function:
library(MASS)
myglm <- function(dop) {
# dop: degree of polynomial
glm(nox ~ poly(dis, degree = dop), data = Boston)
}
However, there seems to be a problem related to lazy evaluation: the degree stored in the model is the parameter name dop rather than a specific number.
> myglm(2)
Call: glm(formula = nox ~ poly(dis, degree = dop), data = Boston)
Coefficients:
(Intercept) poly(dis, degree = dop)1 poly(dis, degree = dop)2
0.5547 -2.0031 0.8563
Degrees of Freedom: 505 Total (i.e. Null); 503 Residual
Null Deviance: 6.781
Residual Deviance: 2.035 AIC: -1347
When I do cross-validation using this model, an error occurs:
> cv.glm(Boston, myglm(2))
Error in poly(dis, degree = dop) : object 'dop' not found
So how can I solve this problem?
The error occurs because cv.glm re-fits the model by re-evaluating the call stored in the glm object (via update()), and in that context dop no longer exists. The fix is to splice the value of dop into the call before glm records it; quosures, quasiquotation, and tidy evaluation are useful here:
library(MASS)
library(boot)
library(rlang)
myglm <- function(dop) {
  # !!dop splices the value of dop into the call, so the stored
  # formula reads poly(dis, degree = 2) rather than degree = dop
  eval_tidy(quo(glm(nox ~ poly(dis, degree = !!dop), data = Boston)))
}
cv.glm(Boston, myglm(2))
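If you'd rather not depend on rlang, base R's bquote()/eval() does the same job: .(dop) is replaced by the value of dop before the call is evaluated, so the stored call contains the literal degree. A minimal sketch (myglm_base is just an illustrative name):
library(MASS)
library(boot)

myglm_base <- function(dop) {
  # .(dop) substitutes the value of dop into the call before evaluation,
  # so cv.glm can later re-fit the model without needing dop in scope
  eval(bquote(glm(nox ~ poly(dis, degree = .(dop)), data = Boston)))
}

cv.glm(Boston, myglm_base(2))$delta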
I'd like to pass a string as the formula in the aov function
This is my code:
library(fpp)
formula <- "score ~ single"
aov(formula, credit[c("single", "score")])
My goal is for the output to be the same as this:
aov(score ~ single, credit[c("single", "score")])
This question seems very close to "How to pass string formula to R's lm and see the formula in the summary?", except that question involves lm.
Below, do.call ensures that formula(formula) is evaluated before being passed to aov, so that the Call: line in the output displays properly; otherwise it would literally show formula(formula). do.call would also evaluate credit, expanding it into a huge output showing all of its values rather than the word credit, so we quote it to prevent that. If you don't care what the Call: line looks like, it can be shortened to aov(formula(formula), credit).
do.call("aov", list(formula(formula), quote(credit)))
giving:
Call:
aov(formula = score ~ single, data = credit)
Terms:
single Residuals
Sum of Squares 834.84 95658.64
Deg. of Freedom 1 498
Residual standard error: 13.8595
Estimated effects may be unbalanced
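To see what the quoting buys you, compare the shortened form: the fit is identical, but the Call: line echoes the unevaluated expression, since match.call records the arguments as supplied:
fit <- aov(formula(formula), credit)
fit$call
# aov(formula = formula(formula), data = credit)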
You can use reformulate to create the formula on the fly.
response <- 'score'
pred <- 'single'
aov(reformulate(pred, response), credit[c("single", "score")])
#Call:
# aov(formula = score ~ single, data = credit[c("single", "score")])
#Terms:
# single Residuals
#Sum of Squares 834.84 95658.64
#Deg. of Freedom 1 498
#Residual standard error: 13.8595
#Estimated effects may be unbalanced
I've created a function which fits polynomial regression models of increasing degree, up to the input degree, and collects all the models in a list.
After executing this function for a given set of inputs, I want to inspect the model list to calculate the MSE. However, I see that the individual models refer to parameter names within the function.
Question: How do I make the glm objects refer to the actual variables?
Function definition:
poly.iter = function(dep, indep, dat, deg) { # iterate through polynomial fits up to input degree
  set.seed(1)
  par(mfrow = c(ceiling(sqrt(deg)), ceiling(sqrt(deg)))) # partition the plotting window
  MSE.CV = rep(0, deg)
  modlist = list()
  xvar = seq(from = min(indep), to = max(indep), length.out = nrow(dat))
  for (i in 1:deg) {
    mod = glm(dep ~ poly(indep, i), data = dat)
    # MSE.CV[i] = cv.glm(dat, mod, K = 10)$delta[2] # cv.glm generates warnings inside this
    # function; googling has not helped, as that typically happens with missing obs,
    # but we don't have any in the Auto data
    modlist = c(modlist, list(mod))
    MSE.CV[i] = mean(mod$residuals^2) # the cv.glm delta is 5x this MSE; not sure why
    plot(jitter(indep), jitter(dep), cex = 0.5, col = "darkgrey")
    preds = predict(mod, newdata = list(indep = xvar), se = TRUE)
    lines(xvar, preds$fit, col = "blue", lwd = 2)
    matlines(xvar, cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit),
             lty = 3, col = "blue")
  }
  return(list("models" = modlist, "errors" = MSE.CV))
}
Function Call:
output.mpg.disp = poly.iter(mpg,displacement,Auto,9)
Inspecting the 3rd-degree model:
> output.mpg.disp[[1]][[3]]
Call: glm(formula = dep ~ poly(indep, i), data = dat)
Coefficients:
(Intercept) poly(indep, i)1 poly(indep, i)2 poly(indep, i)3
23.446 -124.258 31.090 -4.466
Degrees of Freedom: 391 Total (i.e. Null); 388 Residual
Null Deviance: 23820
Residual Deviance: 7392 AIC: 2274
Now I can't use this object inside cv.glm with the Auto dataset, as it will not recognize indep, dep, and i.
You can use the as.formula() function to transform a string containing your formula before calling glm(). This solves your question (how to make the glm objects refer to actual variables), but I'm not sure it is enough for the later call to cv.glm (I couldn't reproduce your code here without errors). To be clear, you replace the line
mod = glm(dep~poly(indep,i),data=dat)
with something like:
myexp = paste0(dep, " ~ poly(", indep, ", ", i, ")")
mod = glm(as.formula(myexp), data = dat)
It is then required that dep and indep be character strings holding the names of the variables you want to refer to (e.g. indep = "displacement").
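If you also need cv.glm to work afterwards, the stored call must contain the literal formula and a data object that is visible where cv.glm re-fits the model (it uses update(), which re-evaluates the call). A hedged sketch combining the paste0 idea with the do.call trick from the aov answer above, assuming the Auto data from the ISLR package:
library(ISLR)  # assumed source of the Auto data
library(boot)

dep <- "mpg"; indep <- "displacement"; i <- 3
myexp <- paste0(dep, " ~ poly(", indep, ", ", i, ")")

# do.call evaluates its arguments, so the stored call reads
# glm(formula = mpg ~ poly(displacement, 3), data = Auto)
mod <- do.call("glm", list(formula = as.formula(myexp), data = quote(Auto)))
cv.glm(Auto, mod, K = 10)$delta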
I am very new to statistics and R. In my dataset the target variable is flight status, used to predict whether a flight will be delayed or on time, so the response variable has two values: Delayed and Ontime. To construct a logistic regression model in R, do I have to recode the target variable to 0 and 1 first (say 0 for Delayed and 1 for Ontime), or can I keep it as a factor?
Please forgive me for the basic question.
data(iris)
Binary dependent variable:
iris$Species_binary <- ifelse(iris$Species=="setosa", "no", "yes")
Does it work as a factor?
glm(as.factor(iris$Species_binary)~iris$Sepal.Length, family="binomial")
Yes, it does.
Call: glm(formula = as.factor(iris$Species_binary) ~ iris$Sepal.Length,
family = "binomial")
Coefficients:
(Intercept) iris$Sepal.Length
-27.829 5.176
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 191
Residual Deviance: 71.84 AIC: 75.84
Would it work as a logical (boolean) variable?
glm(I(iris$Species_binary=="yes")~iris$Sepal.Length, family="binomial")
Call: glm(formula = I(iris$Species_binary == "yes") ~ iris$Sepal.Length,
family = "binomial")
Coefficients:
(Intercept) iris$Sepal.Length
-27.829 5.176
Degrees of Freedom: 149 Total (i.e. Null); 148 Residual
Null Deviance: 191
Residual Deviance: 71.84 AIC: 75.84
Yes, it would. Of course, a numeric variable would also work.
This is the case in most other packages/functions for logit as well, but it's possible that some could behave differently. Note that the logistic link is the default for the binomial family, which is why I didn't have to specify it.
Be sure that you know which level of the factor is counted as the positive level if you do it that way, though! Otherwise your interpretation of the results will be backwards.
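For a factor response, glm() takes the first factor level as the reference ("failure") and models the probability of the second level. A quick base-R sketch for checking and, if needed, swapping the positive level:
status <- factor(c("Delayed", "Ontime", "Ontime", "Delayed"))
levels(status)                        # "Delayed" "Ontime": glm would model P(Ontime)
status <- relevel(status, ref = "Ontime")
levels(status)                        # "Ontime" "Delayed": glm would model P(Delayed)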
I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
zero <- seq(from=0, to=1, by=1)
ids = dplyr::data_frame(location=seq(from=1, to=13, by=1))
dates = dplyr::data_frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by="days"))
data = merge(dates, ids)
data$y <- sample(zero, size=9503, replace=TRUE)
data$x <- sample(zero, size=9503, replace=TRUE)
While surveying the available packages, I've found a number of ways to (apparently) do this, but I'm not confident I've understood the differences between the packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data=data)
fixed <- clogit(y ~ x + strata(location), data=data)
One of the reasons for this uncertainty is the error I get when using pglm (also see this question), which says that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit"))
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data". But as far as getting glm to produce estimates of logistic coefficients within strata, that can be accomplished by adding family="binomial" and stratifying by your "unit" variable:
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take into account any date-ordering, which is what I would have expected to be the interest. But as I said above, there doesn't yet appear to be a hypothesis that is premised on any sequential ordering.
This would create a fixed-effects model that includes a spline relationship between date and the probability of a y-event. I chose to center the date rather than leaving it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200
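One caveat if the goal really is prediction for a particular day and location: scale() inside a formula has no makepredictcall() method (unlike ns()), so it is recomputed on newdata at predict() time, and single-row predictions come out wrong or NA. A hedged sketch that sidesteps this with a plain numeric day index, reusing the simulated data from the question (fixed2, day, and newdat are illustrative names):
library(splines)
data$day <- as.numeric(data$date - min(data$date))  # days since the first date
fixed2 <- glm(y ~ x + ns(day, 3) + factor(location), data = data, family = "binomial")
newdat <- data.frame(x = 1, day = 200, location = 2)
predict(fixed2, newdata = newdat, type = "response")  # P(y = 1) for that day and location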