How do I exclude specific variables from a glm in R?

I have 50 variables. This is how I use them all in my glm.
var = glm(Stuff ~ ., data=mydata, family=binomial)
But I want to exclude two of them. How do I exclude two specific variables? I was hoping there would be something like this:
var = glm(Stuff ~ . # notthisstuff, data=mydata, family=binomial)
thoughts?

In addition to using -, as suggested in the comments,
glm(Stuff ~ . - var1 - var2, data= mydata, family=binomial)
you can also subset the data frame passed in
glm(Stuff ~ ., data=mydata[ , !(names(mydata) %in% c('var1','var2'))], family=binomial)
or
glm(Stuff ~ ., data=subset(mydata, select=c( -var1, -var2 ) ), family=binomial )
(Be careful with that last one: subset() relies on non-standard evaluation, so it does not always work well when called inside other functions.)
You could also use the paste function to create a string representing the formula with the terms of interest (subsetting to the group of predictors that you want), then use as.formula to convert it to a formula.
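The paste/as.formula approach could look roughly like the sketch below. The toy data frame and the names exclude and predictors are invented for illustration; only Stuff, var1 and var2 come from the examples above.

```r
# Toy data standing in for `mydata` (invented purely for illustration)
set.seed(42)
mydata <- data.frame(
  Stuff = rbinom(50, 1, 0.5),
  var1 = rnorm(50), var2 = rnorm(50),
  var3 = rnorm(50), var4 = rnorm(50)
)

# Columns to leave out of the model (hypothetical helper variable)
exclude <- c("var1", "var2")

# Keep every column except the response and the excluded ones
predictors <- setdiff(names(mydata), c("Stuff", exclude))

# Build the formula as a string, then convert it to a formula object
fo <- as.formula(paste("Stuff ~", paste(predictors, collapse = " + ")))

fit <- glm(fo, data = mydata, family = binomial)
```

One advantage of this variant is that the fitted object stores the explicit formula (Stuff ~ var3 + var4 here) rather than the dot.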

Related

Create formula using the name of a data frame column

Given a data.frame, I would like to (dynamically) create a formula y ~ ., where y is the name of the first column of the data.frame.
What complicates this beyond the approach of as.formula(paste(names(df)[1], "~ .")) is that the name of the column might be a function, e.g.:
names(model.frame(lm(I(Sepal.Length/Sepal.Width) ~ Species, data = iris)))[1] is "I(Sepal.Length/Sepal.Width)"
So I need the column name to be quoted, i.e. in the above example I would want the formula to be `I(Sepal.Length/Sepal.Width)` ~ ..
This works:
df <- model.frame(lm(I(Sepal.Length/Sepal.Width) ~ Species, data = iris))
fm <- . ~ .
fm[[2]] <- as.name(names(df)[1])
But is there a neat way to do it in one step?
We could use reformulate
reformulate(".", response = sprintf("`%s`", names(df)[1]))
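As a small check, assuming the same df as in the question, the one-step version can be compared with the two-step version. Passing a symbol instead of a backquoted string as response is documented for reformulate in current R versions; older versions may behave differently.

```r
df <- model.frame(lm(I(Sepal.Length/Sepal.Width) ~ Species, data = iris))

# One step: backquote the column name so it is parsed as a single symbol
fm <- reformulate(".", response = sprintf("`%s`", names(df)[1]))

# Alternative: hand reformulate the symbol directly
fm_sym <- reformulate(".", response = as.name(names(df)[1]))

# Two-step version from the question, for comparison
fm_two <- . ~ .
fm_two[[2]] <- as.name(names(df)[1])

# In all three cases the left-hand side is a name, not a call to I()
is.name(fm[[2]])
```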

Using variable to select covariates for glm

I am running a simulation of multiple experiments using random data to create glm models. In each individual experiment I need to select different covariates to build the glm. Is there a way to use variable names to specify which covariates to use in the formula? For example, for a data frame called data that will contain the heading y plus a set of other headings that changes with each iteration, something like:
data <- data.frame(x1 = c(1:100), x2 = c(2:101), x3 = c(3:102), x4 = c(4:103), x5 = c(5:104), y = c(6:105))
#Experiment #1:
covars = c(x1,x2,x4)
glm(y ~ sum(covars),data=data)
#Experiment #2:
covars = c(x1,x3,x4,x5)
glm(y ~ sum(covars),data=data)
#Experiment #3:
covars = c(x2,x4,x5)
glm(y ~ sum(covars),data=data)
#etc...
So far, I have tried using this approach with the sum & colnames functions but I get the following error: "invalid 'type' (character) of argument"
Thank you!
We can use . to represent all the columns except the dependent column 'y'
glm(y ~ ., data = data)
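If a different subset of covariates is needed in each experiment, one possible sketch is to store the covariate names as character vectors and build each formula with reformulate. The random data and the experiments list below are assumptions made for illustration, not part of the original question.

```r
# Random toy data (invented; the question's columns are perfectly collinear)
set.seed(1)
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
                   x4 = rnorm(100), x5 = rnorm(100))
data$y <- rnorm(100)

# One character vector of covariate names per experiment
experiments <- list(
  c("x1", "x2", "x4"),
  c("x1", "x3", "x4", "x5"),
  c("x2", "x4", "x5")
)

# Build y ~ <covars> for each experiment and fit a glm with it
fits <- lapply(experiments, function(covars) {
  glm(reformulate(covars, response = "y"), data = data)
})

# Coefficient names of the first experiment's model
names(coef(fits[[1]]))
```

Note that the covariates are given as quoted names: c(x1, x2, x4) without quotes would look for objects called x1, x2 and x4 rather than for columns of data.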

how to select variables to use them in a formula with R

I want to write one function with which I can easily run multiple models. Only the input variables used by each model differ. I use the rpart function for this model. Ideally I have a table (named variables) listing the models and their variables, something that looks like this:
model1 model2 model3 …………………
gender gender age
age education wageparents
education nfriends
married
and then have a function that I can just call as fun(data, variables).
What I have used so far is:
tree <-rpart(wage ~ gender + age + education, method='class', data=Data, control=rpart.control(minsplit=1, minbucket=1, cp=0.002))
This works, but I have to change the model formula every time.
I tried something like this, but I am not sure what data type I have to use, etc.
wagefun <- function(Data, variables$model1){
tree <-rpart(wage ~ variables$model1, method='class', data=Data, control=rpart.control(minsplit=1, minbucket=1, cp=0.002))
return(tree)
}
Create the formula with reformulate:
form <- reformulate(termlabels = variables$model1, response = "wage", intercept = TRUE)
rpart(form, ...)
Note the intercept term that you have ignored so far: it is an additional modelling choice.
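Wrapped into a function, this could look like the sketch below. It assumes that variables is stored as a named list of character vectors (a convenient type when the models have different numbers of variables); the toy Data and the two model specifications are invented for illustration.

```r
library(rpart)

# Hypothetical model specifications: one character vector per model
variables <- list(
  model1 = c("gender", "age", "education"),
  model2 = c("gender", "education")
)

# Toy data, invented purely for illustration
set.seed(1)
Data <- data.frame(
  wage      = factor(sample(c("low", "high"), 40, replace = TRUE)),
  gender    = factor(sample(c("f", "m"), 40, replace = TRUE)),
  age       = sample(20:60, 40, replace = TRUE),
  education = sample(8:20, 40, replace = TRUE)
)

# Fit one classification tree for the given vector of variable names
wagefun <- function(data, vars) {
  form <- reformulate(termlabels = vars, response = "wage")
  rpart(form, method = "class", data = data,
        control = rpart.control(minsplit = 1, minbucket = 1, cp = 0.002))
}

# One tree per model specification
trees <- lapply(variables, wagefun, data = Data)
```

A single model can then be fitted with wagefun(Data, variables$model1), and trees is a named list with one fitted tree per entry of variables.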

How do I create a "macro" for regressors in R?

For long and repeating models I want to create a "macro" (so called in Stata and there accomplished with global var1 var2 ...) which contains the regressors of the model formula.
For example from
library(car)
lm(income ~ education + prestige, data = Duncan)
I want something like:
regressors <- c("education", "prestige")
lm(income ~ #regressors, data = Duncan)
All I could find is this approach, but applying it to my regressors does not work:
reg = lm(income ~ bquote(y ~ .(regressors)), data = Duncan)
as it throws me:
Error in model.frame.default(formula = y ~ bquote(.y ~ (regressors)), data =
Duncan, : invalid type (language) for variable 'bquote(.y ~ (regressors))'
Even the accepted answer to the same question:
lm(formula(paste('var ~ ', regressors)), data = Duncan)
fails and shows me:
Error in model.frame.default(formula = formula(paste("var ~ ", regressors)),
: object is not a matrix
And of course I tried as.matrix(regressors) :)
So, what else can I do?
Here are some alternatives. No packages are used in the first 3.
1) reformulate
fo <- reformulate(regressors, response = "income")
lm(fo, Duncan)
or you may wish to write the last line like this, so that the formula shown in the output looks nicer:
do.call("lm", list(fo, quote(Duncan)))
in which case the Call: line of the output appears as expected, namely:
Call:
lm(formula = income ~ education + prestige, data = Duncan)
2) lm(dataframe)
lm( Duncan[c("income", regressors)] )
The Call: line of the output looks like this:
Call:
lm(formula = Duncan[c("income", regressors)])
but we can make it look exactly as in the do.call solution in (1) with this code:
fo <- formula(model.frame(income ~., Duncan[c("income", regressors)]))
do.call("lm", list(fo, quote(Duncan)))
3) dot
An alternative similar to that suggested by @jenesaisquoi in the comments is:
lm(income ~., Duncan[c("income", regressors)])
The approach discussed in (2) to the Call: output also works here.
4) fn$
Prefacing a function with fn$ enables string interpolation in its arguments. This solution is nearly identical to the desired syntax shown in the question, using $ in place of # to perform substitution, and the flexible substitution could readily extend to more complex scenarios. The quote(Duncan) in the code could be written as just Duncan and it will still run, but the Call: shown in the lm output will look better if you use quote(Duncan).
library(gsubfn)
rhs <- paste(regressors, collapse = "+")
fn$lm("income ~ $rhs", quote(Duncan))
The Call: line looks almost identical to the do.call solutions above -- only spacing and quotes differ:
Call:
lm(formula = "income ~ education+prestige", data = Duncan)
If you wanted it absolutely the same then:
fo <- fn$formula("income ~ $rhs")
do.call("lm", list(fo, quote(Duncan)))
For the scenario you described, where regressors is in the global environment, you could use:
lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data = Duncan)
Alternatively, you could use a function:
modincome <- function(regressors){
  lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data = Duncan)
}
modincome(c("education", "prestige"))

Adding interaction terms to step AIC in R

So I have a bunch of variables sitting in a data frame and I want to use the step function to select a model.
Right now I'm doing something like this
step(lm(SalePrice ~ Gr.Liv.Area + Total.Bsmt.SF + Garage.Area + Lot.Area, list= ~upper(Neighborhood + Neighborhood:Bedroom.AbvGr) ....
How do I add multiple interaction terms without having to manually input them with the : notation?
Here is one way of adding interactions: Assume that all your data of interest is in dat and your dependent variable is named y. The code
init_mod <- lm(y ~ ., data = dat)
step(init_mod, scope = . ~ .^2, direction = 'forward')
will add interaction terms to your model, selecting them by AIC. If you want interactions up to order k, replace .^2 with .^k, with k written as a number (for example .^3).
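A self-contained sketch of the same idea on the built-in mtcars data follows. The choice of columns is arbitrary and only serves as an illustration; which interactions end up in the model depends on the data.

```r
# Small example data set: response mpg and three predictors (arbitrary choice)
dat <- mtcars[c("mpg", "wt", "hp", "qsec")]

# Start from the main-effects model
init_mod <- lm(mpg ~ ., data = dat)

# Forward selection: candidate terms are all two-way interactions
sel <- step(init_mod, scope = . ~ .^2, direction = "forward", trace = 0)

# Terms in the selected model (main effects plus any interactions that lowered AIC)
attr(terms(sel), "term.labels")
```

Because direction = "forward" only adds terms, the main effects of the initial model always stay in the result.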
