Adding interaction terms to step AIC in R - r

So I have a bunch of variables sitting in a data frame and I want to use the step function to select a model.
Right now I'm doing something like this
step(lm(SalePrice ~ Gr.Liv.Area + Total.Bsmt.SF + Garage.Area + Lot.Area, list= ~upper(Neighborhood + Neighborhood:Bedroom.AbvGr) ....
How do I add multiple interaction terms without having to manually input them with the : notation?

Here is one way of adding interactions: Assume that all your data of interest is in dat and your dependent variable is named y. The code
init_mod <- lm(y ~ ., data = dat)
step(init_mod, scope = . ~ .^2, direction = 'forward')
will add two-way interaction terms to your model one at a time, for as long as an addition lowers the AIC. If you want interactions up to order k, you can replace .^2 with .^k.
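As a rough, self-contained illustration of the approach above, here is a sketch on the built-in mtcars data. The choice of columns and the renaming of mpg to y are assumptions made only so the example runs; trace = 0 just silences the step-by-step printout.

```r
# Sketch only: three predictors from mtcars, with mpg renamed to y (an assumption)
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]
names(dat)[1] <- "y"

# Start from the main-effects model
init_mod <- lm(y ~ ., data = dat)

# Forward search over all two-way interactions of the predictors
sel <- step(init_mod, scope = . ~ .^2, direction = "forward", trace = 0)

# Inspect which terms were kept
formula(sel)
```

Because the search is forward-only, the main effects in init_mod are never dropped; only interaction terms can be added.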

Related

How to add to my lm regression formula a control for the dependent variable at (t-1)

I'm structuring a simple linear regression model using lm and since I think that my dependent variable is history-dependent I would like to add a control variable that is the dependent variable at time (t-1).
How can I do that in R? This is my model so far:
lm(Y ~ Country + Year + GDP, data=df)
And I would like to add a control variable like so:
lm(Y ~ Y(t-1) + Country + Year + GDP, data=df)
I hope it is clear! Thanks in advance!
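You can use the dplyr package to add the lagged control. Since the data has several countries, the lag is taken within each country (assuming one row per country and year) so that values do not carry over from one country to the next.
library(dplyr)
# create a new variable for the value of Y at time (t-1), within each country
df <- df %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  mutate(Y_lag = lag(Y)) %>%
  ungroup()
# run the regression with Y_lag as a control variable
fit <- lm(Y ~ Y_lag + Country + Year + GDP, data = df)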
1) Assuming that the data is in order of increasing time, remove the first or last element of each column so that the variables align. We have used the built-in BOD data frame since the input was not provided in the question -- see the info at the top of the r tag page for guidance on asking questions.
nr <- nrow(BOD)
lm(demand[-1] ~ Time[-1] + demand[-nr], BOD)
2) Use flag from the collapse package.
library(collapse)
lm(demand ~ Time + flag(demand), BOD)
3) An alternative is the dyn package. In particular, comparing models using anova is tricky when using lags, but dyn has an anova.dyn method which addresses these problems automatically. Note that dplyr clobbers R's lag with its own lag, which causes dozens or hundreds of packages to fail, so be sure that dplyr is not loaded, or use library(dplyr, exclude = c("lag", "filter")) if you need dplyr. This works with zoo and ts objects.
library(dyn)
z <- read.zoo(BOD, drop = FALSE)
dyn$lm(demand ~ time(z) + lag(demand, -1), z)
4) There is also the dynlm package, which uses a slightly different syntax than dyn but is similar in principle. An important advantage of dynlm is that it supports instrumental variables regression via two-stage least squares.
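To see what the alignment in approach (1) amounts to, the following base R sketch compares it with an explicit lag column. The names bod and demand_lag are made up for this illustration, and it relies on lm dropping the row with the missing lag under the default na.action.

```r
# Approach (1): drop the first row of the response and the last row of the lag
nr <- nrow(BOD)
fit_drop <- lm(demand[-1] ~ Time[-1] + demand[-nr], data = BOD)

# Same model with an explicit lag column (hypothetical name demand_lag);
# the first observation has no previous value, so it gets NA
bod <- BOD
bod$demand_lag <- c(NA, head(bod$demand, -1))
fit_col <- lm(demand ~ Time + demand_lag, data = bod)

# Both fits should give the same coefficients (names differ, values do not)
all.equal(unname(coef(fit_drop)), unname(coef(fit_col)))
```

Keeping the lag as a column makes it easier to inspect which rows were actually paired before fitting.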

Regression with many variables, but not enough to justify using . and subtracting unnecessary variables

I'm trying to run a regression with roughly 20 variables, in a dataset that has 50 variables. So it looks something like:
lm(data=data, formula = y ~ explanatory_1 + ... + explanatory_20)
Obviously this works fine, but we want the code to look a little cleaner. A lot of answers tell you to use . - however, I don't want to do that, because the dataset has about 20 or so variables that we don't use in the regression. i.e. We'd be subtracting as many variables as we include in the normal regression.
Is there a way to group the explanatory vars into a list, so it can instead look like
lm(data=data, formula = y ~ list)?
Furthermore, in some specifications we include a new covariate that also acts as an interaction term on all the original covariates, so ideally we would have
lm(data=data, formula = y ~ list + new_var + new_var:list).
Can this be done? Thanks!
You can put the explanatory variables in a vector and use reformulate
x_vars <- c('cyl', 'disp', 'hp')
lm(data = mtcars, formula = reformulate(x_vars, response = 'mpg'))
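The interaction part of the question can probably be handled the same way, by pasting the interaction labels together before calling reformulate. This is a sketch under assumptions: am stands in for the asker's new_var, and mtcars stands in for their data.

```r
# Base covariates kept in a character vector
x_vars <- c("cyl", "disp", "hp")

# Stand-in for the extra covariate that also interacts with every base covariate
new_var <- "am"

# Main effects, the new covariate, and one interaction label per base covariate
rhs <- c(x_vars, new_var, paste0(new_var, ":", x_vars))
f <- reformulate(rhs, response = "mpg")

fit <- lm(f, data = mtcars)
f
```

Dropping new_var and the pasted interaction labels from rhs gives back the plain specification, so both variants can share the same x_vars vector.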

Using variable to select covariates for glm

I am running a simulation of multiple experiments using random data to create glm models. In each individual experiment I need to select different covariates to build the glm. Is there a way to use variable names to specify which covariates to use in the formula? For example, for a data frame called data that will contain the heading y plus a set of other headings that changes with each iteration, something like:
data <- data.frame(x1 = c(1:100),x2 = c(2:101),x3 = c(3:102),x4 = c(4:103),x5 = c(5:104),y = c(6:105))
#Experiment #1:
covars = c(x1,x2,x4)
glm(y ~ sum(covars),data=data)
#Experiment #2:
covars = c(x1,x3,x4,x5)
glm(y ~ sum(covars),data=data)
#Experiment #3:
covars = c(x2,x4,x5)
glm(y ~ sum(covars),data=data)
#etc...
So far, I have tried using this approach with the sum & colnames functions but I get the following error: "invalid 'type' (character) of argument"
Thank you!
We can use . to represent all the columns except the dependent column 'y'
glm(y ~ ., data = data)
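To use a different set of covariates in each experiment, one option is to keep the covariate names as character vectors and pass only those columns (plus y) to glm, so that . expands to exactly the chosen set. The sketch below uses simulated data; sim, experiments, and fits are names invented for this example.

```r
# Simulated data, for illustration only
set.seed(1)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
                  x4 = rnorm(100), x5 = rnorm(100))
sim$y <- rnorm(100)

# One character vector of covariate names per experiment
experiments <- list(c("x1", "x2", "x4"),
                    c("x1", "x3", "x4", "x5"),
                    c("x2", "x4", "x5"))

# Subset the columns so that "." covers only the selected covariates
fits <- lapply(experiments, function(covars) {
  glm(y ~ ., data = sim[, c("y", covars)])
})

# Coefficient names show which covariates entered each model
lapply(fits, function(f) names(coef(f)))
```

Calling reformulate(covars, response = "y") inside the loop would be an equivalent way to build each formula without subsetting the data.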

did modeling in R - right set up of data in staggered model

I would appreciate any insights into staggered did (difference-in-differences) models.
I wanted to ask if I use the correct function to set-up the model for a did (data structure provided below):
did=time*treated
didreg = lm(y ~ time + treated + did + x + factor(year) + factor(firm), data = sample)
The data looks like:
I'm not familiar with difference-in-difference modelling, but from skimming the Wiki it seems that what you want is a simple interaction. To fit that, you don't even need to calculate a new variable (did); you can specify it directly in the model. There are a couple of ways to specify that with R formula syntax:
# Simple main effects models, no interactions
main_mod <- lm(y ~ time + treated + x + factor(year) + factor(firm), data = sample)
# Model with the interaction effect explicitly specified
did_mod1 <- lm(y ~ time + treated + time:treated + x + factor(year) + factor(firm), data = sample)
# Model with shortened syntax for specifying interactions
did_mod2 <- lm(y ~ time * treated + x + factor(year) + factor(firm), data = sample)
did_mod1 and did_mod2 are identical; did_mod2 is just a more compact way of writing the same model. The * indicates that you want both the main effects and the interactions of the variables to the left and the right. It's recommended to always fit main effects when you fit interactions, so the second way of writing the model saves time & space.
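A quick way to convince yourself that the hand-made did variable and the formula-based interaction give the same fit is to compare them on simulated data. Everything in this sketch is made up for illustration (sim, the effect sizes, the 0/1 coding of time and treated), and the year and firm factors are left out to keep it short.

```r
# Simulated two-group, two-period data (illustrative values only)
set.seed(42)
n <- 200
sim <- data.frame(time = rbinom(n, 1, 0.5),
                  treated = rbinom(n, 1, 0.5),
                  x = rnorm(n))
sim$y <- 1 + 0.5 * sim$time + 0.3 * sim$treated +
  2 * sim$time * sim$treated + 0.1 * sim$x + rnorm(n)

# Hand-made interaction variable, as in the question
sim$did <- sim$time * sim$treated
m_manual <- lm(y ~ time + treated + did + x, data = sim)

# Interaction specified in the formula, in both notations
m_colon <- lm(y ~ time + treated + time:treated + x, data = sim)
m_star  <- lm(y ~ time * treated + x, data = sim)

# The interaction estimate should be the same under all three specifications
c(manual = coef(m_manual)[["did"]],
  colon  = coef(m_colon)[["time:treated"]],
  star   = coef(m_star)[["time:treated"]])
```

Note that R moves the interaction term after the main effects in the coefficient table, so it is safer to compare coefficients by name than by position.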

how do i exclude specific variables from a glm in R?

I have 50 variables. This is how I use them all in my glm.
var = glm(Stuff ~ ., data=mydata, family=binomial)
But I want to exclude 2 of them. So how do I exclude 2 in specific? I was hoping there would be something like this:
var = glm(Stuff ~ . # notthisstuff, data=mydata, family=binomial)
thoughts?
In addition to using the - like in the comments
glm(Stuff ~ . - var1 - var2, data= mydata, family=binomial)
you can also subset the data frame passed in
glm(Stuff ~ ., data=mydata[ , !(names(mydata) %in% c('var1','var2'))], family=binomial)
or
glm(Stuff ~ ., data=subset(mydata, select=c( -var1, -var2 ) ), family=binomial )
(be careful with that last one, the subset function sometimes does not work well inside of other functions)
You could also use the paste function to create a string representing the formula with the terms of interest (subsetting to the group of predictors that you want), then use as.formula to convert it to a formula.
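The paste and as.formula route described above might look roughly like this. The example is a sketch on a few mtcars columns, with am as a stand-in binary response and qsec and drat as the two variables to leave out; those choices are assumptions for illustration only.

```r
# Small stand-in data set: binary response am plus four candidate predictors
dat <- mtcars[, c("am", "wt", "hp", "qsec", "drat")]

# Predictors to keep: every column except the response and the two exclusions
keep <- setdiff(names(dat), c("am", "qsec", "drat"))

# Build the formula as a string, then convert it
f <- as.formula(paste("am ~", paste(keep, collapse = " + ")))

fit <- glm(f, data = dat, family = binomial)
f
```

The same keep vector could also be passed to reformulate(keep, response = "am"), which avoids the manual string handling.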
