I need to extract predictions from a regression for every possible combination of dummy variables while holding the continuous predictor variables fixed.
The problem is that my model contains over 100 combinations of interactions and manually calculating all of these will be quite tedious. Is there an efficient method for iteratively calculating output?
The only way I can think of is to write a loop that generates all desired combinations to subsequently feed into the predict() function.
Some context:
I am trying to identify the regional differences of automobile resale prices by the model of car.
My model looks something like this:
lm(price ~ age + mileage + region_dummy_1 + ... + region_dummy_n + model_dummy_1 + ... + model_dummy_n + region_dummy_1 * model_dummy_1 + ... + region_dummy_1 * model_dummy_n, data = data)
My question is:
How do I produce a table of predicted prices for every model/region combination?
Use .*.
lm(price ~ .*.)
Here's a small reproducible example:
> df <- data.frame(y = rnorm(100,0,1),
+ x1 = rnorm(100,0,1),
+ x2 = rnorm(100,0,1),
+ x3 = rnorm(100,0,1))
>
> lm(y ~ .*., data = df)
Call:
lm(formula = y ~ . * ., data = df)
Coefficients:
(Intercept)           x1           x2           x3        x1:x2        x1:x3
   -0.02036      0.08147      0.02354     -0.03055      0.05752     -0.02399
      x2:x3
    0.24065
How does it work?
. is shorthand for "all predictors", and * includes the two-way interaction term.
For example, consider a data frame with 3 columns: Y (the dependent variable) and 2 predictors (X1 and X2). The syntax lm(Y ~ X1*X2) is shorthand for lm(Y ~ X1 + X2 + X1:X2), where X1:X2 is the interaction term.
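The expansion of X1*X2 can be checked directly. This is a minimal sketch on simulated data (the data frame d and its values are made up for illustration); it fits both spellings and compares the coefficients.

```r
# Simulated data, purely for illustration.
set.seed(42)
d <- data.frame(Y = rnorm(50), X1 = rnorm(50), X2 = rnorm(50))

# Shorthand and fully written-out versions of the same model.
short <- lm(Y ~ X1 * X2, data = d)
long  <- lm(Y ~ X1 + X2 + X1:X2, data = d)

# Both fits have the same terms and the same estimates.
names(coef(short))
all.equal(coef(short), coef(long))
```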
Extending this simple case, imagine we have a data frame with 3 predictors, X1, X2, and X3. lm(Y ~ .*.) is equivalent to lm(Y ~ X1 + X2 + X3 + X1:X2 + X1:X3 + X2:X3).
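To get from a fitted model to the table of predicted prices the question asks for, one option is to build the combinations with expand.grid() and pass them to predict(). The following is only a sketch under assumptions: region and car model are stored as factor columns (R creates the dummies itself) instead of hand-made dummy columns, and the data, column names and coefficients are invented for illustration.

```r
# Hypothetical data: column names and values are assumptions, not the asker's data.
set.seed(1)
cars <- data.frame(
  region    = factor(sample(c("north", "south", "west"), 200, replace = TRUE)),
  car_model = factor(sample(c("civic", "golf"), 200, replace = TRUE)),
  age       = runif(200, 1, 10),
  mileage   = runif(200, 5000, 150000)
)
cars$price <- 20000 - 1000 * cars$age - 0.05 * cars$mileage + rnorm(200, sd = 500)

# region * car_model adds both main effects and all region-by-model interactions.
fit <- lm(price ~ age + mileage + region * car_model, data = cars)

# One row per region/model combination, continuous predictors held at their means.
grid <- expand.grid(
  region    = levels(cars$region),
  car_model = levels(cars$car_model),
  age       = mean(cars$age),
  mileage   = mean(cars$mileage)
)
grid$predicted_price <- predict(fit, newdata = grid)
grid
```

With factors, no loop is needed: expand.grid() enumerates the combinations and predict() handles the dummy coding. If the data really only contain 0/1 dummy columns, the same idea applies, but the grid has to be restricted to valid rows (exactly one region dummy and one model dummy set per row).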
Related
We are given the following dataset (dataset used for linear regression: https://github.com/Iron-Maiden-19/regression/blob/master/shel2x.csv) and we fit this linear regression model, Model A:
modelA <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8,data=shel2x)
That works, but I am unsure how to solve the next part of the problem: fit Model B and compare its AIC to that of modelA. Here is Model B:
Y = β0 + β1 X1 + β2 X2 + β3 X2^2 + β4 X4 + β5 X6 + ε
So I know the beta values represent my coefficients from the first model, but how do I run this regression and how do I form the equation for it?
In R, you perform a linear regression just the way you already have.
modelA <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8,data=shel2x)
ModelA is a linear model of the form:
Y = beta0 + beta1*X1 + beta2*X2 + beta3*X3 + beta4*X4 + beta5*X5 + beta6*X6 + beta7*X7 + beta8*X8
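So, to fit model B, you would create another linear model in the same manner. The squared term has to be wrapped in I(), because inside a formula ^ means factor crossing rather than arithmetic, so a bare X2^2 collapses to X2:
modelB <- lm(Y ~ X1 + X2 + I(X2^2) + X4 + X6, data=shel2x)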
Then calling:
summary(modelA)
summary(modelB)
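Should give you the summary output for the two separate linear models. Note that summary() on an lm fit does not report the AIC; call AIC(modelA, modelB) to get it for both models. Without running the models and without looking at your data, I can't say which one will have the smaller AIC: AIC penalizes every additional parameter, so it leans toward the more parsimonious modelB unless the predictors dropped from modelA improve the fit by more than the penalty.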
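A sketch of the comparison on simulated data (shel2x.csv is not loaded here, so the data frame sim and its coefficients are assumptions and the resulting numbers say nothing about the real dataset). It also shows what happens to the squared term when I() is left out.

```r
# Simulated stand-in for shel2x: eight predictors X1..X8 and a response Y.
set.seed(123)
n <- 100
sim <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(sim) <- paste0("X", 1:8)
sim$Y <- 1 + 2 * sim$X1 - sim$X2 + 0.5 * sim$X2^2 + rnorm(n)

modelA <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = sim)
modelB <- lm(Y ~ X1 + X2 + I(X2^2) + X4 + X6, data = sim)

# Without I(), the formula parser treats X2^2 as X2, so no squared term is fitted.
no_I <- lm(Y ~ X1 + X2 + X2^2 + X4 + X6, data = sim)
names(coef(no_I))
names(coef(modelB))

# AIC() accepts several fitted models and returns one row per model.
AIC(modelA, modelB)
```

The lower AIC indicates the preferred model among those compared; the absolute values have no meaning on their own.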
How would one run a multiple linear regression on R, with > 100 covariates?
Is there a faster way besides (y ~ x1 + x2 + x3 + ... + x100)?
lm(y ~ ., data = YourData)
You can use the dot (.), which takes all columns of the supplied data other than the response column as covariates.
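For illustration, a small sketch with 100 simulated covariates (the contents of YourData are made up here). It also shows that individual columns can be excluded again with a minus sign.

```r
# Simulated data with covariates x1..x100 and a response y.
set.seed(1)
YourData <- as.data.frame(matrix(rnorm(500 * 100), ncol = 100))
names(YourData) <- paste0("x", 1:100)
YourData$y <- rnorm(500)

# The dot stands for every column except the response.
fit <- lm(y ~ ., data = YourData)
length(coef(fit))  # intercept plus 100 slopes

# Use all columns except x3.
fit_drop <- lm(y ~ . - x3, data = YourData)
"x3" %in% names(coef(fit_drop))
```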
I'm just learning how to deal with model.matrix. For example, to create out-of-sample predictions I extract the formula from my model, say it's a linear model.
Using the function formula(mymodel) extracts that:
form <- formula(y ~ x1 + x2 * x3)
Now, to create predictions I need a model.matrix without my y. I could type that by hand:
X <- model.matrix(~ x1 + x2 * x3, data=out.of.sample.data)
However, is there a way, for example using update, to get rid of the left-hand side of my formula?
Thanks!
It can be done with update by setting the response variable to NULL:
form <- formula(y ~ x1 + x2 * x3)
newform <- update(form, NULL ~ .)
This is how I usually do this. I'm not aware of a built-in function for this.
df = data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10), x3=rnorm(10))
mymodel = lm(y ~ x1 + x2 + x3, df)
form_vars_only =
formula(paste("~",strsplit(as.character(formula(mymodel)),"~")[[3]]))
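Another route that may be worth knowing is delete.response() from the stats package, which removes the response from a terms object. The sketch below uses simulated data (df and out.of.sample.data are invented here) and checks that the resulting design matrix reproduces what predict() returns.

```r
# Simulated training data and a fitted model with an interaction.
set.seed(1)
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
mymodel <- lm(y ~ x1 + x2 * x3, data = df)

# Terms of the model with the left-hand side removed.
rhs_terms <- delete.response(terms(mymodel))

# New data without a y column.
out.of.sample.data <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))
X <- model.matrix(rhs_terms, data = out.of.sample.data)
colnames(X)

# Manual predictions from the design matrix agree with predict().
manual <- drop(X %*% coef(mymodel))
all.equal(unname(manual), unname(predict(mymodel, newdata = out.of.sample.data)))
```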
It's a bit of a long question so thanks for bearing with me.
Here's my data
https://www.dropbox.com/s/jo22d68a8vxwg63/data.csv?dl=0
I constructed a mixed effect model
library(lme4)
mod <- lmer(sqrt(y) ~ x1 + I(x1^2) + x2 + I(x2^2) + x3 + I(x3^2) + x4 + I(x4^2) + x5 + I(x5^2) +
x6 + I(x6^2) + x7 + I(x7^2) + x8 + I(x8^2) + (1|loc) + (1|year), data = data)
All the predictors are standardised, and I am interested in how y changes with x5 while keeping the other variables at their mean values (equal to 0, since all the variables are standardised).
This is how I do it.
# make all predictors except x5 equal to zero
data$x1<-0
data$x2<-0
data$x3<-0
data$x4<-0
data$x6<-0
data$x7<-0
data$x8<-0
# Use the predict function
library(merTools)
fitted <- predictInterval(merMod = mod, newdata = data, level = 0.95, n.sims = 1000,stat = "median",include.resid.var = TRUE)
Now I want to plot the fitted as a quadratic function of x5. I do this:
i<-order(data$x5)
plot(data$x5[i],fitted$fit[i],type="l")
I expected this to produce a plot of y as a quadratic function of x5, but the plot I get does not show any quadratic curve. Can anyone tell me what I am doing wrong here?
predictInterval comes from merTools, but you can do this with lme4's own predict. The trick is just to make sure you set your random effects to 0. Here's how you can do that:
newdata <- data
newdata[,paste0("x", setdiff(1:8,5))] <- 0
y <- predict(mod, newdata=newdata, re.form=NA)
plot(data$x5, y)
The re.form=NA part drops the random effects, so the predictions are based on the fixed effects only.
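A sketch of the same approach on simulated data (the Dropbox file is not used; the data frame sim, its effect sizes and the reduced set of predictors are assumptions made for illustration). Predicting on an evenly spaced grid of x5 instead of on the observed rows gives a smooth curve. Because the response in the model is sqrt(y), the predictions are on the square-root scale.

```r
library(lme4)

# Simulated data with two crossed random intercepts (loc and year).
set.seed(2024)
n <- 400
sim <- data.frame(
  x5   = rnorm(n),
  x6   = rnorm(n),
  loc  = factor(sample(1:10, n, replace = TRUE)),
  year = factor(sample(2001:2008, n, replace = TRUE))
)
loc_eff  <- rnorm(10, sd = 0.5)
year_eff <- rnorm(8, sd = 0.3)
sim$y <- (5 + 0.8 * sim$x5 - 0.3 * sim$x5^2 + 0.2 * sim$x6 +
          loc_eff[as.integer(sim$loc)] + year_eff[as.integer(sim$year)] +
          rnorm(n, sd = 0.3))^2

mod <- lmer(sqrt(y) ~ x5 + I(x5^2) + x6 + (1 | loc) + (1 | year), data = sim)

# Grid over x5 with the other predictor fixed at its mean of 0.
grid <- data.frame(x5 = seq(min(sim$x5), max(sim$x5), length.out = 100), x6 = 0)

# re.form = NA: fixed effects only, so no loc or year columns are needed.
grid$fit_sqrt <- predict(mod, newdata = grid, re.form = NA)

# Squaring returns to the original units (not the same as the expected value of y).
grid$fit_y <- grid$fit_sqrt^2

plot(grid$x5, grid$fit_sqrt, type = "l", xlab = "x5", ylab = "predicted sqrt(y)")
```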
I need to create a probit model without the intercept. So, how can I remove the intercept from a probit model in R?
You don't say how you are intending to fit the probit model, but if it uses R's formula notation to describe the model then you can supply either + 0 or - 1 as part of the formula to suppress the intercept:
mod <- foo(y ~ 0 + x1 + x2, data = bar)
or
mod <- foo(y ~ x1 + x2 - 1, data = bar)
(both using pseudo R code of course - substitute your modelling function and data/variables.)
If this is a model fitted by glm() then something like:
mod <- glm(y ~ x1 + x2 - 1, data = bar, family = binomial(link = "probit"))
should do it (again substituting in your data and variable names as appropriate.)
Also, if you have an existing formula object, foo, you can remove the intercept with update like this:
foo <- y ~ x1 + x2
bar <- update(foo, ~ . -1)
# bar == y ~ x1 + x2 - 1
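A runnable version of the glm() call, as a sketch on simulated data (the data frame bar and the true coefficients are made up). It confirms that the fitted model has no intercept and that the update() route yields a formula without one.

```r
# Simulated binary response generated from a probit model without intercept.
set.seed(7)
bar <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
bar$y <- rbinom(200, 1, pnorm(0.8 * bar$x1 - 0.5 * bar$x2))

# "- 1" suppresses the intercept.
mod <- glm(y ~ x1 + x2 - 1, data = bar, family = binomial(link = "probit"))
names(coef(mod))  # only x1 and x2, no "(Intercept)"

# Same result starting from an existing formula object.
f  <- y ~ x1 + x2
f0 <- update(f, ~ . - 1)
attr(terms(f0), "intercept")  # 0 means no intercept
```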