Remove dependent variable from formula for model.matrix - r

I'm just learning how to deal with model.matrix. For example, to create out-of-sample predictions I extract the formula from my model, say it's a linear model.
Using the function formula(mymodel) extracts that:
form <- formula(y ~ x1 + x2 * x3)
Now, to create predictions I need a model.matrix without my y. I could type that by hand:
X <- model.matrix(~ x1 + x2 * x3, data=out.of.sample.data)
However, is there a way, for example using update, to get rid of the left-hand side of my formula?
Thanks!

It can be done with update by setting the response variable to NULL:
form <- formula(y ~ x1 + x2 * x3)
newform <- update(form, NULL ~ .)

This is how I usually do this. I'm not aware of a built-in function for this.
df = data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10), x3=rnorm(10))
mymodel = lm(y ~ x1 + x2 + x3, df)
# as.character() on a two-sided formula gives c("~", lhs, rhs); keep only the rhs
form_vars_only = formula(paste("~", as.character(formula(mymodel))[3]))
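
For comparison, here is a minimal sketch of another route that stays within the stats package: delete.response() drops the left-hand side from a terms object, and model.matrix() accepts the result. The simulated data and the names train, rhs_terms, pred_manual and pred_builtin are assumptions made for illustration only.

```r
set.seed(1)
# Hypothetical training data and model (names are illustrative).
train <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
mymodel <- lm(y ~ x1 + x2 * x3, data = train)

# Out-of-sample data that contains predictors only, no y column.
out.of.sample.data <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))

# delete.response() removes the response from the model's terms object.
rhs_terms <- delete.response(terms(mymodel))
rhs_formula <- formula(rhs_terms)  # one-sided formula: ~ x1 + x2 * x3

# Build the design matrix for the new data without needing y.
X <- model.matrix(rhs_terms, data = out.of.sample.data)

# Predictions computed by hand should agree with predict().
pred_manual <- drop(X %*% coef(mymodel))
pred_builtin <- predict(mymodel, newdata = out.of.sample.data)
all.equal(unname(pred_manual), unname(pred_builtin))
```

Working from the terms object rather than from a pasted string keeps the formula's environment and any factor or interaction coding attached to the fitted model.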

How to loop an ANCOVA test across multiple variables? [duplicate]

I am new to R and I want to improve the following script with an *apply function (I have read about apply, but I couldn't manage to use it). I want to use the lm function on multiple dependent variables (which are columns in a data frame). I used
for (i in 1:3) {
  assign(paste0('lm.', names(data[i])), lm(formula = formula(i), data = data))
}
where formula(i) is defined as
formula=function(x)
{
as.formula ( paste(names(data[x]),'~', paste0(names(data[-1:-3]), collapse = '+')), env=parent.frame() )
}
Thank you.
If I understand you correctly, you are working with a dataset like this:
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
x1, x2 and x3 are covariates, and y1, y2 and y3 are three independent responses. You are trying to fit three linear models:
y1 ~ x1 + x2 + x3
y2 ~ x1 + x2 + x3
y3 ~ x1 + x2 + x3
Currently you loop through y1, y2 and y3, fitting one model at a time. You hope to speed the process up by replacing the for loop with lapply.
You are on the wrong track. lm() is an expensive operation. Unless your dataset is very small, the cost of the for loop itself is negligible, so replacing the for loop with lapply gives no performance gain.
Since you have the same RHS (right-hand side of ~) for all three models, the model matrix is the same for all three. Therefore, the QR factorization needs to be done only once. lm allows this, and you can use:
fit <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)
#Coefficients:
#                    y1         y2         y3
#(Intercept)  -0.081155   0.042049   0.007261
#x1           -0.037556   0.181407  -0.070109
#x2           -0.334067   0.223742   0.015100
#x3            0.057861  -0.075975  -0.099762
If you check str(fit), you will see that this is not a list of three linear models; instead, it is a single linear model with a single $qr object, but with multiple LHS. So $coefficients, $residuals and $fitted.values are matrices. The resulting linear model has an additional "mlm" class besides the usual "lm" class. I created a special mlm tag collecting some questions on the theme, summarized by its tag wiki.
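
The structure described above can be checked directly. The following is a small sketch that reuses the dat example; the name fit_y2 is introduced here only for the comparison with a separate single-response fit.

```r
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                  x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))

# One fit with three responses on the left-hand side.
fit <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)

class(fit)           # includes "mlm" in addition to "lm"
dim(coef(fit))       # one row per coefficient, one column per response
dim(residuals(fit))  # one row per observation, one column per response

# Each column matches what a separate single-response fit would give.
fit_y2 <- lm(y2 ~ x1 + x2 + x3, data = dat)
all.equal(coef(fit)[, "y2"], coef(fit_y2))
```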
If you have a lot more covariates, you can avoid typing or pasting formula by using .:
fit <- lm(cbind(y1, y2, y3) ~ ., data = dat)
#Coefficients:
#                    y1         y2         y3
#(Intercept)  -0.081155   0.042049   0.007261
#x1           -0.037556   0.181407  -0.070109
#x2           -0.334067   0.223742   0.015100
#x3            0.057861  -0.075975  -0.099762
Caution: Do not write
y1 + y2 + y3 ~ x1 + x2 + x3
This will treat y = y1 + y2 + y3 as a single response. Use cbind().
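
To make the difference concrete, here is a short sketch comparing the two formulas on the same dat example. The names fit_sum and fit_mlm are illustrative. Because least squares is linear in the response, the coefficients for the summed response equal the row sums of the multi-response coefficient matrix.

```r
set.seed(0)
dat <- data.frame(y1 = rnorm(30), y2 = rnorm(30), y3 = rnorm(30),
                  x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))

# "+" on the LHS is ordinary arithmetic: a single response y1 + y2 + y3.
fit_sum <- lm(y1 + y2 + y3 ~ x1 + x2 + x3, data = dat)

# cbind() on the LHS gives one column of coefficients per response.
fit_mlm <- lm(cbind(y1, y2, y3) ~ x1 + x2 + x3, data = dat)

is.matrix(coef(fit_sum))  # a plain coefficient vector
is.matrix(coef(fit_mlm))  # a coefficient matrix

# The summed-response fit is the sum of the three individual fits.
all.equal(coef(fit_sum), rowSums(coef(fit_mlm)))
```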
Follow-up:
I am interested in a generalization. I have a data frame df, where the first n columns are dependent variables (y1, y2, y3, ...) and the next m columns are independent variables (x1, x2, x3, ...). For n = 3 and m = 3 it is fit <- lm(cbind(y1, y2, y3) ~ ., data = dat). But how can I do this automatically, using the structure of df? I mean something like fit <- lm(cbind(df[something]) ~ df[something], data = dat) inside for (i in 1:n), where I have been building that "something" with paste and paste0. Thank you.
So you are programming your formula, or want to dynamically generate / construct model formulae in the loop. There are many ways to do this, and many Stack Overflow questions are about this. There are commonly two approaches:
use reformulate;
use paste / paste0 and formula / as.formula.
I prefer reformulate for its neatness; however, it does not support multiple LHS in the formula. It also needs some special treatment if you want to transform the LHS. So in the following I will use the paste solution.
For your data frame df, you may do
paste0("cbind(", paste(names(df)[1:n], collapse = ", "), ")", " ~ .")
A neater way is to use sprintf and toString to construct the LHS:
sprintf("cbind(%s) ~ .", toString(names(df)[1:n]))
Here is an example using the iris dataset:
string_formula <- sprintf("cbind(%s) ~ .", toString(names(iris)[1:2]))
# "cbind(Sepal.Length, Sepal.Width) ~ ."
You can pass this string formula to lm, as lm will automatically coerce it to a formula. Or you may do the coercion yourself using formula (or as.formula):
formula(string_formula)
# cbind(Sepal.Length, Sepal.Width) ~ .
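
As a quick end-to-end sketch of the steps above, the string formula can be handed to lm directly. Here n = 2 and the name fit_iris are illustrative choices, not part of the original question.

```r
n <- 2  # number of leading columns of iris to treat as responses

# Build "cbind(<first n column names>) ~ ." as a string.
string_formula <- sprintf("cbind(%s) ~ .", toString(names(iris)[1:n]))

# lm() accepts the string and coerces it to a formula internally.
fit_iris <- lm(string_formula, data = iris)

class(fit_iris)           # a multi-response fit
colnames(coef(fit_iris))  # one column per response
rownames(coef(fit_iris))  # intercept plus the remaining columns of iris
```

Note that Species is a factor, so it enters the right-hand side as dummy-coded columns rather than as a single term.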
Remark:
This multiple LHS formula is also supported elsewhere in R core:
the formula method for function aggregate;
ANOVA analysis with aov.
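
For instance, a minimal sketch of the aggregate case on the built-in iris data (the name means_by_species is illustrative):

```r
# cbind() on the LHS asks aggregate() to summarise several columns at once.
means_by_species <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                              data = iris, FUN = mean)

means_by_species  # one row per species, one column per summarised variable
```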

Add predictors one by one in random forest

Once more I need your help with a syntax problem, and I thank you for that.
I have a dataset that looks like this:
y <- rnorm(1000)
x1 <- rnorm(1000) + 0.2 * y
x2 <- rnorm(1000) + 0.2 * x1 + 0.1 * y
x3 <- rnorm(1000) - 0.1 * x1 + 0.3 * x2 - 0.3 * y
data <- data.frame(y, x1, x2, x3)
head(data)
#
I need a loop that runs a random forest starting with one predictor and then adds the remaining predictors one at a time, like this:
randomForest(y ~ x1, data = data)
randomForest(y ~ x1 + x2, data = data)
randomForest(y ~ x1 + x2 + x3, data = data)  # etc.
Would you be kind enough to help me? Thank you in advance!
You can build the formula as a string and convert it with as.formula():
library(randomForest)
lapply(1:3, \(i) {
  formula = as.formula(paste0("y~", paste0("x", 1:i, collapse = "+")))
  randomForest(formula, data = data)
})
A more general approach, for example if the predictors are not consistently named or you do not want to hard-code how many there are, is to obtain a character vector of the predictors, say using colnames(), and adjust the loop slightly:
predictors = colnames(data[, -1])
lapply(seq_along(predictors), \(i) {
  formula = as.formula(paste0("y~", paste0(predictors[1:i], collapse = "+")))
  randomForest(formula, data = data)
})
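
The formula-building step can be inspected on its own, without fitting anything. The sketch below uses reformulate() from the stats package in place of paste0() and as.formula(); the predictor names are assumed to match the example data, and formulas is an illustrative name.

```r
# Assumed predictor names, as in the example data above.
predictors <- c("x1", "x2", "x3")

# One formula per step: y ~ x1, then y ~ x1 + x2, and so on.
formulas <- lapply(seq_along(predictors), function(i) {
  reformulate(predictors[seq_len(i)], response = "y")
})

# Show the formulas as strings to confirm the cumulative pattern.
vapply(formulas, function(f) deparse(f), character(1))
```

Each element of formulas could then be passed to randomForest(f, data = data) inside the same lapply call.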

GLM and GAM modelling in RStudio [duplicate]

I need to create a probit model without the intercept. So, how can I remove the intercept from a probit model in R?
You don't say how you are intending to fit the probit model, but if it uses R's formula notation to describe the model then you can supply either + 0 or - 1 as part of the formula to suppress the intercept:
mod <- foo(y ~ 0 + x1 + x2, data = bar)
or
mod <- foo(y ~ x1 + x2 - 1, data = bar)
(both using pseudo R code of course - substitute your modelling function and data/variables.)
If the model is fitted with glm(), then something like:
mod <- glm(y ~ x1 + x2 - 1, data = bar, family = binomial(link = "probit"))
should do it (again substituting in your data and variable names as appropriate.)
Also, if you have an existing formula object, foo, you can remove the intercept with update like this:
foo <- y ~ x1 + x2
bar <- update(foo, ~ . -1)
# bar == y ~ x1 + x2 - 1
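
A runnable sketch of the glm() case follows. The simulated data frame sim_data and the coefficients used to generate y are assumptions made purely for illustration.

```r
set.seed(1)
# Simulated binary outcome from a probit model with no intercept.
sim_data <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim_data$y <- rbinom(200, 1, pnorm(0.5 * sim_data$x1 - 0.3 * sim_data$x2))

# "- 1" in the formula suppresses the intercept.
mod <- glm(y ~ x1 + x2 - 1, data = sim_data,
           family = binomial(link = "probit"))

names(coef(mod))                # no "(Intercept)" entry
attr(terms(mod), "intercept")   # 0 when the intercept has been removed

# The same formula obtained from an existing one via update().
with_intercept <- y ~ x1 + x2
no_intercept <- update(with_intercept, ~ . - 1)
```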
