I would like to know how to write an lm formula that uses paste together with cbind for multiple multivariate regression.
Example
In my model I have a set of variables, which corresponds to the primitive example below:
data(mtcars)
depVars <- paste("mpg", "disp")
indepVars <- paste("qsec", "wt", "drat")
Problem
I would like to create a model with my depVars and indepVars. The model, typed by hand, would look like that:
modExample <- lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
I'm interested in generating the same formula without referring to variable names and only using depVars and indepVars vectors defined above.
Attempt 1
For example, what I had in mind would correspond to:
mod1 <- lm(formula = formula(paste(cbind(paste(depVars, collapse = ",")), " ~ ",
indepVars)), data = mtcars)
Attempt 2
I tried this as well:
mod2 <- lm(formula = formula(cbind(depVars), paste(" ~ ",
paste(indepVars,
collapse = " + "))),
data = mtcars)
Side notes
I found a number of good examples of how to use paste with formula, but I would like to know how to combine it with cbind.
This is mostly a syntax question; in my real data I have a number of variables I would like to introduce to the model, and making use of the previously generated vectors is more parsimonious and makes the code more presentable. In effect, I'm only interested in creating a formula object that contains cbind with variable names taken from one vector and the remaining variables taken from another vector.
In a word, I want to arrive at the formula in modExample without having to type variable names.
I think this works:
data(mtcars)
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
lm(formula(paste('cbind(',
paste(depVars, collapse = ','),
') ~ ',
paste(indepVars, collapse = '+'))), data = mtcars)
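Note that the vectors here are defined with c() rather than paste() as in the question. As a small illustration (base R only, the names pasted, vars and lhs are just for this sketch), paste() with several arguments joins them into a single string, so there is nothing left to collapse with a separator:

```r
# paste() joins its arguments into ONE string ...
pasted <- paste("mpg", "disp")
length(pasted)   # 1
pasted           # "mpg disp"

# ... whereas c() keeps one element per variable name,
vars <- c("mpg", "disp")
length(vars)     # 2

# which is what paste(..., collapse = ) needs to build the cbind() text.
lhs <- paste0("cbind(", paste(vars, collapse = ", "), ")")
lhs              # "cbind(mpg, disp)"
```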
All the solutions below use these definitions:
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
1) character string formula Create a character string representing the formula and then run lm using do.call. Note that the formula shown in the output displays correctly and is written out.
fo <- sprintf("cbind(%s) ~ %s", toString(depVars), paste(indepVars, collapse = "+"))
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = "cbind(mpg, disp) ~ qsec+wt+drat", data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
1a) This would also work:
fo <- sprintf("cbind(%s) ~.", toString(depVars))
do.call("lm", list(fo, quote(mtcars[c(depVars, indepVars)])))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars[c(depVars,
indepVars)])
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
2) reformulate @akrun and @Konrad, in comments below the question, suggest using reformulate. This approach produces a "formula" object whereas the ones above produce a character string as the formula. (If a formula object were desired for the prior solutions, fo <- formula(fo) would produce one.) Note that it is important that the response argument to reformulate be a call object and not a character string or else reformulate will interpret the character string as the name of a single variable.
fo <- reformulate(indepVars, parse(text = sprintf("cbind(%s)", toString(depVars)))[[1]])
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
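As a variation on 2), the cbind(...) call can also be assembled from symbols with as.call instead of parsing text. This is only a sketch under the same depVars/indepVars definitions; the name lhs is introduced here for illustration:

```r
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")

# Build the unevaluated call cbind(mpg, disp) from symbols, no parse() needed.
lhs <- as.call(c(as.name("cbind"), lapply(depVars, as.name)))

# reformulate() accepts a call object as the response.
fo <- reformulate(indepVars, response = lhs)
fo

fit <- lm(fo, data = mtcars)
dim(coef(fit))   # one row per term, one column per response
```

Working with symbols avoids quoting problems if a variable name is not syntactically valid, since as.name handles such names without extra backticks.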
3) lm.fit Another way that does not use a formula at all is:
m <- as.matrix(mtcars)
fit <- lm.fit(cbind(1, m[, indepVars]), m[, depVars])
The output is a list with these components:
> names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
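As a quick sanity check on 3), the coefficients from lm.fit can be compared with those from the hand-typed formula. This is a sketch; ref and same are names introduced here, and dimnames are dropped before comparing because the intercept column built with cbind(1, ...) has no name:

```r
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")

m <- as.matrix(mtcars)
fit <- lm.fit(cbind(1, m[, indepVars]), m[, depVars])

# Reference model written out by hand.
ref <- lm(cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)

# Compare the numeric values only (the dimnames differ).
same <- isTRUE(all.equal(unname(fit$coefficients), unname(coef(ref))))
same
```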
Related
I am building a logistic regression model with a data set containing about 40 variables. The first step I use when building these types of models is I run each variable univariately with the DV (Hosmer, Lemeshow, & Sturdivant, 2013). I have built a function that does this for me and returns the p-value of each.
Fit Univariate logistic regression model for each covariate
uni.log2 <- function(x) {
log.mod2 <- glm(Renewf ~ x, data = dt.train2, family = binomial())
return(coef(summary(log.mod2))[,4]) #get p-values only
}
I then apply this function to each of the selected columns in my dt
#apply function to selected IV's
apply(X = dt.train2[c(3:16)], MARGIN = 2, FUN = uni.log2)
The next step I would like to do is screen these variables using a threshold of p < 0.25 and return a list of the names of the variables which were univariately significant at p < 0.25.
Does anyone have any idea how this can be done?
I am able to set a threshold and copy a list of names from a multivariate model using this code:
threshold <- 0.001
coefs <- summary(log.mod2)$coefficients
signif_form <- as.formula(paste("Renewf ~",
paste(names(which(coefs[2:nrow(coefs), 4] < threshold)), collapse = "+")))
But, again, I do not know how to paste the names from the series of univariate regression models. If someone knows how to do this I would greatly appreciate some help.
Thank you in advance!
If you still want to use this approach after looking over the link provided by @BenBolker (and perhaps other resources on the perils of stepwise regression and statistical significance)...
The following code will return a vector of p-values for the independent variable in each regression. I've used the built-in mtcars data frame for illustration.
library(tidyverse)
library(broom)
pvals = sapply(names(mtcars)[names(mtcars) != "vs"], function(x) {
glm(paste("vs ~ ", x), data=mtcars, family=binomial) %>%
tidy %>%
filter(term==x) %>% pull(p.value)
})
pvals
mpg cyl disp hp drat wt qsec am
0.006590045 0.001917098 0.002453817 0.012340143 0.021777872 0.008672977 0.008813419 0.343628917
gear carb
0.250981095 0.004157666
The code above uses the pipe operator (%>%) to chain functions together. After creating the model with glm, tidy returns the coefficients and p-values as a data frame:
glm(vs ~ mpg, data=mtcars, family=binomial) %>%
tidy
term estimate std.error statistic p.value
1 (Intercept) -8.8330726 3.162274 -2.793266 0.005217877
2 mpg 0.4304135 0.158422 2.716880 0.006590045
Then the filter and pull functions select the p-value for the particular variable under consideration:
glm(vs ~ mpg, data=mtcars, family=binomial) %>%
tidy %>% filter(term=="mpg") %>% pull(p.value)
[1] 0.006590045
Wrapping the whole thing in sapply returns a named vector of p-values, where the names are the independent variables in each univariate regression.
To return only elements below a p-value threshold:
pvals[pvals < 0.25]
mpg cyl disp hp drat wt qsec carb
0.006590045 0.001917098 0.002453817 0.012340143 0.021777872 0.008672977 0.008813419 0.004157666
If you just want the names of the variables that meet the threshold criterion:
names(pvals[pvals < 0.25])
To directly return just the elements below the p-value threshold:
pvals = sapply(names(mtcars)[names(mtcars) != "vs"], function(x) {
glm(paste("vs ~ ", x), data=mtcars, family=binomial) %>%
tidy %>%
filter(term==x) %>% pull(p.value)
}) %>% .[. < 0.25]
Finally, packaging this as a function to return the names of the desired variables:
select_vars = function(DV, data, threshold) {
sapply(names(data)[names(data) != DV], function(x) {
glm(paste(DV, " ~ ", x), data=data, family=binomial) %>%
tidy %>%
filter(term==x) %>% pull(p.value)
}) %>% .[. < threshold] %>% names
}
select_vars("vs", mtcars, 0.25)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "carb"
select_vars("Species", iris %>% filter(Species %in% c("versicolor","virginica")), 0.001)
[1] "Sepal.Length" "Petal.Length" "Petal.Width"
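For comparison, the same screening can be sketched in base R without the tidyverse. The function name screen_vars is hypothetical, and the sketch assumes every candidate predictor is numeric, so that its coefficient row is named after the variable:

```r
# Hypothetical helper: names of predictors whose univariate logistic
# regression p-value falls below `threshold`.
# Assumes numeric predictors (coefficient row name == variable name).
screen_vars <- function(dv, data, threshold = 0.25) {
  ivs <- setdiff(names(data), dv)
  pvals <- vapply(ivs, function(x) {
    fit <- glm(reformulate(x, response = dv), data = data, family = binomial)
    coef(summary(fit))[x, "Pr(>|z|)"]
  }, numeric(1))
  names(pvals)[pvals < threshold]
}

res <- screen_vars("vs", mtcars, 0.25)
res
```

vapply with numeric(1) fails loudly if a model does not return exactly one p-value, which is a reasonable guard for this kind of screening loop.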
I have a question regarding the lm() function in R.
I understand that lm() is used for regression modeling and I know that one can do this:
lm(response ~ explanatory1 + explanatory2 + ... + explanatoryN, data = dataset)
Now my question is: "Suppose that N is large, is there a short cut that I can use that doesn't involve me having to write all N variable names?"
Thanks in advance!
Edit: I left out a big part of the question that I really needed an answer to. Suppose that I wanted to remove 1 to k explanatory variables and only include n-k of those variables.
Assuming mtcars as an example:
I would capture the predictors. I stick to a basic example, but one could use regex with grep and keep the same logic (see below). I am using all the columns with the exception of the first one ("mpg").
predictors <- names(mtcars)[-1]
# [1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
myFormula <- paste("mpg ~ ", paste0(predictors, collapse = " + "))
# [1] "mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb"
lm(data = mtcars, formula = myFormula)
Regex Example
Assuming iris as an example. I would like to match all the column names containing "Petal".
predictors <- grep(x = names(iris), pattern = "Petal", value = TRUE)
#[1] "Petal.Length" "Petal.Width"
myFormula <- paste("Sepal.Width ~ ", paste0(predictors, collapse = " + "))
# [1] "Sepal.Width ~ Petal.Length + Petal.Width"
lm(data = iris, formula = myFormula)
You could use the dot sign to select all variables, and just use the minus sign to select those that should not be used as predictors.
lm(Sepal.Length ~ .-Species -Petal.Length, iris)
Call:
lm(formula = Sepal.Length ~ . - Species - Petal.Length, data = iris)
Coefficients:
(Intercept) Sepal.Width Petal.Width
3.4573 0.3991 0.9721
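If the variables to drop are themselves held in a character vector, the same model can be built programmatically. A minimal sketch, where exclude and predictors are illustrative names:

```r
# Variables to leave out, e.g. chosen elsewhere in the script.
exclude <- c("Species", "Petal.Length")

# Everything except the response and the excluded columns.
predictors <- setdiff(names(iris), c("Sepal.Length", exclude))

fo <- reformulate(predictors, response = "Sepal.Length")
fit <- lm(fo, data = iris)
names(coef(fit))
```

This fits the same predictors as the dot-and-minus formula above, but the formula stored in the model lists the remaining variables explicitly.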
You can use the .
lm(response~., data = data)
You can just use a dot lm(response ~ ., data= dataset)
Example using the mtcars dataset (already in R)
ex = lm(mpg~., data = mtcars)
summary (ex)
I know there are several ways to compare regression models. One way is to create models (from simple to multiple) and compare R2, Adjusted R2, etc:
Mod1: y = b0 + b1*x1
Mod2: y = b0 + b1*x1 + b2*x2
Mod3: y = b0 + b1*x1 + b2*x2 + b3*x3 (etc)
I'm aware that some packages can perform stepwise regression, but I'm trying to do this with purrr. I can create several simple linear models (thanks to this post here), and now I want to know how to create regression models that each add a specific IV to the equation:
reproducible code
data(mtcars)
library(tidyverse)
library(purrr)
library(broom)
iv_vars <- c("cyl", "disp", "hp")
make_model <- function(nm) lm(mtcars[c("mpg", nm)])
fits <- Map(make_model, iv_vars)
glance_tidy <- function(x) c(unlist(glance(x)), unlist(tidy(x)[, -1]))
t(iv_vars %>% Map(f = make_model) %>% sapply(glance_tidy))
Output
What I want:
Mod1: mpg ~cyl
Mod2: mpg ~ cyl + disp
Mod3: mpg ~ cyl + disp + hp
Thanks much.
I would begin by creating a tibble with a list column storing your formulae. Then map lm over the formulae, and map glance over the models.
library(tidyverse)
library(broom)
mtcars %>% as_tibble()
formula <- c(mpg ~ cyl, mpg ~ cyl + disp)
output <-
tibble(formula) %>%
mutate(model = map(formula, ~lm(formula = .x, data = mtcars)),
glance = map(model, glance))
output$glance
output %>% unnest(glance)
You could cumulatively paste over your vector of iv_vars to get the combinations you want. I used the code in this answer to do this.
I use the plus sign as the separator between variables to get ready for the formula notation in lm.
cumpaste = function(x, .sep = " ") {
Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
}
( iv_vars_cum = cumpaste(iv_vars, " + ") )
[1] "cyl" "cyl + disp" "cyl + disp + hp"
Then switch the make_model function to use a formula and a dataset. The explanatory variables, separated by the plus sign, get passed to the function after the tilde in the formula. Everything is pasted together, which lm conveniently interprets as a formula.
make_model = function(nm) {
lm(paste0("mpg ~", nm), data = mtcars)
}
Which we can see works as desired, returning a model with both explanatory variables.
make_model("cyl + disp")
Call:
lm(formula = paste0("mpg ~", nm), data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
You'll likely need to rethink how you want to combine the info together, as you will now have differing numbers of columns due to the increased number of coefficients.
A possible option is to add dplyr::bind_rows to your glance_tidy function and then use map_dfr from purrr for the final output.
glance_tidy = function(x) {
dplyr::bind_rows( c( unlist(glance(x)), unlist(tidy(x)[, -1]) ) )
}
iv_vars_cum %>%
Map(f = make_model) %>%
map_dfr(glance_tidy, .id = "model")
# A tibble: 3 x 28
model r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cyl 0.7261800 0.7170527 3.205902 79.56103 6.112687e-10 2 -81.65321 169.3064
2 cyl + disp 0.7595658 0.7429841 3.055466 45.80755 1.057904e-09 3 -79.57282 167.1456
3 cyl + disp + hp 0.7678877 0.7430186 3.055261 30.87710 5.053802e-09 4 -79.00921 168.0184 ...
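If only a few fit statistics are needed, the nested models can also be sketched in base R, building each formula with reformulate from the first k variables. The names formulas, fits and r2 are illustrative:

```r
iv_vars <- c("cyl", "disp", "hp")

# One formula per prefix of iv_vars: cyl; cyl + disp; cyl + disp + hp.
formulas <- lapply(seq_along(iv_vars), function(k) {
  reformulate(iv_vars[seq_len(k)], response = "mpg")
})

fits <- lapply(formulas, lm, data = mtcars)
names(fits) <- vapply(formulas, deparse, character(1))

# R-squared of each nested model, named by its formula.
r2 <- vapply(fits, function(f) summary(f)$r.squared, numeric(1))
round(r2, 4)
```

Because each model is stored under its formula text, the resulting vector reads like the "model" column in the tibble above.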
I am trying to build an R shiny application where I have the user select variables for a model. The elements that the user selects get put into a vector. How do I remove the quotes as well as put spaces between each element they select, to be variables in a model?
As an example:
> vars <- c("cyl", "disp", "hp")
> my.model <- lm(mpg ~ paste(vars, collapse = "+"), data = mtcars)
Gives the error:
Error in model.frame.default(formula = mpg ~ paste(vars, collapse = "+"), :
variable lengths differ (found for 'paste(vars, collapse = "+")')
From reading other somewhat similar questions on Stackoverflow, someone suggested to use as.name() to remove the quotation marks, but this produces another error:
> vars <- c("cyl", "disp", "hp")
> my.model <- lm(mpg ~ as.name(paste(vars, collapse = "+")), data = mtcars)
Error in model.frame.default(formula = mpg ~ as.name(paste(vars, collapse = "+")), :
invalid type (symbol) for variable 'as.name(paste(vars, collapse = "+"))'
A formula is not just a string without quotes. It's a collection of unevaluated symbols. Try using the built-in reformulate function to build your formula.
vars <- c("cyl", "disp", "hp")
my.model <- lm(reformulate(vars,"mpg"), data = mtcars)
as.formula should be able to coerce strings into a formula:
lm(as.formula((paste("mpg ~", paste(vars, collapse = "+")))), data = mtcars)
#Call:
#lm(formula = as.formula((paste("mpg ~", paste(vars, collapse = "+")))),
# data = mtcars)
#Coefficients:
#(Intercept) cyl disp hp
# 34.18492 -1.22742 -0.01884 -0.01468
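Since the variable names come from user input in this case, it may be worth checking them before building the formula. A rough sketch with a hypothetical helper fit_selected (not part of Shiny or base R), which falls back to an intercept-only model when nothing is selected:

```r
# Hypothetical helper: validate user-selected names, then fit with reformulate().
fit_selected <- function(vars, response, data) {
  unknown <- setdiff(c(vars, response), names(data))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  if (length(vars) == 0) vars <- "1"   # nothing selected: intercept only
  lm(reformulate(vars, response = response), data = data)
}

vars <- c("cyl", "disp", "hp")
my.model <- fit_selected(vars, "mpg", mtcars)
coef(my.model)
```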
The following code fits 4 different model formulas to the mtcars dataset, using either for loop or lapply. In both cases, the formula stored in the result is referred to as formulas[[1]], formulas[[2]], etc. instead of the human-readable formula.
formulas <- list(
mpg ~ disp,
mpg ~ I(1 / disp),
mpg ~ disp + wt,
mpg ~ I(1 / disp) + wt
)
res <- vector("list", length=length(formulas))
for (i in seq_along(formulas)) {
res[[i]] <- lm(formulas[[i]], data=mtcars)
}
res
lapply(formulas, lm, data=mtcars)
Is there a way to make the full, readable formula show up in the result?
This should work
lapply(formulas, function(x, data) eval(bquote(lm(.(x),data))), data=mtcars)
And it returns
[[1]]
Call:
lm(formula = mpg ~ disp, data = data)
Coefficients:
(Intercept) disp
29.59985 -0.04122
[[2]]
Call:
lm(formula = mpg ~ I(1/disp), data = data)
Coefficients:
(Intercept) I(1/disp)
10.75 1557.67
....etc
We use bquote to insert the formula into the call to lm and then evaluate the expression.
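A similar effect can be sketched with do.call, which places the formula object itself into the call, while quote(mtcars) keeps the data argument readable. The name fits is illustrative:

```r
formulas <- list(
  mpg ~ disp,
  mpg ~ I(1 / disp),
  mpg ~ disp + wt,
  mpg ~ I(1 / disp) + wt
)

# do.call() substitutes the formula object into the call;
# quote(mtcars) stops the whole data frame from being printed in the call.
fits <- lapply(formulas, function(f) {
  do.call("lm", list(formula = f, data = quote(mtcars)))
})

fits[[2]]$call
```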
Why not just:
lapply( formulas, function(frm) lm( frm, data=mtcars))
#------------------
[[1]]
Call:
lm(formula = frm, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
[[2]]
Call:
lm(formula = frm, data = mtcars)
Coefficients:
(Intercept) I(1/disp)
10.75 1557.67
snipped....
If you wanted the names of the result to carry the character version of the formulas, it would just be:
names(res) <- as.character(formulas)
res[1]
#-----
$`mpg ~ disp`
Call:
lm(formula = frm, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
You can also try something like
library(purrr)
library(tibble)
models <- map(formulas, lm, data = mtcars)
models