Create formula call from character string - r

I use a best subset selection package to determine the best independent variables from which to build my model (I do have a specific reason for doing this instead of using the best subset object directly). I want to programmatically extract the feature names and use the resulting string to build my model formula. The result would be something like this:
x <- "x1 + x2 + x3"
y <- "Surv(time, event)"
Because I'm building a coxph model, the formula is as follows:
coxph(Surv(time, event) ~ x1 + x2 + x3)
Using these string fields, I tried to construct the formula like so:
form <- y ~ x
This creates an object of class formula, but when I call coxph it doesn't evaluate the strings that y and x refer to; it treats y and x themselves as the variables. I get the following error:
Error in model.frame.default(formula = y ~ x) : object is not a matrix
If I call eval on the objects y and x within the coxph call, I get the following:
Error in model.frame.default(formula = eval(y) ~ eval(x), data = df) :
variable lengths differ (found for 'eval(x)')
I'm not really sure how to proceed. Thanks for your input.

Couldn't find a good dupe, so posting comment as an answer.
If you build the full formula as a string, including the ~, you can use as.formula on it, e.g.,
x = "x1 + x2 + x3"
y = "Surv(time, event)"
form = as.formula(paste(y, "~", x))
coxph(form, data = your_data)
For a reproducible example, consider the first example at the bottom of the ?coxph help page:
library(survival)
test1 <- list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
# Fit a stratified model
coxph(Surv(time, status) ~ x + strata(sex), test1)
# Call:
# coxph(formula = Surv(time, status) ~ x + strata(sex), data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
#
# Likelihood ratio test=1.09 on 1 df, p=0.3
# n= 7, number of events= 5
lhs = "Surv(time, status)"
rhs = "x + strata(sex)"
form = as.formula(paste(lhs, "~", rhs))
form
# Surv(time, status) ~ x + strata(sex)
## formula looks good
coxph(form, test1)
# Call:
# coxph(formula = form, data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
Same results either way.
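When the feature names arrive as a character vector rather than a single string, the same idea applies after collapsing them with paste. The following is a minimal sketch using lm and the built-in mtcars data so that it runs without the survival package; the names features, response, fit_string and fit_direct are illustrative assumptions, not part of the original question.

```r
# Hypothetical feature names, e.g. extracted from a subset-selection step
features <- c("wt", "hp", "qsec")
response <- "mpg"

# Collapse the names into the right-hand side, then build the full formula string
rhs  <- paste(features, collapse = " + ")
form <- as.formula(paste(response, "~", rhs))

# Fit once from the constructed formula and once from a hand-written one
fit_string <- lm(form, data = mtcars)
fit_direct <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# The two fits should have the same coefficients
all.equal(coef(fit_string), coef(fit_direct))
```

The same pattern should carry over to coxph by setting response to a string such as "Surv(time, status)", as in the answer above.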

Related

How to pass "weights" column name as a variable in R's lm?

The code below creates a linear model with R's lm, then a weighted model with a weights column. Finally, I try to pass in the weight column name with a variable weight_col and that fails. I'm pretty sure it's looking for "weight_col" in df, then the caller's environment, finds a variable of length 1, and the lengths don't match.
How do I get it to use weight_col as a name for the weights column in df?
I've tried several combinations of things without success.
> df <- data.frame(
x=c(1,2,3),
y=c(4,5,7),
w=c(1,3,5)
)
> lm(y ~ x, data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
2.333 1.500
> lm(y ~ x, data=df, weights=w)
Call:
lm(formula = y ~ x, data = df, weights = w)
Coefficients:
(Intercept) x
1.947 1.658
> weight_col <- 'w'
> lm(y ~ x, data=df, weights=weight_col)
Error in model.frame.default(formula = y ~ x, data = df, weights = weight_col, :
variable lengths differ (found for '(weights)')
> R.version.string
[1] "R version 3.6.3 (2020-02-29)"
You can use the data frame name with extractor operator:
lm(y ~ x, data = df, weights = df[[weight_col]])
Or you can use function get:
lm(y ~ x, data = df, weights = get(weight_col))
We can use [[ to extract the value of the column
lm(y ~ x, data=df, weights=df[[weight_col]])
Or with tidyverse
library(dplyr)
df %>%
summarise(model = list(lm(y ~ x, weights = .data[[weight_col]])))
Your first example is weights = w, which uses non-standard evaluation to find w in the context of df. So far, this is normal for interactive use.
Your second set is weights = weight_col which resolves to weights = "w", which is very different. There is nothing in R's non-standard (or standard) evaluation in which that makes sense.
As I said in my comment, use the standard-evaluation form with [[.
lm(y ~ x, data=df, weights=df[[weight_col]])
# Call:
# lm(formula = y ~ x, data = df, weights = df[[weight_col]])
# Coefficients:
# (Intercept) x
# 1.947 1.658
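If the column name is passed around programmatically, the [[ lookup can be wrapped in a small helper. This is only a sketch under the assumption that the model is always y ~ x; the function name fit_weighted is hypothetical.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 7), w = c(1, 3, 5))

# Hypothetical helper: look up the weights column by name with [[ ]]
fit_weighted <- function(data, weight_col) {
  stopifnot(weight_col %in% names(data))  # fail early on a misspelled name
  lm(y ~ x, data = data, weights = data[[weight_col]])
}

fit_var <- fit_weighted(df, "w")               # column name given as a string
fit_ref <- lm(y ~ x, data = df, weights = w)   # column given by bare name

# Both fits should agree
all.equal(coef(fit_var), coef(fit_ref))
```

Because data[[weight_col]] is ordinary standard evaluation, the helper does not depend on a variable called w existing anywhere outside the data frame.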

create function to name objects in R using existing character vector

I am trying to master building functions in R. Say I have a data frame or data.table,
dummy <- data.frame(y, x, a, b, who)
Where the vector "who" is like so,
who <- c("Joseph", "Kim", "Billy")
I would like to use the character vector to perform various regression models and name the outputs and their summary statistics. So for the entry, "Billy" in the vector above, I would like something like this:
function() {
ols.reg.Billy <- lm(y ~ x + a + b, data = dummy[dummy$who == "Billy"])
dw.Billy <- dwtest(ols.reg.Billy)
output.Billy <- list(ols.reg.Billy, dw.Billy)
return(output.Billy)
}
But for 500 different entries of the who vector above.
Is there some way to do this? What's the most efficient way? I keep getting errors and I feel I am seriously missing something. Is there some way to use paste?
If this doesn't solve it, please provide a reproducible example. It makes it easier to help you.
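library(lmtest)
outputs <- lapply(who, function(name) {
  # Keep only the rows for this person; the trailing comma selects rows
  # for a data.frame and also works for a data.table
  ols.reg <- lm(y ~ x + a + b, data = dummy[dummy$who == name, ])
  dw <- dwtest(ols.reg)
  # Return the fitted model and the Durbin-Watson test together
  list(ols.reg = ols.reg, dw = dw)
})
# Name each list element after the person it belongs to
names(outputs) <- who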
1) Map. Using the built-in CO2 data set, suppose we wish to regress uptake on conc separately for each Type. Note that this names the components by the Type.
Map(function(x) lm(uptake ~ conc, CO2, subset = Type == x), levels(CO2$Type))
giving this two-component list (one component for each level of Type, Quebec and Mississippi); the discussion continues after the output.
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
15.49754 0.01238
2) Map/do.call. We may wish not only to name the components using the Type but also to have x replaced by the actual Type in the Call: line of the output. In that case, use do.call to invoke lm, use quote so that the name of the data frame rather than its value is displayed, and use bquote to perform the substitution for x.
reg <- function(x) {
do.call("lm", list(uptake ~ conc, quote(CO2), subset = bquote(Type == .(x))))
}
Map(reg, levels(CO2$Type))
giving:
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Quebec")
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Mississippi")
Coefficients:
(Intercept) conc
15.49754 0.01238
3) lmList. The nlme package has lmList for doing this:
library(nlme)
lmList(uptake ~ conc | Type, CO2, pool = FALSE)
giving:
Call:
Model: uptake ~ conc | Type
Data: CO2
Coefficients:
(Intercept) conc
Quebec 23.50304 0.02308005
Mississippi 15.49754 0.01238113
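Another base R pattern, not used in the answers above, is to split the data by the grouping column and fit one model per piece; the list is then named by group automatically. This is a sketch on the same CO2 data, and the names co2, fits and ref are illustrative.

```r
# Drop the groupedData classes so plain data.frame methods are used
co2 <- as.data.frame(CO2)

# One data frame per level of Type, then one lm per data frame
fits <- lapply(split(co2, co2$Type), function(d) lm(uptake ~ conc, data = d))

names(fits)   # list elements are named after the levels of Type

# Cross-check one component against the subset-based fit
ref <- lm(uptake ~ conc, data = co2, subset = Type == "Quebec")
all.equal(coef(fits$Quebec), coef(ref))
```

Individual models can then be retrieved by name, for example fits$Mississippi, which addresses the naming part of the question without creating 500 separate objects.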

Estimate equation with varying numbers of parameters in R (lm) with demeaned variables

I want to estimate an equation such as:
(where the bar denotes the mean of a variable). That is, I want to automatically include interactions between Z and a demeaned version of X. So far I demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:
Edit:
As requested, a working code sample. Note that in the actual problem I have large (and varying) numbers of X variables, so I don't want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(Z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, because the variables were demeaned beforehand, the x1:x2 part of a term such as Z:x1:x2 would correspond to (x1-mean(x1))*(x2-mean(x2)), whereas what I want is x1*x2-mean(x1*x2).
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This gives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but it should work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model since I had already fitted it for the first approach.
As a side note, this may not be implemented in a more straightforward fashion because it is difficult to imagine situations in which demeaning the interaction is preferable to interacting the demeaned variables.
Comparing both approaches:
Here is the output of both approaches for comparison. As you can see, apart from the coefficient names everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292
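The distinction the question draws, a demeaned product versus a product of demeaned variables, can be checked numerically. The sketch below uses freshly simulated data with an arbitrary seed; the variable names prod_demeaned and demeaned_prod are illustrative.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x1 <- runif(100)
x2 <- runif(100)

# Demeaned product: x1*x2 - mean(x1*x2), what the question asks for
prod_demeaned <- x1 * x2 - mean(x1 * x2)

# Product of demeaned variables: what x1:x2 yields after manual demeaning
demeaned_prod <- (x1 - mean(x1)) * (x2 - mean(x2))

mean(prod_demeaned)   # zero up to floating-point error
mean(demeaned_prod)   # equals cov(x1, x2) * (n - 1) / n, generally not zero

# scale(..., scale = FALSE) reproduces the demeaned product
all.equal(as.vector(scale(x1 * x2, scale = FALSE)), prod_demeaned)
```

This is why wrapping the product itself in scale(), as in Alternative 1, gives a different regressor from interacting columns that were demeaned beforehand.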

R lm: Create regressions dynamically

I have a set of dependent variables y1, y2, ...., a set of independent variables x1,x2,..., and a set of controls d1,d2,.... These are all inside a data.table, let's call it data.
I need to do something along the lines of
out1 <- lm(y1 ~ x1, data=data)
out2 <- lm(y1 ~ x1 + d1 + d2, data=data)
....
This is of course not very nice, so I was thinking about writing a list containing all these regressions, and then just iterating through that. Something along the lines of
myRegressions <- list('out1' = y1 ~ x1, 'out2' = y1 ~ x1 + d1 + d2)
output <- NULL
for (reg in myRegressions)
{
output[reg] <- lm(myRegressions[[reg]])
}
This of course won't work: I cannot construct the list as the syntax is invalid outside of lm(). What's a good approach here instead?
You can use paste0 and as.formula to generate formulas and then simply put them into lm(), e. g.
regressors <- c("x1", "x1 + x2", "x1 + x2 + x3")
for (i in 1:length(regressors)) {
print(as.formula(paste0("y1", "~", regressors[i])))
}
This gives you the formulas (printed). Just store them in a list and iterate over that list with lapply like
lapply(stored_formulas, function(x) { lm(x, data=yourData) })
Using the built in anscombe data frame try this:
formulas = list(y1 ~ x1, y2 ~ x2)
lapply(formulas, function(fo) do.call("lm", list(fo, data = quote(anscombe))))
giving:
[[1]]
Call:
lm(formula = y1 ~ x1, data = anscombe)
Coefficients:
(Intercept) x1
3.0001 0.5001
[[2]]
Call:
lm(formula = y2 ~ x2, data = anscombe)
Coefficients:
(Intercept) x2
3.001 0.500
Note that the Call: portion of the output is accurately produced which will be useful if there are many components to the output list.
Formulas can also be supplied as character strings:
myReg <- list('out1' = "mpg ~ cyl")
lm(myReg[[1]],data=mtcars)
Call:
lm(formula = myReg[[1]], data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876
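The approaches above can be combined into one named list of specifications that is turned into formulas and then into fits. This is a sketch on mtcars; it uses reformulate from the stats package in place of paste0 plus as.formula, and the names specs, formulas and fits are illustrative.

```r
# Hypothetical named specifications: each entry lists the regressors of one model
specs <- list(out1 = "cyl", out2 = c("cyl", "wt", "hp"))

# reformulate() builds response ~ term1 + term2 + ... from character vectors
formulas <- lapply(specs, function(terms) reformulate(terms, response = "mpg"))

# Fit every model; lapply keeps the names out1, out2
fits <- lapply(formulas, function(fo) lm(fo, data = mtcars))

# Number of coefficients per model (intercept included)
sapply(fits, function(m) length(coef(m)))
```

Keeping the names on the list means results can be looked up as fits$out1 or fits$out2 rather than by position.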

How can I pass an argument as a character to a function within a function?

I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=factor(rep(c("A", "B"), 50)))  # factor() so that levels() works regardless of the stringsAsFactors default
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609
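To address the other half of the question, remembering which model belongs to which category, the list can also be named by level. The sketch below repeats the bquote approach with simulated data and an arbitrary seed; fit_one, lv and models are illustrative names.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
test.data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                        category = factor(rep(c("A", "B"), 50)))

# bquote() splices the value of `level` into the call before it is evaluated
fit_one <- function(level) {
  eval(bquote(lm(y ~ x1 + x2, data = test.data, subset = category == .(level))))
}

# Name each fitted model after its category level
lv <- levels(test.data$category)
models <- setNames(lapply(lv, fit_one), lv)

# The stored call now contains the literal level instead of a placeholder
deparse(models$B$call$subset)
```

With the literal value recorded in the call, each model can be inspected or refitted on its own without the original loop variable being in scope.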
