How do I create a "macro" for regressors in R?

For long and repeating models I want to create a "macro" (so called in Stata and there accomplished with global var1 var2 ...) which contains the regressors of the model formula.
For example from
library(car)
lm(income ~ education + prestige, data = Duncan)
I want something like:
regressors <- c("education", "prestige")
lm(income ~ #regressors, data = Duncan)
All I could find is this approach, but my application of it to the regressors doesn't work:
reg = lm(income ~ bquote(y ~ .(regressors)), data = Duncan)
as it throws me:
Error in model.frame.default(formula = y ~ bquote(.y ~ (regressors)), data =
Duncan, : invalid type (language) for variable 'bquote(.y ~ (regressors))'
Even the accepted answer to the same question:
lm(formula(paste('var ~ ', regressors)), data = Duncan)
fails and shows me:
Error in model.frame.default(formula = formula(paste("var ~ ", regressors)),
: object is not a matrix.
And of course I tried as.matrix(regressors) :)
So, what else can I do?

Here are some alternatives. No packages are used in the first 3.
1) reformulate
fo <- reformulate(regressors, response = "income")
lm(fo, Duncan)
or you may wish to write the last line as this so that the formula that is shown in the output looks nicer:
do.call("lm", list(fo, quote(Duncan)))
in which case the Call: line of the output appears as expected, namely:
Call:
lm(formula = income ~ education + prestige, data = Duncan)
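As a minimal sketch of the reformulate approach that runs without the car package, the following uses the built-in mtcars data as a stand-in for Duncan. The column names (mpg, disp, hp) are illustrative and not from the question.

```r
# Regressor "macro": a plain character vector of column names
regressors <- c("disp", "hp")

# reformulate() builds the formula mpg ~ disp + hp from the strings
fo <- reformulate(regressors, response = "mpg")

# do.call() with quote(mtcars) keeps the Call: line readable
fit <- do.call("lm", list(fo, quote(mtcars)))
print(fit$call)
```

The same regressors vector can then be reused across several models by changing only the response argument.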
2) lm(dataframe)
lm( Duncan[c("income", regressors)] )
The Call: line of the output looks like this:
Call:
lm(formula = Duncan[c("income", regressors)])
but we can make it look exactly as in the do.call solution in (1) with this code:
fo <- formula(model.frame(income ~., Duncan[c("income", regressors)]))
do.call("lm", list(fo, quote(Duncan)))
3) dot
An alternative similar to that suggested by @jenesaisquoi in the comments is:
lm(income ~., Duncan[c("income", regressors)])
The approach discussed in (2) to the Call: output also works here.
4) fn$
Prefacing a function with fn$ enables string interpolation in its arguments. This solution is nearly identical to the desired syntax shown in the question, using $ in place of # to perform the substitution, and the flexible substitution could readily extend to more complex scenarios. The quote(Duncan) in the code could be written as just Duncan and it would still run, but the Call: shown in the lm output looks better if you use quote(Duncan).
library(gsubfn)
rhs <- paste(regressors, collapse = "+")
fn$lm("income ~ $rhs", quote(Duncan))
The Call: line looks almost identical to the do.call solutions above -- only spacing and quotes differ:
Call:
lm(formula = "income ~ education+prestige", data = Duncan)
If you wanted it absolutely the same then:
fo <- fn$formula("income ~ $rhs")
do.call("lm", list(fo, quote(Duncan)))

For the scenario you described, where regressors is in the global environment, you could use:
lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data =
Duncan)
Alternatively, you could use a function:
modincome <- function(regressors){
lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data =
Duncan)
}
modincome(c("education", "prestige"))
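One likely contributor to the error from the paste() attempt in the question is the missing collapse argument: without it, paste() returns one string per regressor rather than a single formula string. A small sketch of the difference, using plain strings only (no model is fitted):

```r
regressors <- c("education", "prestige")

# Without collapse: a character vector with one element per regressor
pieces <- paste("income ~", regressors)

# With collapse: the regressors are first joined into one right-hand side
rhs <- paste(regressors, collapse = " + ")
fo <- as.formula(paste("income ~", rhs))

print(pieces)
print(fo)
```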

Related

Use the formula("string") with felm() from the lfe package while also using fixed effects

I'm trying to run a large regression formula that is created somewhere else as a long string. I also want to use "fixed effects" (individual specific intercepts).
Without fixed effects this works both in the lm() and in felm() functions:
library("lfe")
MyData <- data.frame(country = c("US","US","DE","DE"),
y = rnorm(4),
x = rnorm(4))
testformula <- "y ~ x"
lm(formula(testformula),
data = MyData)
felm(formula(testformula),
data = MyData)
There is also no problem with this kind of regression in felm() if I use country fixed effects:
felm(y ~ x | country,
data = MyData)
However, when I try to combine both the formula() function and the fixed effects argument, I get an error:
felm(formula(testformula) | country ,
data = MyData)
"Error in terms(formula(as.Formula(formula), rhs = 1), specials = "G") :
Object 'country' not found"
I find this strange, because each of these works separately. How can I use the formula() function in felm() and still work with the convenient fixed effects syntax of that function? I don't want to write the fixed effects into the formula because I want to rely on the within transformations of the lfe package.
p.s.: This works in plm() by the way so I'm guessing there is something odd in the felm() function or I input it badly.
library("plm")
plm(formula(testformula),
data = MyData,
index = c("country"),
model = "within",
effect = "individual")
Since the fixed effects are part of the formula*, we can include them in the formula string.
fit1 <- felm(y ~ x | country, data=MyData)
testformula <- "y ~ x | country"
fit2 <- felm(formula(testformula), data=MyData)
fit2
# x
# 0.3382
all.equal(fit1$coefficients, fit2$coefficients)
# [1] TRUE
*you can see this by the fact that function parameters in R are usually separated by commas
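If the base formula arrives as a string from elsewhere, the fixed-effects part can be appended to it before conversion. This is a minimal sketch using only base R; fe is a hypothetical variable name, and the commented-out felm() call assumes the lfe package and the MyData frame from the question.

```r
testformula <- "y ~ x"
fe <- "country"  # hypothetical: name of the fixed-effect column

# Append the fixed-effects part to the string, then convert once
full_formula <- as.formula(paste(testformula, "|", fe))
print(full_formula)

# With lfe installed, this would then be passed on as in the answer:
# lfe::felm(full_formula, data = MyData)
```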

How do I use safely with coxph and subset or weights?

I'm trying to use purrr::safely with coxph so that I can capture error messages. I've made a safe version of coxph as follows
library(survival)
library(purrr)
coxph_safe <- safely(coxph)
This works perfectly when my only inputs are the formula and data, however, if I add another input such as subset or weights, I get the following error message:
simpleError in eval(substitute(subset), data, env): ..3 used in an incorrect context, no ... to look in
Does anyone know how to apply safely to coxph when additional inputs are required? I also get the same error using quietly instead of safely, and also if I make a safe version of lm and specify a subset. I'm using R 3.6.1 and purrr 0.3.2. For now, I've programmed a workaround, where I subset the data before applying coxph_safe, but it would be good to know if there was a better solution.
Here's a simple example:
test1 <- list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
# Without subset
coxph(Surv(time, status) ~ x, test1) # Works as expected
coxph_safe(Surv(time, status) ~ x, test1) # Works as expected
# With subset
coxph(Surv(time, status) ~ x, test1, subset = !sex) # Works as expected
coxph_safe(Surv(time, status) ~ x, test1, subset = !sex) # Error!
Edit
On a related note, I also get a similar error when applying anova to a coxph object generated via coxph_safe.
cox_1 <- coxph(Surv(time, status) ~ x, test1) # Works as expected
anova(cox_1) # Works as expected
cox_1s <- coxph_safe(Surv(time, status) ~ x, test1) # Works as expected
anova(cox_1s$result) # Error in is.data.frame(data) : ..2 used in an incorrect context, no ... to look in
As far as I can tell, this has something to do with how the call is stored. I can fix it by over-writing the call.
cox_1$call # coxph(formula = Surv(time, status) ~ x, data = test1)
cox_1s$result$call # .f(formula = ..1, data = ..2)
cox_1s$result$call <- cox_1$call
anova(cox_1s$result) # Now works as expected
Is there a better way around this?
This actually has nothing to do with purrr::safely. The issue is function nesting. Consider:
f <- function(...) {coxph(...)}
f(Surv(time, status) ~ x, test1) # Works
f(Surv(time, status) ~ x, test1, subset=!sex) # Error
The real reason it fails has to do with the behavior of substitute() inside nested functions. coxph() uses substitute(), and safely() creates a nested function, leading to the scenario described in my link.
To address this issue, we need to wrap coxph() into a function that properly handles non-standard evaluation (NSE):
coxph_nse <- function(...) {eval(rlang::expr(coxph( !!!rlang::enexprs(...) )))}
The new function no longer suffers the same nesting issues and can be safely passed to safely():
coxph_safe <- safely(coxph_nse)
coxph_safe(Surv(time, status) ~ x, test1) # works
cx1 <- coxph_safe(Surv(time, status) ~ x, test1, subset=!sex) # now also works!
anova(cx1$result) # works as well!
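For readers without rlang, the same idea can be sketched in base R. The example below uses lm and mtcars as stand-ins for coxph and test1 so that it needs no extra packages; lm_nse, forward, and wrapper are hypothetical helper names, and this is an illustration of the technique rather than the answer's own code.

```r
# A plain forwarding wrapper, roughly what safely() creates internally
forward <- function(...) lm(...)
failed <- tryCatch(
  forward(mpg ~ disp, mtcars, subset = cyl == 4),
  error = function(e) conditionMessage(e)
)
print(failed)  # a fitted model or an error message, depending on how ..N resolves

# Capture the unevaluated arguments and rebuild the call before evaluating it
lm_nse <- function(...) {
  args <- as.list(substitute(list(...)))[-1]
  eval(as.call(c(list(quote(lm)), args)), parent.frame())
}

# The rebuilt call survives another layer of forwarding
wrapper <- function(...) lm_nse(...)
fit <- wrapper(mpg ~ disp, mtcars, subset = cyl == 4)
print(fit$call)
```

Because the call is rebuilt from the original expressions, the stored call contains the real arguments instead of ..1, ..2 placeholders.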

Pass dynamically variable names in lm formula inside a function

I have a function that takes two parameters:
dataRead (a data frame supplied by the user)
variableChosen (which independent variable the user wants to use in the model)
Note: the dependent variable will always be the first column
But if the user gives me, for example, a data frame called dataGiven whose column names are "Doses" and "Weight",
I want the model to show these names in my results.
My current function fits the lm correctly, but the column names from the data frame are lost (the output instead shows how I extracted the data inside the function):
Results_REG<- function (dataRead, variableChosen){
fit1 <- lm(formula = dataRead[,1]~dataRead[,variableChosen])
return(fit1)
}
When I call:
test1 <- Results_REG(dataGiven, "Weight")
names(test1$model)
shows:
"dataRead[, 1]" "dataRead[, variableChosen]"
I wanted to show my dataframe columns names, like:
"Doses" "Weight"
First off, it's always difficult to help without a reproducible code example. For future posts I recommend familiarising yourself with how to provide such a minimal reproducible example.
I'm not entirely clear on what you're asking, so I assume this is about how to create a function that fits a simple linear model based on data with a single user-chosen predictor var.
Here is an example based on mtcars
results_LM <- function(data, var) {
lm(data[, 1] ~ data[, var])
}
results_LM(mtcars, "disp")
#Call:
#lm(formula = data[, 1] ~ data[, var])
#
#Coefficients:
#(Intercept) data[, var]
# 29.59985 -0.04122
You can confirm that this gives the same result as
lm(mpg ~ disp, data = mtcars)
Or perhaps you're asking how to carry through the column names for the predictor? In that case we can use as.formula to construct a formula that we use together with the data argument in lm.
results_LM <- function(data, var) {
fm <- as.formula(paste(colnames(data)[1], "~", var))
lm(fm, data = data)
}
fit <- results_LM(mtcars, "disp")
fit
#Call:
#lm(formula = fm, data = data)
#
#Coefficients:
#(Intercept) disp
# 29.59985 -0.04122
names(fit$model)
#[1] "mpg" "disp"
outcome <- 'mpg'
model <- lm(mtcars[,outcome] ~ . ,mtcars)
is meant to do the same as:
data(mtcars)
model <- lm( mpg ~ . ,mtcars)
while allowing you to pass a variable (the column name). However, the two are not equivalent: because the left-hand side is no longer the bare name mpg, the . still expands to every column of mtcars, so mpg is also included on the right-hand side of the equation. Not sure if anyone knows how to fix that.
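One possible way to keep mpg off the right-hand side is to build the formula from the column name with reformulate(), so that the response is a real variable name that the dot can exclude. A minimal sketch under that assumption:

```r
outcome <- "mpg"

# "." on the right-hand side expands to every column except the named response
fo <- reformulate(".", response = outcome)
model <- lm(fo, data = mtcars)

# Compare against the hand-written formula
reference <- lm(mpg ~ ., data = mtcars)
print(all.equal(coef(model), coef(reference)))
```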

How to correctly pass formulas associated with a variable name with random effects into fitted regression models in `R`?

I currently have a problem in that I have to pre-specify my formulas before sending them into a regression function. For example, using the stan_gamm4 function in R, we have the following example:
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us then leave everything the same, BUT use an expression for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we do the same thing as before, but with formula.rand taking its place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW: we have that:
br$call$random
> formula.rand
Instead of the original. A lot of bayesian packages rely on accessing br$call$random, so is there a way to use a variable for formula, have it pass in, AND retain the original relation when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
stan_call <- substitute(stan_gamm4(y ~ s(x0) + x1 + s(x2), data=dat, random=.rf,
chains=1, iter=200),
list(.rf=formula.rand))
br <- eval(stan_call)
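A closely related variant uses bquote() instead of substitute(). The sketch below shows it for lm only, since that runs without Stan; model3 is a hypothetical name, and whether the same pattern suits stan_gamm4 is an assumption that would need checking.

```r
model3 <- function(formula) {
  # .(formula) splices the formula's value into the call before evaluation
  eval(bquote(lm(formula = .(formula), data = mtcars)))
}

m3 <- model3(mpg ~ disp)
print(m3$call$formula)
```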
If I understand correctly, your problem is not that stan_gamm4 could be computing incorrect results (which is not the case, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So, it is not as if $call is just some list with stuff in it. In order to access it programmatically in general, you need to evaluate that with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
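The same behaviour can be reproduced with lm, which makes it easy to try without Stan. In this sketch my_formula is a hypothetical variable name; the eval() call works here because the variable is still visible where eval() is called.

```r
my_formula <- mpg ~ disp
m <- lm(my_formula, data = mtcars)

# The stored call holds the symbol my_formula, not the formula itself
print(class(m$call$formula))

# Evaluating the symbol where my_formula is defined recovers the formula;
# formula(m) is another way to obtain it from the fitted model
recovered <- eval(m$call$formula)
print(recovered)
print(formula(m))
```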

Give the formula of a SVM with R

I use this code for my SVM prediction
library(gdata)
data = read.csv2("test.csv")
data
library(e1071)
model <- svm(cote ~ .,data,kernel='radial')
#model1 <- svm(y ~ x1+x2, data=f, type='nu-classification',kernel='radial',tolerance=0.001,gamma=2.5,cost=2,nu=0.8,cross=10,shrinking=FALSE)
predict(model, subset(data, select = - c(cote)))
Now I need to extract the literal formula of this SVM to paste it into a C++ program. How can I do that?
Thx
Maybe the formula can be recovered from the 'model'-object. Try this:
model$call[[2]]
Example:
> ?e1071::predict.svm
> model <- svm(Species ~ ., data = iris)
> model$call[[2]]
# Species ~ .
If you want that as a character variable the usual methods of coercion work as expected.
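As a small sketch of that coercion, the example below uses lm in place of svm so that it runs without e1071; the assumption is that the stored call has the same shape, with the formula as its second element. Note that this yields the R model formula as it was typed, not the fitted decision function.

```r
# lm() stands in for svm() here so that no extra package is needed
model <- lm(Sepal.Length ~ ., data = iris)

fo <- model$call[[2]]   # the formula as a language object
fo_chr <- deparse(fo)   # the same formula as a character string
print(fo_chr)
```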
