Use quasiquotation for formula syntax in a user-created function?

Use quasiquotation for formula syntax in a user-created function? - r

When I run this code:
# Create example data
df <- tibble(age=rnorm(10),
income=rnorm(10))
make_model <- function(response_var, df){
# Create formula
form <- as.formula(response_var ~ .)
# Create model
model <- lm(form , data=df)
# Return coefficients
return(coef(model))
}
make_model(income, df)
I obtain the following error
Error in eval(predvars, data, env) : object 'income' not found
How can I make this function work using quasiquotation? I assume the logic is the same as how we can call library(dplyr) instead of library("dplyr").

Use blast() (to be included in rlang 0.5.0)
blast <- function(expr, env = caller_env()) {
eval_bare(enexpr(expr), env)
}
make_model <- function(data, column) {
f <- blast(!!enexpr(column) ~ .)
model <- lm(f, data = data)
coef(model)
}
df <- data.frame(
age = rnorm(10),
income = rnorm(10)
)
make_model(df, income)
#> (Intercept) age
#> -0.3563103 -0.2200773
Works flexibly:
blast(list(!!!1:3))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3

The following should work:
library(tidyverse)
# Your original function, modified
make_model <- function(df, column) {
column <- enexpr(column)
form <- as.formula(paste0(quo_text(column), " ~ ."))
model <- lm(form, data = df)
return(coef(model))
}
# Your original data and call
tibble(
age = rnorm(10),
income = rnorm(10)
) %>%
make_model(income)

Related

Use lapply with formula to estimate lm with different weights

Related but slightly different to How to use lapply with a formula? and Calling update within a lapply within a function, why isn't it working?:
I am trying to estimate models with replicate weights. For correct standard errors, I need to estimate the same regression model with each version of the replicate weights. Since I need to estimate many different models and do not want to always write a seperate loop, I tried writing a function where I specify both data for the regression, regression formula and the data with the replicate weights. While the function works fine when specifying the formula explicitly inside the lapply() command in the function and not as a function input (function tryout below), as soon as I specify the regression formula as a function input (function tryout2 below), it breaks.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
lm.dat <- data.frame(id=1:500,
x1=sample(1:100, replace=T, size=500),
x2=runif(n=500, min=0, max=20)) %>%
mutate(y=0.2*x1+1.5*x2+rnorm(n=500, mean=0, sd=5))
repweights <- data.frame(id=1:500)
set.seed(123)
for (i in 1:200) {
repweights[,i+1] <- runif(n=500, min=0, max=10)
names(repweights)[i+1] <- paste0("hrwgt", i)
}
The two functions are defined as follows:
trythis <- function(data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) lm(data=data, formula=y~x1+x2, weights=weightsdata[,x]))
return(res)
}
results1 <- trythis(data=lm.dat, weightsdata=repweights[-1], weightsN=200)
trythis2 <- function(LMformula, data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) lm(data=data, formula=LMformula, weights=weightsdata[,x]))
return(res)
}
While the first function works, applying the second one results in an error:
trythis2(LMformula = y~x1+x2, data=lm.dat, weightsN=200, weightsdata = repweights[-1])
Error in eval(extras, data, env) : object 'weightsdata' not found

Formulas have an associated environment in which the referenced variables can be found. In your case, the formula you are passing has the environment of the calling frame. To access the variables within the function, you need to assign the formula to the local frame so it can find the correct variables:
trythis3 <- function(LMformula, data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) {
environment(LMformula) <- sys.frames()[[length(sys.frames())]]
lm(data = data, formula = LMformula, weights = weightsdata[,x])
})
return(res)
}
trythis3(LMformula = y~x1+x2, data = lm.dat, weightsN = 200,
weightsdata = repweights[-1])
Which results in
#> [[1]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
...etc

Creating loop over columns to calculate regression and then compare best combination of variables

I am trying to run a loop which takes different columns of a dataset as the dependent variable and remaining variables as the independent variables and run the lm command.
Here's my code
quant<-function(a){
i=1
colnames1<-colnames(a)
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
for (i in 1:ncol(a)) {
lm_fit <- lm_model %>%
fit(colnames1[i] ~ ., data = set1)
comp_matrix[i]<-tidy(lm_fit)[1,2]
i<-i+1
}
}
When I provide it with a dataset. It is showing this error.
> quant(set1)
Error in model.frame.default(formula = colnames1[i] ~ ., data = data, : variable lengths differ (found for 'Imp of Family')
I will be using comp_matrix for coefficient comparison among models later on. Is there a better way to do this fundamentally?
Sample Data in picture:
Packages used:
library(dplyr)
library(haven)
library(ggplot2)
library(tidyverse)
library(broom)
library(modelsummary)
library(parsnip)

We could change the line of fit with
fit(as.formula(paste(colnames1[i], "~ .")), data = a)
-full function
quant<-function(a){
a <- janitor::clean_names(a)
colnames1 <- colnames(a)
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
out_lst <- vector('list', ncol(a))
for (i in seq_along(a)) {
lm_fit <- lm_model %>%
fit(as.formula(paste(colnames1[i], "~ .")), data = a)
out_lst[[i]]<-tidy(lm_fit)[1,2]
}
out_lst
}
-testing
> dat <- tibble(col1 = 1:5, col2 = 5:1)
> quant(dat)
[[1]]
# A tibble: 1 × 1
estimate
<dbl>
1 6
[[2]]
# A tibble: 1 × 1
estimate
<dbl>
1 6

Run all posible combination of linear regression with 2 independent variables

I want to run every combination possible for every 2 independent variables (OLS regression). I have a csv where I have my data (just one dependent variable and 23 independent variables), and I've tried renaming the variables inside my database from a to z, and I called 'y' to my dependent variable (a column with name "y" which is my dependent variable) to be recognized by the following code:
#all the combinations
all_comb <- combn(letters, 2)
#create the formulas from the combinations above and paste
text_form <- apply(all_comb, 2, function(x) paste('Y ~', paste0(x, collapse = '+')))
lapply(text_form, function(i) lm(i, data= KOFS05.12))
but this error is shown:
Error in eval(predvars, data, env) : object 'y' not found
I need to know the R squared
Any idea to make it work and run every possible regression?

As mentioned in the comments under the question check whether you need y or Y. Having addressed that we can use any of these. There is no need to rename the columns. We use the built in mtcars data set as an example since no test data was provided in the question. (Please always provide that in the future.)
1) ExhaustiveSearch This runs quite fast so you might be able to try combinations higher than 2 as well.
library(ExhaustiveSearch)
ExhaustiveSearch(mpg ~., mtcars, combsUpTo = 2)
2) combn Use the lmfun function defined below with combn.
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lmfun <- function(x, dep) do.call("lm", list(reformulate(x, dep), quote(mtcars)))
lms <- combn(nms, 2, lmfun, dep = dep, simplify = FALSE)
names(lms) <- lapply(lms, formula)
3) listcompr Using lmfun from above and listcompr we can use the following. Note that we need version 0.1.1 or later of listcompr which is not yet on CRAN so we get it from github.
# install.github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lms2 <- gen.named.list("{nm1}.{nm2}", lmfun(c(nm1, nm2), dep),
nm1 = nms, nm2 = nms, nm1 < nm2)

You should specify your text_form as formulas:
KOFS05.12 <- data.frame(y = rnorm(10),
a = rnorm(10),
b = rnorm(10),
c = rnorm(10))
all_comb <- combn(letters[1:3], 2)
fmla_form <- apply(all_comb, 2, function(x) as.formula(sprintf("y ~ %s", paste(x, collapse = "+"))))
lapply(fmla_form, function(i) lm(i, KOFS05.12))
#> [[1]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a b
#> 0.19763 -0.15873 0.02854
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a c
#> 0.21395 -0.15967 0.05737
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) b c
#> 0.157140 0.002523 0.028088
Created on 2021-02-17 by the reprex package (v1.0.0)

update regression objects by passing an expression to "subset"

I have many regression objects created by lm(). Each one has been built from a different data
frame, and these different data frames have different dimensions. But each data frame contains the logical variables x, y, and z. In some cases, I want to update each regression object so that the subset argument is x. In other cases, I want to update each regression object so that the subset argument is y. And in still other cases, I want to update each regression object so that the subset argument is z. What is an efficient way to do this?
This is the inefficient way:
# Set only one of these three variables to be TRUE
subsetX <- TRUE
subsetY <- FALSE
subsetZ <- FALSE
# Now update the regressions.
if (subsetX) {
update(lm1, subset = x)
update(lm2, subset = x)
[...]
} else if (subsetY) {
update(lm1, subset = y)
update(lm2, subset = y)
[...]
} else if (subsetZ) {
update(lm1, subset = z)
update(lm2, subset = z)
[...]
}
This approach is inefficient because there is a lot of duplication across the three code blocks that update the regressions. I would rather do something like
subsetVar <- dplyr::case_when(
subsetX ~ expression(x),
subsetY ~ expression(y),
subsetZ ~ expression(z))
update(lm1, subset = substitute(subsetVar))
update(lm2, subset = substitute(subsetVar))
[...]
That is, I would like to write at most one update() command for each
regression object, while still varying the subset argument on the basis of
logical (boolean) variables like subsetX and subsetY. Is this possible?
The code above doesn't work; when I try it, I get an Error in xj[i] : invalid subscript type 'symbol' error message.
I've searched other Stack Overflow questions, but I haven't found anything that speaks directly to this problem.

Here's a way to make your workflow a little easier with a simple function and using purrr::map2 to feed it the list of models and subsets you want
library(purrr)
set.seed(2020)
mtcars$x <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$y <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$z <- sample(c(TRUE, FALSE), 32, replace = TRUE)
lm1 <- lm(mpg ~ hp, mtcars)
subset_lm_by <- function(model, subset = NULL) {
if (subset == "x") {
update(model, subset = x)
} else if (subset == "y") {
update(model, subset = y)
} else if (subset == "z") {
update(model, subset = z)
} else {
# cat('I only accept x, y or z!')
}
}
models <- list(lm1, lm1, lm1, lm1)
subsets <- list("x", "y", "z", "nonsense")
purrr::map2(.x = models,
.y = subsets,
~ subset_lm_by(model = .x, subset = .y))
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = x)
#>
#> Coefficients:
#> (Intercept) hp
#> 31.21178 -0.08098
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = y)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.83501 -0.07294
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = z)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.53554 -0.08688
#>
#>
#> [[4]]
#> NULL

R formula in survfit

I am having trouble with formulas, environments, and survfit().
Things work fine for lm() but they fail for survfit().
General problem statement:
I am fitting a series of formulas to some data. So, I call the
modeling function with the formula passed as a variable. Later,
I want to work with the formula from the fitted object.
(From my naive point of view, the trouble comes from survfit not
recording the environment.)
Detailed Example
Expected behaviour as seen in lm():
library("plyr")
preds <- c("wt", "qsec")
f <- function() {
lm(mpg ~ wt, data = mtcars)
}
fits <- alply(preds, 1, function(pred)
{
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$call$formula
##modform
formula(fits[[1]])
## mpg ~ wt
## <environment: 0x1419d1a0>
Even though fits[[1]]$call$formula resolves to modform I can
still get the original formula with formula(fits[[1]]).
But things fail for survfit():
library("plyr")
library("survival")
preds <- c("resid.ds", "rx", "ecog.ps")
fits <-
alply(preds, 1, function(pred)
{
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
print(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$call$formula
## modform
formula(fits[[1]])
## Error in eval(expr, envir, enclos) : object 'modform' not found
Here (and in contrast to lm-fits), formula(fits[[1]]) does not
work!
So, my specific question is: How can I get back the formula used
to fit fits[[1]]?

The issue is that when x$formula is NULL, for an lm object there is a backup plan to get the formula; this doesn't exist for survfit objects
library("plyr")
library("survival")
preds <- c("wt", "qsec")
f <- function() lm(mpg ~ wt, data = mtcars)
fits <- alply(preds, 1, function(pred) {
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$formula
# NULL
The formula can be extracted with formula(fits[[1]]) which uses the formula generic. The lm S3 method for formula is
stats:::formula.lm
# function (x, ...)
# {
# form <- x$formula
# if (!is.null(form)) {
# form <- formula(x$terms)
# environment(form) <- environment(x$formula)
# form
# }
# else formula(x$terms)
# }
So when fits[[1]]$formula returns NULL, forumla.lm looks for a terms attribute in the object and finds the formula that way
fits[[1]]$terms
The survfit objects don't have x$formula or x$terms, so formula(x) givens an error
preds <- c("resid.ds", "rx", "ecog.ps")
fits <- alply(preds, 1, function(pred) {
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$formula
# NULL
formula(fits[[1]]) ## error
formula(fits[[1]]$terms)
# list()
You can fix this by inserting the formula into the call and evaluating it
modform <- as.formula(paste("Surv(futime, fustat)", 'rx', sep = " ~ "))
substitute(survfit(modform, data = ovarian), list(modform = modform))
# survfit(Surv(futime, fustat) ~ rx, data = ovarian)
eval(substitute(survfit(modform, data = ovarian), list(modform = modform)))
# Surv(futime, fustat) ~ rx
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA
Or by manually putting the formula into x$call$formula
fit <- survfit(modform, data = ovarian)
fit$call$formula
# modform
fit$call$formula <- modform
fit$call$formula
# Surv(futime, fustat) ~ rx
fit
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Use quasiquotation for formula syntax in a user-created function? - r

Related

Use lapply with formula to estimate lm with different weights

Creating loop over columns to calculate regression and then compare best combination of variables

Run all posible combination of linear regression with 2 independent variables

update regression objects by passing an expression to "subset"

R formula in survfit

Categories

Resources