I am trying to use rlang::exec in custom functions where I want to pass additional arguments as a list and then splice them. Usually this works without any problem. But I am encountering problems while doing this routine when there is a formula argument involved.
without list
library(rlang)
exec(
.fn = stats::t.test,
formula = wt ~ am,
data = mtcars
)
#>
#> Welch Two Sample t-test
#>
#> data: wt by am
#> t = 5.4939, df = 29.234, p-value = 6.272e-06
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> 0.8525632 1.8632262
#> sample estimates:
#> mean in group 0 mean in group 1
#> 3.768895 2.411000
with list
extra.args <- list(formula = wt ~ am)
exec(
.fn = stats::t.test,
data = mtcars,
!!!extra.args
)
#> Error in t.test.default(data = structure(list(mpg = c(21, 21, 22.8, 21.4, : argument "x" is missing, with no default
How can I get this to work?
I'm not sure this is rlang::exec's fault. The problem really has to do with S3 dispatch: which method gets called depends on the class of the first argument, not on the names of the arguments. With your current calling method, you are passing data= before your formula, so dispatch happens on the data frame and t.test.default is called. The same problem occurs when calling the function directly:
stats::t.test(data=mtcars, formula=wt~am)
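To see the difference side by side, here is a minimal sketch that only uses t.test and mtcars from the examples above. The tryCatch wrapper is my addition; it just captures the error message instead of stopping.

```r
# Formula passed first: dispatch sees a formula, so the formula method runs
res <- stats::t.test(formula = wt ~ am, data = mtcars)
class(res)

# Data frame passed first: dispatch sees a data.frame and falls back to
# t.test.default, which then complains that `x` is missing
msg <- tryCatch(
  stats::t.test(data = mtcars, formula = wt ~ am),
  error = function(e) conditionMessage(e)
)
msg
```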
The easiest way to get around this would be to pass the parameters in the "natural" order for proper S3 dispatch to take place
extra.args <- list(formula = wt ~ am)
exec(
.fn = stats::t.test,
!!!extra.args,
data = mtcars
)
or leave the formula argument unnamed, so that it becomes the first unnamed argument and is used for dispatch:
extra.args <- list(wt ~ am)
exec(
.fn = stats::t.test,
data = mtcars,
!!!extra.args
)
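If the argument order cannot be controlled at the call site, one possible workaround is a small wrapper that moves formula arguments to the front before calling exec. This is only a sketch: exec_formula_first is a hypothetical helper name, not part of rlang, and it assumes the target function dispatches on a formula.

```r
library(rlang)

# Hypothetical helper (not part of rlang): collect the arguments, then move
# any formula arguments to the front before handing them to exec()
exec_formula_first <- function(.fn, ...) {
  args <- list2(...)
  is_fml <- vapply(args, is_formula, logical(1))
  exec(.fn, !!!c(args[is_fml], args[!is_fml]))
}

extra.args <- list(formula = wt ~ am)
res <- exec_formula_first(stats::t.test, data = mtcars, !!!extra.args)
res$statistic
```

The wrapper keeps argument names intact and only changes their order, so functions that do not care about order should behave as before.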
Related
I am trying to create a call to mice::with.mids(), then evaluate it. It appears the call is being created, but then it cannot be evaluated (must be some environment/scoping issue?). I've created a pared down reproducible example below. Any assistance is greatly appreciated! Thank you!
mice_in_tbl_uvregression <-
function(data, # mice data of class mids
method, # regression method, eg. lm, glm
formula = "hp ~ mpg", # character formula (needs to be character for other reasons)
method.args = NULL # named list of other args that will be passed to `method=`
) {
# construct the call
fun_call <-
rlang::call2(
rlang::expr(with),
data = data,
expr = rlang::expr((!!method)(formula = !!as.formula(formula), !!!method.args))
)
# evaluate call
eval(fun_call)
}
set.seed(12345)
mice_in_tbl_uvregression(
data = mice::mice(mtcars, m = 2),
method = lm
)
#> Error in eval(predvars, data, env): object 'hp' not found
Created on 2021-06-27 by the reprex package (v2.0.0)
One option is to build the call as a string, parse it into a language object, and then evaluate it:
mice_in_tbl_uvregression <-
function(data, # mice data of class mids
method, # regression method, eg. lm, glm
formula = "hp ~ mpg", # character formula (needs to be character for other reasons)
method.args = NULL # named list of other args that will be passed to `method=`
) {
# construct the call
fun_call <- parse(text = glue::glue("with(data = {deparse(substitute(data))}, expr = {deparse(substitute(method))}(as.formula({formula})))"))
print(fun_call[[1]])
out <- eval(fun_call)
out$call$expr[[2]] <- out$call$expr[[2]][[2]]
out
}
-testing
set.seed(12345)
out1 <- mice_in_tbl_uvregression(
data = mice::mice(mtcars, m = 2),
method = lm
)
-output
out1
call :
with.mids(data = mice::mice(mtcars, m = 2), expr = lm(hp ~ mpg))
call1 :
mice::mice(data = mtcars, m = 2)
nmis :
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 0 0 0 0 0 0 0
analyses :
[[1]]
Call:
lm(formula = as.formula(hp ~ mpg))
Coefficients:
(Intercept) mpg
324.08 -8.83
[[2]]
Call:
lm(formula = as.formula(hp ~ mpg))
Coefficients:
(Intercept) mpg
324.08 -8.83
I have many regression objects created by lm(). Each one has been built from a different data frame, and these different data frames have different dimensions. But each data frame contains the logical variables x, y, and z. In some cases, I want to update each regression object so that the subset argument is x. In other cases, I want to update each regression object so that the subset argument is y. And in still other cases, I want to update each regression object so that the subset argument is z. What is an efficient way to do this?
This is the inefficient way:
# Set only one of these three variables to be TRUE
subsetX <- TRUE
subsetY <- FALSE
subsetZ <- FALSE
# Now update the regressions.
if (subsetX) {
update(lm1, subset = x)
update(lm2, subset = x)
[...]
} else if (subsetY) {
update(lm1, subset = y)
update(lm2, subset = y)
[...]
} else if (subsetZ) {
update(lm1, subset = z)
update(lm2, subset = z)
[...]
}
This approach is inefficient because there is a lot of duplication across the three code blocks that update the regressions. I would rather do something like
subsetVar <- dplyr::case_when(
subsetX ~ expression(x),
subsetY ~ expression(y),
subsetZ ~ expression(z))
update(lm1, subset = substitute(subsetVar))
update(lm2, subset = substitute(subsetVar))
[...]
That is, I would like to write at most one update() command for each regression object, while still varying the subset argument on the basis of logical (boolean) variables like subsetX and subsetY. Is this possible?
The code above doesn't work; when I try it, I get an Error in xj[i] : invalid subscript type 'symbol' error message.
I've searched other Stack Overflow questions, but I haven't found anything that speaks directly to this problem.
Here's a way to make your workflow a little easier with a simple function and using purrr::map2 to feed it the list of models and subsets you want
library(purrr)
set.seed(2020)
mtcars$x <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$y <- sample(c(TRUE, FALSE), 32, replace = TRUE)
mtcars$z <- sample(c(TRUE, FALSE), 32, replace = TRUE)
lm1 <- lm(mpg ~ hp, mtcars)
subset_lm_by <- function(model, subset = NULL) {
if (subset == "x") {
update(model, subset = x)
} else if (subset == "y") {
update(model, subset = y)
} else if (subset == "z") {
update(model, subset = z)
} else {
# Any other value falls through here, so the function returns NULL
# cat('I only accept x, y or z!')
}
}
models <- list(lm1, lm1, lm1, lm1)
subsets <- list("x", "y", "z", "nonsense")
purrr::map2(.x = models,
.y = subsets,
~ subset_lm_by(model = .x, subset = .y))
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = x)
#>
#> Coefficients:
#> (Intercept) hp
#> 31.21178 -0.08098
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = y)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.83501 -0.07294
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ hp, data = mtcars, subset = z)
#>
#> Coefficients:
#> (Intercept) hp
#> 32.53554 -0.08688
#>
#>
#> [[4]]
#> NULL
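The if/else chain can also be avoided altogether by editing the stored call of the model. The sketch below is an alternative to the approach above, not a replacement for it: update_subset is a hypothetical helper, and it assumes the data used to fit the model is still visible from the environment of the model formula.

```r
# Hypothetical helper: refit `model` with `subset` set to the column named `var`
update_subset <- function(model, var) {
  new_call <- getCall(model)          # the original lm(...) call
  new_call$subset <- as.name(var)     # e.g. "x" becomes the symbol x
  eval(new_call, environment(formula(model)))
}

set.seed(2020)
dat <- mtcars
dat$x <- sample(c(TRUE, FALSE), nrow(dat), replace = TRUE)
dat$y <- sample(c(TRUE, FALSE), nrow(dat), replace = TRUE)
fit <- lm(mpg ~ hp, data = dat)

subsetVar <- "y"                      # choose the subset variable once
fit_sub <- update_subset(fit, subsetVar)
fit_sub$call
```

For several models, something like lapply(list(lm1, lm2), update_subset, var = subsetVar) would apply the same subset to each of them.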
I'm writing a function for my (working) R script in order to clean up my code. I do not have experience with writing functions, but decided I should invest some time into this. The goal of my function is to perform multiple statistical tests while only passing the required dataframe, quantitative variable and grouping variable once. However, I cannot get this to work. For your reference, I'll use the ToothGrowth data frame to illustrate my problem.
Say I want to run a Kruskal-Wallis test and one-way ANOVA on len, to compare different groups named supp, for whatever reason. I can do this separately with
kruskal.test(len ~ supp, data = ToothGrowth)
aov(len ~ supp, data = ToothGrowth)
Now I want to write a function that performs both tests. This is what I had thought should work:
stat_test <- function(mydata, quantvar, groupvar) {
kruskal.test(quantvar ~ groupvar, data = mydata)
aov(quantvar ~ groupvar, data = mydata)
}
But if I then run stat_test(ToothGrowth, "len", "supp"), I get the error
Error in kruskal.test.default("len", "supp") :
all observations are in the same group
What am I doing wrong? Any help would be much appreciated!
You can use deparse(substitute(quantvar)) to get the quoted name of the column you are passing to the function, and this will allow you to build a formula using paste. This is a more idiomatic way of operating in R.
Here's a reproducible example:
stat_test <- function(mydata, quantvar, groupvar) {
A <- as.formula(paste(deparse(substitute(quantvar)), "~",
deparse(substitute(groupvar))))
print(kruskal.test(A, data = mydata))
cat("\n--------------------------------------\n\n")
aov(A, data = mydata)
}
stat_test(ToothGrowth, len, supp)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: len by supp
#> Kruskal-Wallis chi-squared = 3.4454, df = 1, p-value = 0.06343
#>
#>
#> --------------------------------------
#> Call:
#> aov(formula = A, data = mydata)
#>
#> Terms:
#> supp Residuals
#> Sum of Squares 205.350 3246.859
#> Deg. of Freedom 1 58
#>
#> Residual standard error: 7.482001
#> Estimated effects may be unbalanced
Created on 2020-03-30 by the reprex package (v0.3.0)
It looks like you need to convert your variable arguments, given as text strings, into a formula. You can do this by concatenating the strings with paste(). Also, you will need to wrap print() around both of your statistical tests within the function, otherwise only the last one will display.
Here is the modified function:
stat_test <- function(mydata, quantvar, groupvar) {
model_formula <- formula(paste(quantvar, '~', groupvar))
print(kruskal.test(model_formula, data = mydata))
print(aov(model_formula, data = mydata))
}
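As a variation on the same idea, the formula can be built with stats::reformulate instead of paste, and both results can be returned in a list rather than printed. This is a sketch under the assumption that the column names are passed as strings; the list element names are my own choice.

```r
stat_test <- function(mydata, quantvar, groupvar) {
  # reformulate() builds `quantvar ~ groupvar` from character strings
  f <- reformulate(groupvar, response = quantvar)
  list(
    kruskal = kruskal.test(f, data = mydata),
    aov     = aov(f, data = mydata)
  )
}

res <- stat_test(ToothGrowth, "len", "supp")
res$kruskal
res$aov
```

Returning a list keeps both results available for later use, for example res$kruskal$p.value.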
For reference, if you use rstatix (a tidy interface to common statistical tests), column names passed as strings must be converted with sym() and unquoted with !!, while formula() is still needed wherever a formula is expected.
make_kruskal_test <- function(data, quantvar, groupvar) {
library(rstatix, quietly = TRUE)
library(rlang, quietly = TRUE)
formula_expression <- formula(paste(quantvar, "~", groupvar))
quantvar_sym <- sym(quantvar)
shapiro <- shapiro_test(data, !!quantvar_sym) %>% print()
}
sample_data <- tibble::tibble(sample = letters[1:5], mean = 1:5)
make_kruskal_test(sample_data, "mean", "sample")
#> # A tibble: 1 x 3
#> variable statistic p
#> <chr> <dbl> <dbl>
#> 1 mean 0.987 0.967
I know there are several ways to compare regression models. One way is to create models (from simple to multiple) and compare R2, adjusted R2, etc.:
Mod1: y = b0 + b1*x1
Mod2: y = b0 + b1*x1 + b2*x2
Mod3: y = b0 + b1*x1 + b2*x2 + b3*x3 (etc.)
I'm aware that some packages can perform stepwise regression, but I'm trying to do this with purrr. I can create several simple linear models (thanks to this post here), and now I want to know how I can create regression models by adding one specific IV at a time to the equation:
reproducible code
data(mtcars)
library(tidyverse)
library(purrr)
library(broom)
iv_vars <- c("cyl", "disp", "hp")
make_model <- function(nm) lm(mtcars[c("mpg", nm)])
fits <- Map(make_model, iv_vars)
glance_tidy <- function(x) c(unlist(glance(x)), unlist(tidy(x)[, -1]))
t(iv_vars %>% Map(f = make_model) %>% sapply(glance_tidy))
What I want:
Mod1: mpg ~ cyl
Mod2: mpg ~ cyl + disp
Mod3: mpg ~ cyl + disp + hp
Thanks much.
I would begin by creating a tibble with a list column that stores your formulae. Then map lm over the formulae, and map glance over the resulting models.
library(tidyverse)
library(broom)
mtcars %>% as_tibble()
formula <- c(mpg ~ cyl, mpg ~ cyl + disp)
output <-
tibble(formula) %>%
mutate(model = map(formula, ~lm(formula = .x, data = mtcars)),
glance = map(model, glance))
output$glance
output %>% unnest(glance)
You could cumulatively paste over your vector of iv_vars to get the combinations you want. I used the code in this answer to do this.
I use the plus sign as the separator between variables to get ready for the formula notation in lm.
cumpaste = function(x, .sep = " ") {
Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
}
( iv_vars_cum = cumpaste(iv_vars, " + ") )
[1] "cyl" "cyl + disp" "cyl + disp + hp"
Then switch the make_model function to use a formula and a dataset. The explanatory variables, separated by the plus sign, get passed to the function after the tilde in the formula. Everything is pasted together and converted with as.formula, so that lm receives a proper formula.
make_model = function(nm) {
lm(as.formula(paste0("mpg ~", nm)), data = mtcars)
}
Which we can see works as desired, returning a model with both explanatory variables.
make_model("cyl + disp")
Call:
lm(formula = as.formula(paste0("mpg ~", nm)), data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
You'll likely need to rethink how you want to combine the info together, as you will now have differing numbers of columns due to the increased number of coefficients.
A possible option is to add dplyr::bind_rows to your glance_tidy function and then use map_dfr from purrr for the final output.
glance_tidy = function(x) {
dplyr::bind_rows( c( unlist(glance(x)), unlist(tidy(x)[, -1]) ) )
}
iv_vars_cum %>%
Map(f = make_model) %>%
map_dfr(glance_tidy, .id = "model")
# A tibble: 3 x 28
model r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cyl 0.7261800 0.7170527 3.205902 79.56103 6.112687e-10 2 -81.65321 169.3064
2 cyl + disp 0.7595658 0.7429841 3.055466 45.80755 1.057904e-09 3 -79.57282 167.1456
3 cyl + disp + hp 0.7678877 0.7430186 3.055261 30.87710 5.053802e-09 4 -79.00921 168.0184 ...
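If only model-level statistics are needed for the comparison, the differing number of coefficient columns can be sidestepped. The following base R sketch is an alternative I am adding, not the approach above: it rebuilds the cumulative formulas and keeps one row per model with a fixed set of columns. The choice of columns is an assumption about what is useful to compare.

```r
iv_vars <- c("cyl", "disp", "hp")

# Cumulative right-hand sides: "cyl", "cyl + disp", "cyl + disp + hp"
rhs <- Reduce(function(a, b) paste(a, b, sep = " + "), iv_vars, accumulate = TRUE)

# Fit one model per right-hand side
fits <- lapply(rhs, function(r) lm(as.formula(paste("mpg ~", r)), data = mtcars))

# One row per model, same columns regardless of the number of coefficients
comparison <- data.frame(
  model         = rhs,
  r.squared     = vapply(fits, function(f) summary(f)$r.squared, numeric(1)),
  adj.r.squared = vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1)),
  AIC           = vapply(fits, AIC, numeric(1))
)
comparison
```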
I'm creating a custom function in R that takes as input a few different variables and creates a data.frame, a plot, and some summary stats, all stored in a list. I'd like to only print out the summary stats when calling the function, but have the plot and data.frame available when called explicitly.
I think what I want is similar to how lm() operates but I'm not sure how it is achieving that.
When I print the object returned by lm I only get a printout of $call and $coefficients:
lm(mtcars$mpg ~ mtcars$cyl)
Call:
lm(formula = mtcars$mpg ~ mtcars$cyl)
Coefficients:
(Intercept) mtcars$cyl
37.885 -2.876
But clearly behind the scenes there is much more available in the object returned by lm.
lm(mtcars$mpg[1:3] ~ mtcars$cyl[1:3])$residuals
1 2 3
-1.280530e-15 1.280530e-15 8.365277e-31
> lm(mtcars$mpg[1:3] ~ mtcars$cyl[1:3])
Call:
lm(formula = mtcars$mpg[1:3] ~ mtcars$cyl[1:3])
Coefficients:
(Intercept) mtcars$cyl[1:3]
26.4 -0.9
> unclass(lm(mtcars$mpg[1:3] ~ mtcars$cyl[1:3]))
$coefficients
(Intercept) mtcars$cyl[1:3]
26.4 -0.9
$residuals
1 2 3
-1.280530e-15 1.280530e-15 8.365277e-31
$effects
(Intercept) mtcars$cyl[1:3]
-3.741230e+01 1.469694e+00 1.810943e-15
....
$call
lm(formula = mtcars$mpg[1:3] ~ mtcars$cyl[1:3])
$model
mtcars$mpg[1:3] mtcars$cyl[1:3]
1 21.0 6
2 21.0 6
3 22.8 4
I looked at the code for lm but it's not very clear to me what is going on.
The result of a call to lm is an object with the class attribute set to lm. Objects of this class have their own print method (print.lm, which is not exported, so calling it explicitly requires stats:::print.lm). You can do something similar yourself by setting the class attribute of the object returned by your function and then writing your own print method. Here's an example:
my.func <- function(x, y, z){
library(ggplot2)
df <- data.frame(x, y, z)
p <- ggplot(df, aes(x, y)) + geom_point()
ds <- sapply(df, summary)
op <- list(data = df, plot = p, summary = ds)
class(op) <- 'my_list'
op
}
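# Print methods should match the generic's signature, print(x, ...),
# and return their argument invisibly
print.my_list <- function(x, ...){
  print(x$summary)
  invisible(x)
}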
a <- my.func(1:5, 5:1, rnorm(5))
a
print.default(a)
Because the list a has the class attribute set to my_list, once you have created a print method for it, this method is used whenever you print a list with that class. You can see the entire object by explicitly calling print.default. There is a very good explanation of classes in R here: http://adv-r.had.co.nz/OO-essentials.html.
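To make the last point concrete without depending on ggplot2, here is a stripped-down sketch of the same pattern. new_result is a hypothetical constructor that I use in place of my.func; the components stay accessible with $ even though printing shows only the summary.

```r
# Hypothetical constructor: bundles data and a summary into a classed list
new_result <- function(x, y) {
  df <- data.frame(x, y)
  structure(list(data = df, summary = sapply(df, summary)), class = "my_list")
}

# Print method: show only the summary, return the object invisibly
print.my_list <- function(x, ...) {
  print(x$summary)
  invisible(x)
}

a <- new_result(1:5, 5:1)
a                   # prints only the summary table
a$data              # the full data frame is still there
names(unclass(a))   # all stored components
```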