Writing a function around statistical tests in R

I'm writing a function for my (working) R script in order to clean up my code. I do not have experience with writing functions, but decided I should invest some time into this. The goal of my function is to perform multiple statistical tests while only passing the required dataframe, quantitative variable and grouping variable once. However, I cannot get this to work. For your reference, I'll use the ToothGrowth data frame to illustrate my problem.
Say I want to run a Kruskal-Wallis test and a one-way ANOVA on len, comparing the groups defined by supp, for whatever reason. I can do this separately with
kruskal.test(len ~ supp, data = ToothGrowth)
aov(len ~ supp, data = ToothGrowth)
Now I want to write a function that performs both tests. This is what I had thought should work:
stat_test <- function(mydata, quantvar, groupvar) {
  kruskal.test(quantvar ~ groupvar, data = mydata)
  aov(quantvar ~ groupvar, data = mydata)
}
But if I then run stat_test(ToothGrowth, "len", "supp"), I get the error
Error in kruskal.test.default("len", "supp") :
all observations are in the same group
What am I doing wrong? Any help would be much appreciated!

You can use deparse(substitute(quantvar)) to get the name of the column you are passing to the function as a string, which lets you build a formula with paste(). This is a more idiomatic way of operating in R.
Here's a reproducible example:
stat_test <- function(mydata, quantvar, groupvar) {
  A <- as.formula(paste(deparse(substitute(quantvar)), "~",
                        deparse(substitute(groupvar))))
  print(kruskal.test(A, data = mydata))
  cat("\n--------------------------------------\n\n")
  aov(A, data = mydata)
}
stat_test(ToothGrowth, len, supp)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: len by supp
#> Kruskal-Wallis chi-squared = 3.4454, df = 1, p-value = 0.06343
#>
#>
#> --------------------------------------
#> Call:
#> aov(formula = A, data = mydata)
#>
#> Terms:
#>                     supp Residuals
#> Sum of Squares   205.350  3246.859
#> Deg. of Freedom        1        58
#>
#> Residual standard error: 7.482001
#> Estimated effects may be unbalanced
Created on 2020-03-30 by the reprex package (v0.3.0)
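If you would rather get the results back than print them, a variant (a sketch; stat_test_list is just an illustrative name) can return both fitted objects in a list:
stat_test_list <- function(mydata, quantvar, groupvar) {
  # build the formula from the unquoted column names, as above
  f <- as.formula(paste(deparse(substitute(quantvar)), "~",
                        deparse(substitute(groupvar))))
  # return both objects so the caller can inspect or summarise them later
  list(kruskal = kruskal.test(f, data = mydata),
       anova = aov(f, data = mydata))
}
res <- stat_test_list(ToothGrowth, len, supp)
summary(res$anova)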

It looks like you need to convert your variable arguments, given as text strings, into a formula. You can do this by concatenating the strings with paste(). Also, you will need to wrap print() around both statistical tests inside the function; otherwise only the last result will be displayed.
Here is the modified function:
stat_test <- function(mydata, quantvar, groupvar) {
  model_formula <- formula(paste(quantvar, '~', groupvar))
  print(kruskal.test(model_formula, data = mydata))
  print(aov(model_formula, data = mydata))
}
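Called with the column names as strings, for example:
stat_test(ToothGrowth, "len", "supp")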

For reference, if you are using rstatix (tidy versions of the statistical functions), you need to use sym() and !!, together with formula() where needed.
make_kruskal_test <- function(data, quantvar, groupvar) {
  library(rstatix, quietly = TRUE)
  library(rlang, quietly = TRUE)
  formula_expression <- formula(paste(quantvar, "~", groupvar))
  quantvar_sym <- sym(quantvar)
  shapiro <- shapiro_test(data, !!quantvar_sym) %>% print()
}
sample_data <- tibble::tibble(sample = letters[1:5], mean = 1:5)
make_kruskal_test(sample_data, "mean", "sample")
#> # A tibble: 1 x 3
#>   variable statistic     p
#>   <chr>        <dbl> <dbl>
#> 1 mean         0.987 0.967
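Note that formula_expression is built but never used in the snippet above; presumably it is meant to feed rstatix::kruskal_test. A sketch of the missing line inside the function:
kruskal <- kruskal_test(data, formula_expression) %>% print()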

Related

Use lapply with formula to estimate lm with different weights

This is related to, but slightly different from, How to use lapply with a formula? and Calling update within a lapply within a function, why isn't it working?:
I am trying to estimate models with replicate weights. For correct standard errors, I need to estimate the same regression model with each version of the replicate weights. Since I need to estimate many different models and do not want to write a separate loop each time, I tried writing a function that takes the regression data, the regression formula, and the data with the replicate weights as inputs. The function works fine when the formula is specified explicitly inside the lapply() command rather than as a function input (function trythis below), but as soon as I pass the regression formula as a function input (function trythis2 below), it breaks.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
lm.dat <- data.frame(id = 1:500,
                     x1 = sample(1:100, replace = TRUE, size = 500),
                     x2 = runif(n = 500, min = 0, max = 20)) %>%
  mutate(y = 0.2 * x1 + 1.5 * x2 + rnorm(n = 500, mean = 0, sd = 5))
repweights <- data.frame(id = 1:500)
set.seed(123)
for (i in 1:200) {
  repweights[, i + 1] <- runif(n = 500, min = 0, max = 10)
  names(repweights)[i + 1] <- paste0("hrwgt", i)
}
The two functions are defined as follows:
trythis <- function(data, weightsdata, weightsN) {
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x)
    lm(data = data, formula = y ~ x1 + x2, weights = weightsdata[, x]))
  return(res)
}
results1 <- trythis(data=lm.dat, weightsdata=repweights[-1], weightsN=200)
trythis2 <- function(LMformula, data, weightsdata, weightsN) {
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x)
    lm(data = data, formula = LMformula, weights = weightsdata[, x]))
  return(res)
}
While the first function works, applying the second one results in an error:
trythis2(LMformula = y~x1+x2, data=lm.dat, weightsN=200, weightsdata = repweights[-1])
Error in eval(extras, data, env) : object 'weightsdata' not found
Formulas have an associated environment in which the referenced variables are looked up. In your case, the formula you are passing carries the environment of the calling frame, where weightsdata does not exist. To access the variables within the function, you need to reassign the formula's environment to the local frame so lm() can find the correct variables:
trythis3 <- function(LMformula, data, weightsdata, weightsN) {
  rep <- as.list(1:weightsN)
  res <- lapply(rep, function(x) {
    environment(LMformula) <- sys.frames()[[length(sys.frames())]]
    lm(data = data, formula = LMformula, weights = weightsdata[, x])
  })
  return(res)
}
trythis3(LMformula = y ~ x1 + x2, data = lm.dat, weightsN = 200,
         weightsdata = repweights[-1])
Which results in
#> [[1]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>      1.2932       0.1874       1.4308
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>      1.2932       0.1874       1.4308
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept)           x1           x2
#>      1.2932       0.1874       1.4308
...etc
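An equivalent, slightly simpler variant (a sketch; trythis4 is just an illustrative name) calls environment() inside the anonymous function, which returns the same innermost frame that the sys.frames() indexing reaches:
trythis4 <- function(LMformula, data, weightsdata, weightsN) {
  res <- lapply(seq_len(weightsN), function(x) {
    # point the formula at this frame; x lives here, and weightsdata is
    # found one level up, in trythis4's frame
    environment(LMformula) <- environment()
    lm(data = data, formula = LMformula, weights = weightsdata[, x])
  })
  return(res)
}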

Run all possible combinations of linear regression with 2 independent variables

I want to run an OLS regression for every possible combination of 2 independent variables. I have a csv with my data (just one dependent variable and 23 independent variables). I renamed the independent variables in my data set from a to z, and named my dependent variable y (a column called "y"), so they would be recognized by the following code:
#all the combinations
all_comb <- combn(letters, 2)
#create the formulas from the combinations above and paste
text_form <- apply(all_comb, 2, function(x) paste('Y ~', paste0(x, collapse = '+')))
lapply(text_form, function(i) lm(i, data= KOFS05.12))
but this error is shown:
Error in eval(predvars, data, env) : object 'y' not found
I also need the R squared from each model. Any idea how to make this work and run every possible regression?
As mentioned in the comments under the question, check whether you need y or Y. Having addressed that, we can use any of the approaches below. There is no need to rename the columns. We use the built-in mtcars data set as an example since no test data was provided in the question. (Please always provide that in the future.)
1) ExhaustiveSearch: This runs quite fast, so you might be able to try combinations higher than 2 as well.
library(ExhaustiveSearch)
ExhaustiveSearch(mpg ~., mtcars, combsUpTo = 2)
2) combn: Use the lmfun function defined below with combn.
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lmfun <- function(x, dep) do.call("lm", list(reformulate(x, dep), quote(mtcars)))
lms <- combn(nms, 2, lmfun, dep = dep, simplify = FALSE)
names(lms) <- lapply(lms, formula)
3) listcompr: Using lmfun from above and listcompr, we can do the following. Note that we need version 0.1.1 or later of listcompr, which at the time of writing was not yet on CRAN, so we install it from GitHub.
# remotes::install_github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lms2 <- gen.named.list("{nm1}.{nm2}", lmfun(c(nm1, nm2), dep),
                       nm1 = nms, nm2 = nms, nm1 < nm2)
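Since the question also asks for the R squared, it can be extracted from each fit afterwards; for example, with the lms list from the combn approach above:
r2 <- sapply(lms, function(m) summary(m)$r.squared)
head(sort(r2, decreasing = TRUE)) # best two-variable models first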
You should specify your text_form as formulas:
KOFS05.12 <- data.frame(y = rnorm(10),
                        a = rnorm(10),
                        b = rnorm(10),
                        c = rnorm(10))
all_comb <- combn(letters[1:3], 2)
fmla_form <- apply(all_comb, 2, function(x)
  as.formula(sprintf("y ~ %s", paste(x, collapse = "+"))))
lapply(fmla_form, function(i) lm(i, KOFS05.12))
#> [[1]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept)            a            b
#>     0.19763     -0.15873      0.02854
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept)            a            c
#>     0.21395     -0.15967      0.05737
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept)            b            c
#>    0.157140     0.002523     0.028088
Created on 2021-02-17 by the reprex package (v1.0.0)

passing formula in a list to `rlang::exec`

I am trying to use rlang::exec in custom functions where I want to pass additional arguments as a list and then splice them. Usually this works without any problem, but I run into trouble when a formula argument is involved.
without list
library(rlang)
exec(
  .fn = stats::t.test,
  formula = wt ~ am,
  data = mtcars
)
#>
#> Welch Two Sample t-test
#>
#> data: wt by am
#> t = 5.4939, df = 29.234, p-value = 6.272e-06
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> 0.8525632 1.8632262
#> sample estimates:
#> mean in group 0 mean in group 1
#>        3.768895        2.411000
with list
extra.args <- list(formula = wt ~ am)
exec(
  .fn = stats::t.test,
  data = mtcars,
  !!!extra.args
)
#> Error in t.test.default(data = structure(list(mpg = c(21, 21, 22.8, 21.4, : argument "x" is missing, with no default
How can I get this to work?
I'm not sure this is rlang::exec's fault. The problem really has to do with S3 dispatch and the fact that the method called depends on the class of the first supplied argument, not on the names of the parameters. With your current calling method, you are passing data= before your formula. This causes the same problem when calling the function directly:
stats::t.test(data=mtcars, formula=wt~am)
The easiest way to get around this is to pass the parameters in the "natural" order so that proper S3 dispatch takes place:
extra.args <- list(formula = wt ~ am)
exec(
  .fn = stats::t.test,
  !!!extra.args,
  data = mtcars
)
Alternatively, leave the formula parameter unnamed so that it becomes the first positional argument:
extra.args <- list(wt ~ am)
exec(
  .fn = stats::t.test,
  data = mtcars,
  !!!extra.args
)
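For reference, you can list the methods to see what dispatch has to choose from; stats provides two for t.test:
methods(t.test) # t.test.default, t.test.formula
# with no formula in the first position, dispatch lands on
# t.test.default, which requires x and therefore errors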

Large standard error of prediction from parsnip vs base R

It seems like predict is producing a standard error that is too large. I get 0.820 with a parsnip model but 0.194 with a base R model. 0.194 seems more reasonable for a standard error, since the ends of the confidence interval sit about 2 * 0.194 above and below my prediction. What is my problem/misunderstanding?
library(parsnip)
library(dplyr)
# example data
mod_dat <- mtcars %>%
  as_tibble() %>%
  mutate(cyl_8 = as.numeric(cyl == 8)) %>%
  select(mpg, cyl_8)
parsnip_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(as.factor(cyl_8) ~ mpg, data = mod_dat)
base_mod <- glm(as.factor(cyl_8) ~ mpg, data = mod_dat, family = "binomial")
parsnip_pred <- tibble(mpg = 18) %>%
  bind_cols(predict(parsnip_mod, new_data = ., type = 'prob'),
            predict(parsnip_mod, new_data = ., type = 'conf_int', std_error = TRUE)) %>%
  select(!ends_with("_0"))
base_pred <- predict(base_mod, tibble(mpg = 18), se.fit = TRUE, type = "response") %>%
  unlist()
# these give the same prediction but different SE
parsnip_pred
#> # A tibble: 1 x 5
#>     mpg .pred_1 .pred_lower_1 .pred_upper_1 .std_error
#>   <dbl>   <dbl>         <dbl>         <dbl>      <dbl>
#> 1    18   0.614         0.230         0.895      0.820
base_pred
#>          fit.1       se.fit.1 residual.scale
#>      0.6140551      0.1942435      1.0000000
Created on 2020-06-04 by the reprex package (v0.3.0)
--EDIT--
As @thelatemail and @Limey said, using type = "link" for the base model will give the standard error on the logit scale (0.820). However, I want the standard error on the probability scale.
Is there an option in the parsnip documentation that I'm missing? I would like to use parsnip.
@thelatemail is correct. From the online documentation for predict.glm:
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities.
The default is to report using the logit scale; 'response' requests results on the raw probability scale. The parsnip::predict documentation I found does not make it obvious how it chooses the scale on which to return its results, but it is clearly using the raw probability scale.
So both methods are returning correct answers, they're just using different scales.
I don't want to steal an accepted solution from @thelatemail, so I invite them to post a similar answer to this.
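For reference, the two standard errors are consistent with each other: by the delta method, the SE on the probability scale is the link-scale SE times the derivative of the inverse logit, p * (1 - p). A quick check with the numbers above:
p <- 0.614 # predicted probability (same from both models)
se_link <- 0.820 # parsnip's reported SE, on the logit scale
se_link * p * (1 - p) # ~0.194, matching base R's se.fit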
As @thelatemail said, you can get the standard error on the probability scale with parsnip using the arguments type = "raw", opts = list(se.fit = TRUE, type = "response"). But at that point, you might as well use a base model, since the output is exactly the same. However, this is still useful if you are already using a parsnip model and want the standard error output of a base model.
library(parsnip)
library(dplyr)
mod_dat <- mtcars %>%
  as_tibble() %>%
  mutate(cyl_8 = as.numeric(cyl == 8)) %>%
  select(mpg, cyl_8)
parsnip_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(as.factor(cyl_8) ~ mpg, data = mod_dat)
base_mod <- glm(as.factor(cyl_8) ~ mpg, data = mod_dat, family = "binomial")
predict(parsnip_mod, tibble(mpg = 18), type = "raw",
        opts = list(se.fit = TRUE, type = "response")) %>%
  as_tibble()
#> # A tibble: 1 x 3
#>     fit se.fit residual.scale
#>   <dbl>  <dbl>          <dbl>
#> 1 0.614  0.194              1
predict.glm(base_mod, tibble(mpg = 18), se.fit = TRUE, type = "response") %>%
  as_tibble()
#> # A tibble: 1 x 3
#>     fit se.fit residual.scale
#>   <dbl>  <dbl>          <dbl>
#> 1 0.614  0.194              1
Created on 2020-06-11 by the reprex package (v0.3.0)

Dynamically insert variables into a fable model using rlang

I am trying to dynamically insert variables into a fable model.
Data
library(dplyr)
library(fable)
library(stringr)
df <- tsibbledata::aus_retail %>%
  filter(State == "Victoria", Industry == "Food retailing") %>%
  mutate(reg_test = rnorm(441, 5, 2),
         reg_test2 = rnorm(441, 5, 2))
Note that there can be an arbitrary number of regressors included in the tsibble; in this example I have only two (reg_test and reg_test2). All regressor columns will start with reg_.
Problem Function
I have a function where I want to dynamically put the regressor columns into an ARIMA model using the fable package.
test_f <- function(df) {
  var_names <- str_subset(names(df), "reg_") %>%
    paste0(collapse = "+")
  test <- enquo(var_names)
  df %>%
    model(ARIMA(Turnover ~ !!test))
}
test_f(df)
# A mable: 1 x 3
# Key: State, Industry [1]
  State    Industry      `ARIMA(Turnover ~ ~"reg_test+reg_tes~
  <chr>    <chr>         <model>
1 Victoria Food retaili~ <NULL model>
Warning message:
1 error encountered for ARIMA(Turnover ~ ~"reg_test+reg_test2")
[1] invalid model formula in ExtractVars
I know that it is just putting the string var_names into the formula, which does not work, but I can't figure out how to create var_names in such a way that I can enquo() it correctly.
I read through the Quasiquotation section here
I searched SO but have not found the answer yet.
This question using parse_expr() seemed to get closer, but was still not what I wanted.
I know that I can use sym() if I have one variable, but I don't know how many reg_ variables there will be and I want to include them all.
Expected Output
By putting in the variables manually, I can show the output that I expect.
test <- df %>%
  model(ARIMA(Turnover ~ reg_test + reg_test2))
test$`ARIMA(Turnover ~ reg_test + reg_test2)`[[1]]
Series: Turnover
Model: LM w/ ARIMA(2,1,0)(0,1,2)[12] errors
Coefficients:
          ar1      ar2     sma1     sma2  reg_test  reg_test2
      -0.6472  -0.3541  -0.4115  -0.0793   -0.0296    -0.6143
s.e.   0.0473   0.0479   0.0520   0.0446    0.5045     0.5273
sigma^2 estimated as 884.9: log likelihood=-2058.04
AIC=4130.08 AICc=4130.35 BIC=4158.5
I also imagine that there is a better way for me to make the formula in the ARIMA function. If this can fix my problem as well, that will work too.
I appreciate any help!
You're possibly making this a bit more complicated than it needs to be. You can convert a string to a formula by doing as.formula(string), so simply build your formula as a string, convert it to a formula, then feed it to ARIMA. Here's a reprex:
library(dplyr)
library(fable)
library(stringr)
df <- tsibbledata::aus_retail %>%
  filter(State == "Victoria", Industry == "Food retailing") %>%
  mutate(reg_test = rnorm(441, 5, 2),
         reg_test2 = rnorm(441, 5, 2))
test_f <- function(df) {
  var_names <- paste0(str_subset(names(df), "reg_"), collapse = " + ")
  mod <- model(df, ARIMA(as.formula(paste("Turnover ~", var_names))))
  unclass(mod[1, 3][[1]])[[1]]
}
test_f(df)
#> Series: Turnover
#> Model: LM w/ ARIMA(2,1,0)(0,1,1)[12] errors
#>
#> Coefficients:
#>           ar1     ar2     sma1  reg_test  reg_test2
#>       -0.6689  -0.376  -0.4765    0.3363     1.0194
#> s.e.   0.0448   0.045   0.0426    0.4978     0.5436
#>
#> sigma^2 estimated as 883.1: log likelihood=-2058.28
#> AIC=4128.56 AICc=4128.76 BIC=4152.91
Created on 2020-04-23 by the reprex package (v0.3.0)
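As a side note, base R's reformulate() builds the same formula directly from a character vector of term names, so the paste()/as.formula() pair could be replaced. A sketch, assuming ARIMA() accepts the resulting formula object just as it accepts the as.formula() version above:
regs <- str_subset(names(df), "reg_")
model(df, ARIMA(reformulate(regs, response = "Turnover")))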
