{gtsummary} R package -- Problem passing arguments from `add_difference()` to custom function

I'm trying to use a custom function for testing in a `tbl_summary() |> add_difference()` call.
Basically, whatever I pass to the function via the `test.args` argument of `add_difference()` never reaches the called function. I made a simplified version of my code below. I'm pretty sure it's far from optimized and probably full of other bugs I can't see, because I'm stuck at the call! 😕
library(tidyverse)
library(gtsummary)
library(infer)  # specify(), calculate(), generate(), get_confidence_interval()
library(rlang)  # ensym(), expr()
# Most arguments are required for any function to be compatible with
# add_difference()
testMe <- function(data,
                   variable,
                   by,
                   group = NULL,
                   type = NULL,
                   conf.level = 0.95,
                   adj.vars = NULL,
                   # test_name is the one I have problems with
                   test_name = NULL,
                   ci_type = 0.95,
                   continuous_variable = NULL,
                   na.rm = TRUE,
                   ...) {
  # Turn strings into actual symbols for the variables
  variable <- ensym(variable)
  by <- ensym(by)
  # Use functions from infer to make the statistical computations
  estimate <- data |>
    specify(expr(!!variable ~ !!by)) |>
    calculate(stat = test_name, na.rm = na.rm)
  booted <- data |>
    specify(expr(!!variable ~ !!by)) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = test_name, na.rm = na.rm)
  result_df <- booted |>
    get_confidence_interval(level = conf.level,
                            type = "bias-corrected",
                            point_estimate = estimate) |>
    rename(conf.low = lower_ci,
           conf.high = upper_ci)
  # Returns a df with column names as specified in the gtsummary docs
  return(result_df)
}
This is the gtsummary code:
# Turn the group factor into character (per add_difference()) and reduce
# the groups to 2.
pdata <- PlantGrowth |>
  filter(group == "trt1" | group == "trt2") |>
  mutate(group = as.character(group))
table <- pdata |>
  tbl_summary(
    by = group,
    statistic = list(all_continuous() ~ "{median} ({p25}–{p75})")
  ) |>
  add_difference(
    test = list(weight ~ "testMe"),
    # Here we pass a string containing the name of the estimate, which
    # the infer functions will need.
    test.args = weight ~ list(test_name = "diff in medians")
    # The following selection, which I've seen in the documentation and on
    # the Internet, doesn't work for me either.
    # test.args = all_tests("testMe") ~ list(test_name = "diff in medians")
  )
In my attempts, the testMe function is called but its test_name argument is always NULL (i.e. the default); all the subsequent errors are due to its being blank, and if I set a sensible default, the function always returns the result for that default.
When I change testMe so that test_name has no default, I get an "argument missing" error. In other words, test.args isn't reaching my function…
Am I getting something wrong? Thanks!
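One way to narrow this down (a debugging sketch, not a fix; testMe_logged is a name I made up) is to wrap the custom function so it logs exactly which arguments gtsummary forwards, then point test = list(weight ~ "testMe_logged") at the wrapper:

```r
# Debugging sketch: wrap the custom test function so every call reports
# the argument names gtsummary actually forwards to it.
# testMe_logged is a hypothetical wrapper name.
testMe_logged <- function(...) {
  args <- list(...)
  message("testMe received: ", paste(names(args), collapse = ", "))
  testMe(...)  # delegate to the real function defined above
}
```

If test_name never shows up in the logged names, the problem is in how test.args is specified; if it does show up, the problem is inside testMe itself.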

Related

Function can't find column name in tibble

I'm trying to create a function that uses dplyr syntax to manipulate data, but the function can't find the column names.
# example code below
library(dplyr)
# create sample data
ex.dat <- data.frame(ex.IV = c(rep(1, 50),
                               rep(2, 50)),
                     ex.DV = c(rnorm(n = 50, mean = 100, sd = 15),
                               rnorm(n = 50, mean = 115, sd = 15)))
# create a simple function that finds the mean and sd of the sample data
ex.func <- function(data, predictor, predicted) {
  as_tibble(data) %>%
    group_by(predictor) %>%
    summarise(
      M = mean(predicted),
      SD = sd(predicted)
    )
}
# run the function with the sample data
ex.func(data = ex.dat, predictor = ex.IV, predicted = ex.DV)
This produces the following error: "Error: Must group by variables found in .data. Column predictor is not found."
I don't understand why the function isn't assigning ex.IV to predictor.
Running the same code without involving a function has, of course, no issues, e.g.,
as_tibble(ex.dat) %>%
  group_by(ex.IV) %>%
  summarise(
    M = mean(ex.DV),
    SD = sd(ex.DV))
produces the intended result, so the issue must reside in how the function is written.
Workarounds like:
ex.func(data = ex.dat, predictor = ex.dat$ex.IV, predicted = ex.dat$ex.DV)
ex.func(data = ex.dat, predictor = data$ex.IV, predicted = data$ex.DV)
produce the same error.
Clearly I'm not understanding some basic operations of function(). I'd appreciate some pointers.
We could make use of the curly-curly ({{ }}) operator, since the input arguments are unquoted:
ex.func <- function(data, predictor, predicted) {
  as_tibble(data) %>%
    group_by({{ predictor }}) %>%
    summarise(
      M = mean({{ predicted }}),
      SD = sd({{ predicted }})
    )
}
Now run as
ex.func(data = ex.dat, predictor = ex.IV, predicted = ex.DV)
If we need the flexibility of accepting the argument either quoted or unquoted, then we can convert it to a symbol with ensym and evaluate it with !!:
ex.func <- function(data, predictor, predicted) {
  predictor <- rlang::ensym(predictor)
  predicted <- rlang::ensym(predicted)
  as_tibble(data) %>%
    group_by(!!predictor) %>%
    summarise(
      M = mean(!!predicted),
      SD = sd(!!predicted)
    )
}
Then, we can call either as
ex.func(data = ex.dat, predictor = ex.IV, predicted = ex.DV)
Or
ex.func(data = ex.dat, predictor = "ex.IV", predicted = "ex.DV")
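If the column names will always arrive as strings, the .data pronoun is a third option. A sketch (ex.func_chr is a name I made up; it reuses the ex.dat sample data from the question):

```r
library(dplyr)

# Sketch: accept the column names as strings and subset via the .data pronoun
ex.func_chr <- function(data, predictor, predicted) {
  as_tibble(data) %>%
    group_by(.data[[predictor]]) %>%
    summarise(M = mean(.data[[predicted]]),
              SD = sd(.data[[predicted]]))
}

# with the ex.dat sample data from the question:
# ex.func_chr(data = ex.dat, predictor = "ex.IV", predicted = "ex.DV")
```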

How to apply a custom function to nested dataframes?

I'm trying to apply a custom function to a nested dataframe.
I want to apply a machine learning algorithm to predict NA values.
After doing a bit of reading online, it seemed that the map function would be the most applicable here.
I have a section of code that nests the dataframe and then splits the data into a test (data3) and train (data2) set, with the test set containing all the null values for the column to be predicted, and the train set containing all the non-null values to be used to train the ML model.
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data2 = map(data, ~ filter(., !is.na(mean_night_flow))),
         data3 = map(data, ~ filter(., is.na(mean_night_flow))))
Here is the function I intend to use:
my_function <- function(test, train) {
  et <- extraTrees(x = train, y = train[, "mean_night_flow"], na.action = "fuse",
                   ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9)
  test1 <- test
  test1[, "mean_night_flow"] <- 0
  pred <- predict(et, newdata = test1[, "mean_night_flow"])
  test1[, "mean_night_flow"] <- pred
  return(test1)
}
I have tried the following code, however it does not work:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data2 = map(data, ~ filter(., !is.na(mean_night_flow))),
         data3 = map(data, ~ filter(., is.na(mean_night_flow))),
         data4 = map(data3, data2, ~ my_function(.x, .y)))
It gives the following error:
Error: Index 1 must have length 1, not 33
This suggests that it expects a column rather than a whole dataframe. How can I get this to work?
Many thanks
Without testing on your data, I think you're using the wrong map function: purrr::map works on one argument (one list, one vector, whatever) and returns a list. You are passing it two values (data3 and data2), so we need map2:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data2 = map(data, ~ filter(., !is.na(mean_night_flow))),
         data3 = map(data, ~ filter(., is.na(mean_night_flow))),
         data4 = map2(data3, data2, ~ my_function(.x, .y)))
If you find yourself needing more than two, you need pmap. (You can use pmap for one or two arguments as well; it's effectively the same.) The two biggest differences when migrating from map2 to pmap are:
your arguments need to be enclosed within a list, so
map2(data3, data2, ...)
becomes
pmap(list(data3, data2), ...)
and you refer to them by double-dot number position (..1, ..2, ..3, etc.), so
~ my_function(.x, .y)
becomes
~ my_function(..1, ..2)
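A toy illustration of the equivalence (self-contained; the data are made up):

```r
library(purrr)

a <- list(1:3, 4:6)
b <- list(10, 100)

r_map2 <- map2(a, b, ~ .x * .y)          # two-argument mapping
r_pmap <- pmap(list(a, b), ~ ..1 * ..2)  # the same thing via pmap
identical(r_map2, r_pmap)  # TRUE
```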
An alternative that simplifies your overall flow just a little.
my_function <- function(test, train = NULL, fld = "mean_night_flow") {
  if (is.null(train)) {
    train <- test[!is.na(test[[fld]]), , drop = FALSE]
    test  <- test[is.na(test[[fld]]), , drop = FALSE]
  }
  et <- extraTrees(x = train, y = train[, fld], na.action = "fuse",
                   ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9)
  test1 <- test
  test1[, fld] <- 0
  pred <- predict(et, newdata = test1[, fld])
  test1[, fld] <- pred
  return(test1)
}
which auto-populates train based on the missingness of your field. (I also parameterized it in case you ever need to train/test on a different field.) This changes your use to
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data4 = map(data, ~ my_function(.x, fld = "mean_night_flow")))
(It's important to name fld=, since otherwise it will be confused for train.)
If you're planning on reusing data2 and/or data3 later in the pipe or analysis, then this step is not necessarily what you need.
Note: I suspect your function is under-tested or incomplete. The fact that you assign all 0 to test1[, "mean_night_flow"] and then use those zeroes in your call to predict seems suspect. I might be missing something, but I would expect perhaps
test1 <- test
pred <- predict(et, newdata = test1)
test1[ , fld] <- pred
return(test1)
(though copying to test1 is mostly unnecessary with a tibble or data.frame, since R's copy-on-modify semantics leave the original frame untouched anyway; I would be more cautious if you were using class data.table).

Custom ML function not working: undefined columns selected

I am trying to write a custom function to do logistic regression-based ML with the caTools package, but I keep getting the error: undefined columns selected.
I checked the inputs to the xlearn and ylearn arguments inside my logit_boost function and, as explained in the documentation, they are respectively a dataframe containing the features and a vector of labels. So I'm not sure what I am doing wrong.
# needed libraries
library(dplyr)
library(rlang)
library(caTools)
# function body
logit_boost <- function(data, x, y, split_size = 0.8) {
  # creating a dataframe
  data <- dplyr::select(.data = data,
                        !!rlang::enquo(x),
                        !!rlang::enquo(y))
  # for reproducibility
  set.seed(123)
  # creating indices to choose rows from the data
  train_indices <- base::sample(x = base::seq_len(length.out = nrow(data)),
                                size = floor(split_size * nrow(data)))
  # training dataset
  train <- data[train_indices, ]
  # testing dataset
  test <- data[-train_indices, ]
  # defining the label column we are interested in and everything else
  label_train <- train %>% dplyr::select(.data = ., !!rlang::enquo(x))
  data_train  <- train %>% dplyr::select(.data = ., -!!rlang::enquo(x))
  # training model (y ~ x)
  logit_model <- caTools::LogitBoost(xlearn = data_train,
                                     ylearn = label_train)
  # prediction
  # stats::predict(object = logit_model, test, type = "raw")
}
logit_boost(data = mtcars, x = am, y = mpg)
#> Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)): undefined columns selected
In the help(LogitBoost) examples section, Label = iris[, 5] results in a vector, as expected by the ylearn argument of LogitBoost().
In your code, label_train <- train %>% dplyr::select(.data = ., !!rlang::enquo(x)) results in a data.frame: dplyr, by design, defaults to drop = FALSE (and even ignores that argument) when only one column is selected.
We could do:
logit_model <- caTools::LogitBoost(xlearn = data_train, ylearn = dplyr::pull(label_train))
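The difference is easy to see on a toy tibble (a self-contained sketch):

```r
library(dplyr)

d <- tibble(x = 1:3, y = letters[1:3])

sel <- dplyr::select(d, x)  # a one-column selection is still a tibble
vec <- dplyr::pull(d, x)    # pull() extracts a plain vector

is.data.frame(sel)  # TRUE
is.data.frame(vec)  # FALSE -- the shape LogitBoost's ylearn expects
```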

R: Predictions from a list of coxph objects on newdata

I am building a series of Cox regression models, and getting predictions from those models on new data. I am able to get the expected number of events in some cases, but not others.
For example, if the formula in the coxph call is written out, then the predictions are calculated. But if the formula is stored in an object and that object is called, I get an error. I also cannot get the predictions if I try to create them within a dplyr piped mutate call (for the function I am writing, this would be the most ideal place to get the predictions to work properly).
Any assistance is greatly appreciated!
Thank you,
Daniel
require(survival)
require(tidyverse)
n = 15
# creating tibble of tibbles.
results <-
  tibble(id = 1:n) %>%
  group_by(id) %>%
  do(
    # creating tibble to evaluate the model on
    tbl0 = tibble(time = runif(n), x = runif(n)),
    # creating tibble to build the model on
    tbl = tibble(time = runif(n), x = runif(n))
  ) %>%
  ungroup()
# it works when the formula is written out directly in the coxph call
map2(results$tbl, results$tbl0, ~ predict(coxph( Surv(time) ~ x, data = .x), newdata = .y, type = "expected"))
#but if the formula is previously defined, I get an error
f = as.formula(Surv(time) ~ x)
map2(results$tbl, results$tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
# I also get an error when I try to include in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
)
I figured it out (with the help of a friend). If you define the formula as a string and coerce it to a formula within the function call, everything runs smoothly. I am not sure why it works, but it does!
# define the formula as a string, and use it in the function call with as.formula(.)
f = "Surv(time) ~ x"
map2(results$tbl, results$tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
#also works in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
)
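An equivalent sketch is to keep only the string around and build the formula inside a small helper at fit time (make_fit is a name I invented; the commented usage assumes the results tibble from above):

```r
library(survival)

# Sketch: construct the formula from strings where it is needed, so no
# previously captured formula object is involved. make_fit is hypothetical.
make_fit <- function(dat, predictor = "x") {
  coxph(as.formula(paste("Surv(time) ~", predictor)), data = dat)
}

# usage with the nested tibbles from above:
# map2(results$tbl, results$tbl0,
#      ~ predict(make_fit(.x), newdata = .y, type = "expected"))
```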

How can I handle multiple parameters without using nested if else statements?

I am trying to write a function that has up to 6 potential conditions. The function creates a linear model that could have up to three parameters, either scaled or not scaled. I tried using nested ifelse statements, but ran into issues because I have too many conditions (scaled or not × up to 3 IVs = 6 potential conditions). How can I simplify my code so it is easier to read?
This is what I tried to write, but it does not currently work.
test <- data.frame(a = sample.int(20, 10, replace = TRUE),
                   b = sample.int(20, 10, replace = TRUE),
                   c = sample.int(20, 10, replace = TRUE),
                   d = sample.int(20, 10, replace = TRUE))
lm3iv <- function(dv, iv1, iv2 = NA, iv3 = NA, df, scale = F) {
  # store the names of the variables/elements specified in the call
  dn <- dv; in1 <- iv1; in2 <- iv2; in3 <- iv3
  # parse = turn the string into an expression; eval = return the value of
  # the expression passed to it (by parse in this case)
  dv <- eval(parse(text = paste0(df, "$", dv)))
  # return(dv)
  iv1 <- eval(parse(text = paste0(df, "$", iv1)))
  if (!is.na(iv2)) { iv2 <- eval(parse(text = paste0(df, "$", iv2))) }
  if (!is.na(iv3)) { iv3 <- eval(parse(text = paste0(df, "$", iv3))) }
  ifelse(scale,
         ifelse(!is.na(iv3),
                {
                  x <- lm(scale(dv) ~ scale(iv1) + scale(iv2) + scale(iv3))
                  # set the names of the coefficients of the model object x
                  names(x$coefficients) <- c(dn, in1, in2, in3)
                  return(summary(x))
                },
                ifelse(!is.na(iv2),
                       {
                         x <- lm(scale(dv) ~ scale(iv1) + scale(iv2))
                         names(x$coefficients) <- c(dn, in1, in2)
                         return(summary(x))
                       },
                       {
                         x <- lm(scale(dv) ~ scale(iv1))
                         names(x$coefficients) <- c(dn, in1)
                         return(summary(x))
                       },
                       ifelse(!is.na(iv3),
                              return(summary(lm((dv) ~ (iv1) + (iv2) + (iv3)))),
                              ifelse(!is.na(iv2),
                                     return(summary(lm((dv) ~ (iv1) + (iv2)))),
                                     return(summary(lm((dv) ~ (iv1)))))))
         )  # format: ifelse(cond, if_true, if_false)
  )
}
# Tried adding list() before the first summary to return the model output and the model itself; have to add the model again after the comma after summary() - e.g., lm(scale(dv)~scale(iv1)+scale(iv2)+scale(iv3)) (the same model being summarized). Have to create a variable when passing the data into the model. When reading the output -> model.output(model.std.el.ns.tw[[1]], model.std.el.ns.tw[[2]]).
lm3iv("a","b","c","d",df="test",scale=F)
This is the error I'm getting, but I am trying to simplify my code, not just solve the error:
Error in ifelse(scale, ifelse(!is.na(iv3), { :
argument "no" is missing, with no default
Here is your function reorganised. I hope it works, because I didn't focus on what the function does, just on the arguments of ifelse(test, yes, no).
lm3iv <- function(dv, iv1, iv2 = NA, iv3 = NA, df, scale = F) {
  # store the names of the variables/elements specified in the call
  dn <- dv; in1 <- iv1; in2 <- iv2; in3 <- iv3
  # parse = turn the string into an expression; eval = return its value
  dv <- eval(parse(text = paste0(df, "$", dv)))
  iv1 <- eval(parse(text = paste0(df, "$", iv1)))
  if (!is.na(iv2)) { iv2 <- eval(parse(text = paste0(df, "$", iv2))) }
  if (!is.na(iv3)) { iv3 <- eval(parse(text = paste0(df, "$", iv3))) }
  ifelse(test = scale,
         yes = ifelse(test = !is.na(iv3),
                      yes = {
                        x <- lm(scale(dv) ~ scale(iv1) + scale(iv2) + scale(iv3))
                        names(x$coefficients) <- c(dn, in1, in2, in3)
                        return(summary(x))
                      },
                      no = ifelse(test = !is.na(iv2),
                                  yes = {
                                    x <- lm(scale(dv) ~ scale(iv1) + scale(iv2))
                                    names(x$coefficients) <- c(dn, in1, in2)
                                    return(summary(x))
                                  },
                                  no = {
                                    x <- lm(scale(dv) ~ scale(iv1))
                                    names(x$coefficients) <- c(dn, in1)
                                    return(summary(x))
                                  })),
         no = ifelse(test = !is.na(iv3),
                     yes = return(summary(lm((dv) ~ (iv1) + (iv2) + (iv3)))),
                     no = ifelse(test = !is.na(iv2),
                                 yes = return(summary(lm((dv) ~ (iv1) + (iv2)))),
                                 no = return(summary(lm((dv) ~ (iv1)))))))
}
Instead of using a bunch of ifelse statements, perhaps you should build a formula from the function arguments and then pass it to lm?
library(tidyverse)
lm2 <- function(dep_var, ind_var, ..., data, scaled = FALSE) {
  if (scaled) {
    data <- mutate_if(data, is.numeric, scale)
  }
  form <- paste(dep_var, '~',
                paste(ind_var, ..., sep = '+')) %>%
    as.formula()
  lm(form, data = data) %>% summary()
}
# tests
# 1 independent var
lm(Petal.Width ~ Sepal.Width, data = iris) %>% summary()
lm2('Petal.Width', 'Sepal.Width', data = iris)
# 2 independent var
lm(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris) %>% summary()
lm2('Petal.Width', 'Sepal.Width', 'Sepal.Length', data = iris)
# test scale option
iris_scaled <- mutate(iris,
                      Petal.Width = scale(Petal.Width),
                      Sepal.Width = scale(Sepal.Width))
lm(Petal.Width ~ Sepal.Width, data = iris_scaled) %>% summary()
lm2('Petal.Width', 'Sepal.Width', data = iris, scaled = TRUE)
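For what it's worth, base R's reformulate() builds the same formula object without the manual paste(); a small self-contained sketch:

```r
# reformulate() assembles a formula from a character vector of term
# labels plus a response
f <- reformulate(c("Sepal.Width", "Sepal.Length"), response = "Petal.Width")
f
# Petal.Width ~ Sepal.Width + Sepal.Length

summary(lm(f, data = iris))
```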
