Function can't find column name in tibble - r

I'm trying to create a function that uses dplyr syntax to manipulate data, but the function can't find the column names.
# example code below
library(dplyr)
# create sample data
ex.dat = data.frame(ex.IV = c(rep(1,50),
rep(2,50)),
ex.DV = c(rnorm(n = 50, mean = 100, sd = 15),
rnorm(n = 50, mean = 115, sd = 15)))
# create simple function that finds mean and sd from sample data
ex.func = function(data,predictor,predicted){
as.tibble(data) %>%
group_by(predictor) %>%
summarise(
M = mean(predicted),
SD = sd(predicted)
)
}
# run function with sample data
ex.func(data = ex.dat, predictor = ex.IV, predicted = ex.DV)
This produces the following error: "Error: Must group by variables found in .data. Column predictor is not found."
I don't understand why the function isn't assigning ex.IV to predictor.
Running the same code without involving a function, of course, has no issues, e.g.,
as.tibble(ex.dat) %>%
group_by(ex.IV) %>%
summarise(
M = mean(ex.DV),
SD = sd(ex.DV))
produces the intended result, so the issue must reside in the function formatting.
Workarounds like:
ex.func(data = ex.dat, predictor = ex.dat$ex.IV, predicted = ex.dat$ex.DV)
ex.func(data = ex.dat, predictor = data$ex.IV, predicted = data$ex.DV)
receive the same errors.
Clearly I'm not understanding some basic operations of function(). I'd appreciate some pointers.

Because the input arguments are passed unquoted, we can use the curly-curly ({{ }}) operator:
ex.func <- function(data, predictor, predicted){
as_tibble(data) %>%  # as.tibble() is deprecated in favour of as_tibble()
group_by({{predictor}}) %>%
summarise(
M = mean({{predicted}}),
SD = sd({{predicted}})
)
}
Now run as
ex.func(data = ex.dat, predictor = ex.IV, predicted = ex.DV)
If we need a more flexible option, where the argument can be passed either quoted or unquoted, we can convert it to a symbol with ensym() and evaluate it with !!:
ex.func <- function(data, predictor, predicted){
predictor <- rlang::ensym(predictor)
predicted <- rlang::ensym(predicted)
as_tibble(data) %>%  # as.tibble() is deprecated in favour of as_tibble()
group_by(!!predictor) %>%
summarise(
M = mean(!!predicted),
SD = sd(!!predicted)
)
}
Then, we can call either as
ex.func(data = ex.dat, predictor = ex.IV, predicted = ex.DV)
Or
ex.func(data = ex.dat, predictor = "ex.IV", predicted = "ex.DV")
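If the column names will only ever arrive as strings, another option is to skip the symbol conversion. The sketch below is an alternative to the answer above, not part of it: the function name ex.func.chr is hypothetical, and it relies on dplyr's across(all_of()) for grouping and the .data pronoun for looking up a column by name.

```r
library(dplyr)

# Hypothetical string-only variant of ex.func (name chosen for illustration).
ex.func.chr <- function(data, predictor, predicted) {
  as_tibble(data) %>%
    # all_of() selects the column whose name is stored in `predictor`
    group_by(across(all_of(predictor))) %>%
    summarise(
      # .data[[...]] looks up a column by its name given as a string
      M  = mean(.data[[predicted]]),
      SD = sd(.data[[predicted]])
    )
}

# Same shape of sample data as in the question
set.seed(1)
ex.dat <- data.frame(
  ex.IV = rep(1:2, each = 50),
  ex.DV = c(rnorm(50, mean = 100, sd = 15), rnorm(50, mean = 115, sd = 15))
)

res <- ex.func.chr(ex.dat, predictor = "ex.IV", predicted = "ex.DV")
res
```

This version does not accept unquoted names; for that, the {{ }} or ensym() versions above are the ones to use.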

Related

{gtsummary} R package -- Problem passing arguments from `add_difference()` to custom function

I'm trying to use a custom function for testing in a tbl_summary |> add_difference() call.
Basically, whatever I pass the function via the test.arg argument in add_difference doesn't get to the called function. I made a simplified version of my code below--I'm pretty sure it's far from optimized and probably full of other bugs I can't see because I'm stuck at the call! 😕
library(tidyverse)
library(gtsummary)
# Most args are necessary for any function to be compatible with
# add_difference()
testMe <- function(data,
variable,
by,
group = NULL,
type = NULL,
conf.level = 0.95,
adj.vars = NULL,
# test_name is the one I got problems with
test_name = NULL,
ci_type = 0.95,
continuous_variable = NULL,
na.rm,
...) {
# Turn strings into actual symbols for variables
variable <- ensym(variable)
by <- ensym(by)
# Use functions from infer to make statistical computations
estimate <- data |> specify(expr(!!variable ~ !!by)) |>
calculate(stat = test_name, na.rm = na.rm)
booted <- data |>
specify(expr(!!variable ~ !!by)) |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = test_name, na.rm = na.rm)
result_df <- get_confidence_interval(booted,
estimate,
type = "bias-corrected",
point_estimate = estimate) |>
rename(conf.low = lower_ci,
conf.high = upper_ci)
# returns a df with column names as specified in gtsummary docs
return(result_df)
}
This is the gtsummary code:
# Turn the group factor into character (per add_difference()) and reduce
# groups to 2.
pdata <- PlantGrowth |> filter((group == "trt1") | (group == "trt2")) |>
mutate(group = as.character(group))
table <- data |>
tbl_summary(
by = group,
statistic = list(all_continuous() ~ "{median} ({p25}—{p75})"),
) |>
add_difference(
test = list(weight ~ "testMe"),
# Here we pass a string containing the name of the estimate which
# the infer functions will need.
test.args = weight ~ list(test_name = "diff in medians")
# The following selection, which I've seen in documentation and on the
# Internet, doesn't work for me.
# test.args = all_tests("testMe") ~ list(test_name = "diff in medians")
)
In my attempts, the testMe function is called but its test_name argument is always NULL (i.e. default); all subsequent errors are due to its being blank, and if I set a sensible default, the function always returns the subsequent result.
When I change testMe not to have a default test_name, I get an "argument missing" error. In other words, test.args isn't working for me…
Am I getting something wrong? Thanks!

R Catboost to handle categorical variables

I have a question about CatBoost: do I need to preprocess the categorical variables before modeling?
I have 86 variables, including 1 target variable. Of the other 85 variables, 2 are numeric and 83 are categorical (factor type). The target variable is a binary factor, 1 or 0.
Column 1, and Column 4 to Column 85 are factors type.
Column 2 and 3 are numeric.
I am a little confused about cat_features in catboost.train(). I can set a vector of categorical features in the parameters, and I can also set it in catboost.load_pool().
library(catboost)
library(dplyr)
X_train <- train %>% select(-Target)
y_train <- (as.numeric(unlist(train[c('Target')])) - 1)
X_valid <- test %>% select(-Target)
y_valid <- (as.numeric(unlist(test[c('Target')])) - 1)
train_pool <- catboost.load_pool(data = X_train, label = y_train, cat_features = c(0,3:84))
test_pool <- catboost.load_pool(data = X_valid, label = y_valid, cat_features = c(0,3:84))
params <- list(iterations=500,
learning_rate=0.01,
depth=10,
loss_function='RMSE',
eval_metric='RMSE',
random_seed = 1,
od_type='Iter',
metric_period = 50,
od_wait=20,
use_best_model=TRUE,
cat_features = c(0,3:84))
catboost.train(train_pool, test_pool, params = params)
However, after I ran the code above, I got an error:
Error in catboost.train(train_pool, test_pool, params = params) :
catboost/libs/options/plain_options_helper.cpp:339: Unknown option {cat_features} with value "[0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]"
Any help?
Look at this example: cat_features should not go in params <- list(), only in catboost.load_pool().
library(catboost)
library(dplyr)  # for glimpse()
countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')
dataset = data.frame(countries, years, phone_codes, domains, stringsAsFactors = T)
glimpse(dataset)
label_values = c(0,1,1)
fit_params <- list(iterations = 100,
loss_function = 'Logloss',
ignored_features = c(4,9),
border_count = 32,
depth = 5,
learning_rate = 0.03,
l2_leaf_reg = 3.5)
pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
model
I haven't tried CatBoost in R, but see the example on this page:
https://catboost.ai/docs/concepts/r-reference_catboost-train.html
It appears you only pass the categorical variables in the load_pool() call, and NOT in the train() call.
(This works differently from the Python API, where cat_features is passed in the Python fit() call.)
A suggestion: group all the categorical variables in the leftmost columns. That way the index vector is simpler to create. I also have a check in my code to make sure I did it right...
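A check like the one mentioned above could look like the following sketch. It uses base R only, and the helper name factor_col_index is hypothetical. It assumes the indices are 0-based, as in the example above, where columns 1 and 4 are passed as c(0,3).

```r
# Hypothetical helper: 0-based positions of the factor columns in a data frame.
factor_col_index <- function(df) {
  # which() gives 1-based positions; subtract 1 for the 0-based convention
  unname(which(vapply(df, is.factor, logical(1)))) - 1L
}

dataset <- data.frame(
  countries   = c("RUS", "USA", "SUI"),
  years       = c(1900, 1896, 1896),
  phone_codes = c(7, 1, 41),
  domains     = c("ru", "us", "ch"),
  stringsAsFactors = TRUE
)

cat_idx <- factor_col_index(dataset)

# Fail early if the computed indices are not the ones we expect to pass
stopifnot(identical(cat_idx, c(0L, 3L)))
```

The resulting vector can then be supplied as cat_features in catboost.load_pool() instead of a hand-typed range.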

Self-made function works in test but not for my actual data set

I am working with functions. I wrote a function for Basal Area
ba <- function(dbh,na.rm) {
stopifnot(is.numeric(dbh))
answer <- dbh^2*(0.005454)
return(answer)
}
The function works with a test vector. Now I am trying to do some summaries of a dataset I have.
(copy and pasted directly from R)
plot.summary <- trees %>% group_by(MU, Plot, Inv) %>% summarize(year = first(Year), arithemtic.mean = my.mean(dbh, na.rm = TRUE), quadratic.mean = my.q.mean(dbh, na.rm = TRUE), var = my.var(dbh, na.rm = TRUE), n.trees = n())
(Modified spacing to read easier)
plot.summary <- trees %>% group_by(MU, Plot, Inv) %>%
summarize(year = first(Year), arithemtic.mean = my.mean(dbh, na.rm = TRUE),
quadratic.mean = my.q.mean(dbh, na.rm = TRUE), var = my.var(dbh, na.rm = TRUE),
n.trees = n())
When I run it, it says
Error in summarise_impl(.data, dots) :
Column `basal.area` must be length 1 (a summary value), not 19
I am not sure why. The data set has only 18 columns.
My command works perfectly fine when I do not include the basal area part.
I am not sure what I might be missing
Thank you for any help!
The variables you refer to in the group_by function are not in the dataset trees, so I've taken some liberties to create a reproducible example that hopefully fits your needs.
Assuming you wanted to group by a variable like Height, here is a working example:
plot.summary <- trees %>%
group_by(Height) %>%
summarise(mean.basal.area = mean(ba(Girth)),
n.trees = n())
In the above, your function ba is wrapped in mean. This results in a mean basal area for the set of values of Girth that share the same Height.
Is that the kind of thing you want?
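To see why the error appears, note that ba() returns one value per tree, while summarise() in the dplyr version from the question expects a single value per group. The sketch below uses the built-in trees dataset, as in the answer above; choosing sum() as the reducer is an assumption about what is wanted (a total basal area per group), and the na.rm argument is dropped because the original function never used it.

```r
library(dplyr)

# Basal area from the question, without the unused na.rm argument
ba <- function(dbh) {
  stopifnot(is.numeric(dbh))
  dbh^2 * 0.005454
}

# One basal area per tree, so the result is as long as the input
n.values <- length(ba(trees$Girth))

# Reduce to one value per group before handing it to summarise()
plot.summary <- trees %>%
  group_by(Height) %>%
  summarise(total.basal.area = sum(ba(Girth)),
            n.trees = n())
plot.summary
```

Swapping sum() for mean() gives the mean basal area per group, as in the answer above.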

Passing strings into 'contrasts' argument of lme/lmer

I am writing a sub-routine to return output of longitudinal mixed-effects models. I want to be able to pass elements from lists of variables into lme/lmer as the outcome and predictor variables. I would also like to be able to specify contrasts within these mixed-effects models, however I am having trouble with getting the contrasts() argument to recognise the strings as the variable names referred to in the model specification within the same lme/lme4 call.
Here's some toy data,
set.seed(345)
A0 <- rnorm(4,2,.5)
B0 <- rnorm(4,2+3,.5)
A1 <- rnorm(4,6,.5)
B1 <- rnorm(4,6+2,.5)
A2 <- rnorm(4,10,.5)
B2 <- rnorm(4,10+1,.5)
A3 <- rnorm(4,14,.5)
B3 <- rnorm(4,14+0,.5)
score <- c(A0,B0,A1,B1,A2,B2,A3,B3)
id <- rep(1:8,times = 4, length = 32)
time <- factor(rep(0:3, each = 8, length = 32))
group <- factor(rep(c("A","B"), times =2, each = 4, length = 32))
df <- data.frame(id = id, group = group, time = time, score = score)
Now the following call to lme works just fine, with contrasts specified (I know these are the default so this is all purely pedagogical).
mod <- lme(score ~ group*time, random = ~1|id, data = df, contrasts = list(group = contr.treatment(2), time = contr.treatment(4)))
The following also works, passing strings as variable names into lme using the reformulate() function.
t <- "time"
g <- "group"
dv <- "score"
mod1R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df)
But if I want to specify contrasts, like in the first example, it doesn't work
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"), random = ~1|id, data = df, contrasts = list(g = contr.treatment(2), t = contr.treatment(4)))
# Error in `contrasts<-`(`*tmp*`, value = contrasts[[i]]) : contrasts apply only to factors
How do I get lme to recognise that the strings specified in the contrasts argument refer to the variables passed into the reformulate() function?
You should be able to use setNames() on the list of contrasts to apply the full names to the list:
# Using a %>% pipe so need to load magrittr
library(magrittr)
mod2R <- lme(reformulate(paste0(g,"*",t), response = "score"),
random = ~1|id,
data = df,
contrasts = list(g = contr.treatment(2), t = contr.treatment(4)) %>%
setNames(c(g, t))
)
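The reason the original call fails is that list(g = ..., t = ...) uses the literal names "g" and "t", not the strings stored in those variables. A minimal base R sketch of the difference, without the pipe (the object names literal and renamed are only for illustration):

```r
g <- "group"
t <- "time"

# The names are taken literally from the argument names
literal <- list(g = contr.treatment(2), t = contr.treatment(4))
names(literal)   # "g" "t"

# setNames() applies the strings held in g and t instead
renamed <- setNames(list(contr.treatment(2), contr.treatment(4)), c(g, t))
names(renamed)   # "group" "time"
```

The renamed list has the names that the contrasts argument needs to match against the factors in the model.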

R: Predictions from a list of coxph objects on newdata

I am building a series of Cox regression models, and getting predictions from those models on new data. I am able to get the expected number of events in some cases, but not others.
For example, if the formula in the coxph call is written out, then the predictions are calculated. But if the formula is stored in an object and that object is passed instead, I get an error. I also cannot get the predictions if I try to create them within a dplyr piped mutate function (for the function I am writing, this would be the ideal place to get the predictions to work properly).
Any assistance is greatly appreciated!
Thank you,
Daniel
require(survival)
require(tidyverse)
n = 15
# creating tibble of tibbles.
results =
tibble(id = 1:n) %>%
group_by(id) %>%
do(
# creating tibble to evaluate model on
tbl0 = tibble(time = runif(n), x = runif(n)),
# creating tibble to build model on
tbl = tibble(time = runif(n), x = runif(n))
) %>%
ungroup
# it works when the formula is written out inside the coxph call
map2(results$tbl, results$tbl0, ~ predict(coxph( Surv(time) ~ x, data = .x), newdata = .y, type = "expected"))
#but if the formula is previously defined, I get an error
f = as.formula(Surv(time) ~ x)
map2(results$tbl, results$tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
# I also get an error when I try to include in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
)
I figured it out (with the help of a friend). If you define the formula as a string and coerce it to a formula within the function call, everything runs smoothly. I am not sure why it works, but it does!
#define the formula as a string, and call it in the function with as.formula(.)
f = "Surv(time) ~ x"
map2(results$tbl, results$tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
#also works in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
)
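One plausible explanation, which is an assumption and not confirmed in this thread, is that a formula remembers the environment it was created in, and later steps such as predict() may look up objects like .x there. A formula created ahead of time keeps its original environment, while as.formula() applied to a string picks up the environment of the caller. The base R sketch below only demonstrates that difference; the function names are hypothetical.

```r
# A formula created outside any function keeps that outer environment
outer_formula <- y ~ x

# as.formula() on a string: the formula's environment is the calling function's
string_is_local <- function() {
  local_env <- environment()
  f <- as.formula("y ~ x")
  identical(environment(f), local_env)
}

# as.formula() on an existing formula returns it with its environment unchanged
existing_is_local <- function() {
  local_env <- environment()
  f <- as.formula(outer_formula)
  identical(environment(f), local_env)
}

string_is_local()    # TRUE
existing_is_local()  # FALSE
```

If this is the cause, the string version works because the formula is built inside the lambda, where .x and .y exist.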
