How to use tidy evaluation with a non-tidyverse package

I am trying to develop a function which will calculate the means, standard error, and confidence intervals of some survey data. I need to do this repeatedly over a number of different variables with a bunch of different filter statements.
DATA
df <- data.frame(address_id = rep(c(1,1,1,2,2,2,3,3,3,4,4,4), 5),
                 person_id = rep(c(1,2,3), 20),
                 sex = as.factor(rep(c("male","female"), 30)),
                 response_var = as.factor(rep(seq(1, 6, 1), 10)),
                 weight = runif(60, 50, 200))
Example that works without a function
# create survey design
design <- survey::svydesign(data = df,
                            strata = ~ address_id,
                            id = ~ person_id,
                            nest = TRUE,
                            weights = ~ weight)
# calculate weighted mean
mean_se <- survey::svymean(~sex, design)
# calculate confidence intervals
ci <- confint(mean_se)
My function
create_mean_and_cis <- function(data, var){
  design <- survey::svydesign(data = data,
                              strata = ~ address_id,
                              id = ~ person_id,
                              nest = TRUE,
                              weights = ~ weight)
  mean_se <- survey::svymean(~{{var}}, design)
  ci <- confint(mean_se) %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable")
  output <- mean_se %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable") %>%
    dplyr::left_join(ci)
  return(output)
}
# function call
create_mean_and_cis(sex)
When I run it, I get an error message saying:
Error in survey::svydesign(data = data, strata = ~address_id, id = ~person_id, :
object 'sex' not found
I can't understand what is going wrong. The tidy evaluation works perfectly when I use the curly-curly "{{var}}" within other functions. Why doesn't it work here? Can anyone help?
I have tried several variations of quasiquotation, including !!enquo(sex), sym(sex), !!sym(sex), {{sex}}, and eval(parse(sex)), none of which has produced a working result.

The actual error you are getting is because you aren't passing the data argument to create_mean_and_cis (you are doing create_mean_and_cis(sex), whereas it should be create_mean_and_cis(df, sex)).
However, this will not on its own fix the problem, since the curly-curly operator won't work inside a formula. Instead, you need to do something like:
create_mean_and_cis <- function(data, var){
  var <- deparse(substitute(var))
  design <- survey::svydesign(data = data,
                              strata = ~ address_id,
                              id = ~ person_id,
                              nest = TRUE,
                              weights = ~ weight)
  mean_se <- survey::svymean(as.formula(paste('~', var)), design)
  ci <- confint(mean_se) %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable")
  output <- mean_se %>%
    tibble::as_tibble() %>%
    tibble::rownames_to_column("variable") %>%
    dplyr::left_join(ci)
  return(output)
}
Which allows
create_mean_and_cis(df, sex)
# Joining, by = "variable"
# # A tibble: 2 x 5
#   variable  mean    SE `2.5 %` `97.5 %`
#   <chr>    <dbl> <dbl>   <dbl>    <dbl>
# 1 1        0.481 0.163   0.161    0.802
# 2 2        0.519 0.163   0.198    0.839

If you want to use more advanced tidy evaluation features, you can use rlang::eval_tidy:
mean_se <- rlang::eval_tidy(rlang::quo(survey::svymean(~{{var}}, design)))
You wrap your expression in a quosure, where you can use the {{ }} and !! style syntax. Both quo and eval_tidy will recognize this syntax. These features are unique to functions that use the rlang package; you cannot use them directly in an arbitrary R function without some sort of rlang wrapper.
Or, per @Lionel's suggestion, you can use rlang::inject to build the formula and pass that to svymean:
formula <- rlang::inject(~!!substitute(var))
mean_se <- survey::svymean(formula, design)
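Putting it together, here is a minimal sketch of the whole function using the inject approach (the name create_mean_and_cis_inject is just illustrative, and it assumes the df defined in the question):
create_mean_and_cis_inject <- function(data, var) {
  design <- survey::svydesign(data = data,
                              strata = ~ address_id,
                              id = ~ person_id,
                              nest = TRUE,
                              weights = ~ weight)
  # splice the unevaluated argument into a one-sided formula, e.g. ~sex
  formula <- rlang::inject(~!!substitute(var))
  survey::svymean(formula, design)
}
create_mean_and_cis_inject(df, sex)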

Related

Removing certain parts of modelsummary in R at specific statistics

I am using gamm4::gamm4 to model longitudinal change.
I am trying to use the modelsummary::modelsummary function to create an output table of the following results:
I would like to add t-values and standard errors to the fixed-effects output, and remove the empty rows from the random-effects output.
model_lmer <- gamm4(Y ~ Tract + s(Age, by = Tract, k = 10) + Sex,
                    data = DF1,
                    random = ~ (0 + Tract | ID))
modelsummary(model_lmer$mer,
             statistic = c("s.e. = {std.error}",
                           "t = {statistic}"))
But I am struggling to write the correct syntax to remove the "t" and "s.e." from the random effects output.
This is kind of tricky, actually. The issue is that modelsummary() automatically drops empty rows when they are filled with NA or an empty string "". However, since glue strings can include arbitrary text, it is hard to think of a general way to figure out if a row is empty or not, because modelsummary() cannot know ex ante what constitutes an empty string.
If you have an idea on how this check could be implemented, please report it on GitHub:
https://github.com/vincentarelbundock/modelsummary
In the meantime, you could use the powerful tidy_custom.CLASSNAME mechanism to customize the statistic and p.value statistics directly instead of through a glue string:
library(gamm4)
library(modelsummary)
# simulate
x <- runif(100)
fac <- sample(1:20, 100, replace = TRUE)
eta <- x^2 * 3 + fac / 20
fac <- as.factor(fac)
y <- rpois(100, exp(eta))
# fit
mod <- gamm4(y ~ s(x), family = poisson, random = ~ (1 | fac))
# customize
tidy_custom.glmerMod <- function(x) {
  out <- parameters::parameters(x)
  out <- insight::standardize_names(out, style = "broom")
  out$statistic <- sprintf("t = %.3f", out$statistic)
  out$p.value <- sprintf("p = %.3f", out$p.value)
  out
}
# summarize
modelsummary(mod$mer,
             statistic = c("{statistic}", "{p.value}"))
               Model 1
X(Intercept)   1.550
               t = 17.647
               p = 0.000
Xs(x)Fx1       0.855
               t = 4.445
               p = 0.000
Num.Obs.       100
RMSE           2.49
Note that I used simple glue strings in statistic = "{p.value}"; otherwise the values would be wrapped in parentheses, as is the default for standard errors.
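For reference, a bare statistic name like the default "std.error" is wrapped in parentheses automatically, while glue strings are printed verbatim; so, to my understanding, these two calls should render the same way (using the mod fitted above):
modelsummary(mod$mer, statistic = "std.error")     # default, parentheses added
modelsummary(mod$mer, statistic = "({std.error})") # equivalent glue-string spelling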

Dynamic variable names in svydesign from survey package

I want to add columns to a survey.design created with the survey package, which can be done as following:
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dclus2 <- transform(dclus1, api00_b = api00 + 1)
svymean(~ api00, design = dclus2)
#> mean SE
#> api00 644.17 23.542
svymean(~ api00_b, design = dclus2)
#> mean SE
#> api00_b 645.17 23.542
For a more complex task, I need to create these variable names dynamically from external vectors. The following produces an error, but I think it illustrates what I want to achieve:
vars <- c("api00_a", "api00_b")
dclus2 <- transform(dclus1, vars[[2]] = api00 + 1)
How could dynamic names for the new columns be implemented?
I don't think you can use a vector like this on the left-hand side of the equal sign in R. You don't have to use transform, which calls survey:::update.survey.design, though. You could just add your new variable directly:
dclus2 <- dclus1
dclus2$variables[, vars[[1]]] <- dclus2$variables[, "api00"] + 1
This is the same as creating the new variable before converting to a survey.design object, as long as you do not use any survey functions to create the new variable. Following Anthony's comment:
apiclus2 <- apiclus1
apiclus2[, vars[[1]]] <- apiclus2[, "api00"] + 1
dclus_prep_2 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus2, fpc = ~fpc)
You might prefer to use srvyr, which allows your kind of programming with dplyr's !! and := operators:
library(srvyr)
dclus_srvyr_1 <- as_survey_design(.data = apiclus1,
                                  ids = dnum,
                                  weights = pw,
                                  fpc = fpc)
dclus_srvyr_2 <- mutate(dclus_srvyr_1,
                        !!vars[[1]] := api00 + 1)
All versions have the same result:
lapply(list(dclus2, dclus_prep_2, dclus_srvyr_2),
       function(design) svymean(~api00_a, design = design))
[[1]]
mean SE
api00_a 645.17 23.542
[[2]]
mean SE
api00_a 645.17 23.542
[[3]]
mean SE
api00_a 645.17 23.542
Here's a possible solution using purrr. Note that transform() cannot take a dynamic name on the left-hand side of =, so the new column is created under a temporary name and renamed afterwards:
library(purrr)
vars <- c("api00_a", "api00_b")
transform_func <- function(data, var) {
  out <- transform(data, .new = api00 + 1)
  # rename the temporary column to the dynamic name
  names(out$variables)[names(out$variables) == ".new"] <- var
  out
}
map(vars, ~transform_func(dclus1, .))
Which gives us the following list:
[[1]]
1 - level Cluster Sampling design
With (15) clusters.
update(`_data`, ...)
[[2]]
1 - level Cluster Sampling design
With (15) clusters.
update(`_data`, ...)
You can do this with bquote. For example:
vars <- c("api00_plus_1", "api00_plus_2")
exprs<-list(quote(api00+1),quote(api00+2))
names(exprs)<-vars
bquote(update(dclus1,..(exprs)), splice=TRUE)
eval(bquote(update(dclus1,..(exprs)), splice=TRUE))
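For reference, the first bquote() line above prints the spliced call, which expands to something like:
#> update(dclus1, api00_plus_1 = api00 + 1, api00_plus_2 = api00 + 2)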
Here's another chunk from inside the survey package that converts any string variables mentioned in a formula to factors:
strings_to_factors <- function(formula, design){
  allv <- intersect(all.vars(formula), colnames(design))
  vclass <- sapply(model.frame(design)[, allv, drop = FALSE], class)
  if (!any(vclass == "character")) return(design)
  vfix <- names(vclass)[vclass == "character"]
  l <- as.list(vfix)
  names(l) <- vfix
  fl <- lapply(l, function(li) bquote(factor(.(as.name(li)))))
  expr <- bquote(update(design, ..(fl)), splice = TRUE)
  eval(expr)
}
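A hypothetical usage sketch (the stype_chr column is invented for illustration, reusing the api data from above):
apiclus1$stype_chr <- as.character(apiclus1$stype) # make a character column
dchar <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
dfac <- strings_to_factors(~stype_chr, dchar)      # converted to factor on the fly
svymean(~stype_chr, dfac)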

tbl_uvregression gives Error: C stack usage is too close to the limit

I have a df with 2,946 obs and 600 variables.
I want to produce a table of univariate regression models for 599 variables from the dataset. To do this, I am using the tbl_uvregression() function from the 'gtsummary' package.
Here's my code:
RAPOA_labelled[,-1] %>% # remove ID column
  tbl_uvregression(
    method = glm,
    y = GIR.2cat, # dependent variable
    method.args = list(family = binomial),
    exponentiate = TRUE,
    pvalue_fun = ~style_pvalue(.x, digits = 3)
  ) %>%
  add_nevent() %>% # add number of events of the outcome
  bold_p() %>%     # bold p-values under a given threshold (default 0.05)
  bold_labels()
Every time it is run, I get the following error:
Error: C stack usage 7971168 is too close to the limit.
My Cstack_info() is:
> Cstack_info()
      size    current  direction eval_depth
   7969177      12800          1          2
EDIT
As the final output, I needed a table with the estimate, standard error, p-value, odds ratio, and confidence interval for each variable in the data frame. tbl_regression did not work for me, so I finally did it with a loop.
I’ll leave the code here in case it serves anyone.
name <- colnames(datos_rapoa_gir[, -c(1:2)]) # remove the ID and outcome columns
term <- c()
B <- c()
SE <- c()
pvalue <- c()
OR <- c()
lowIC <- c()
highIC <- c()
for (i in seq_along(name)) {
  mod_formula <- as.formula(sprintf("GIR.2cat ~ %s", name[i]))
  mod <- glm(formula = mod_formula, family = "binomial",
             data = datos_rapoa_gir, na.action = na.omit)
  tidied <- broom::tidy(mod) # tidy once per model rather than four times
  ci <- exp(confint(mod))    # profile CIs on the odds-ratio scale
  term <- c(term, tidied$term)
  B <- c(B, tidied$estimate)
  SE <- c(SE, tidied$std.error)
  pvalue <- c(pvalue, tidied$p.value)
  OR <- c(OR, exp(mod$coefficients))
  lowIC <- c(lowIC, ci[, 1])
  highIC <- c(highIC, ci[, 2])
}
univars <- data.frame(variable = term, B = B, SE = SE, pvalue = pvalue,
                      OR = OR, LowIC = lowIC, HighIC = highIC) %>%
  remove_rownames()
A tbl_uvregression() object can become large (containing the full data, the model object, etc.), and it looks like your machine's memory can't handle it.
What do you want the output to look like? It sounds like you're going to end up with a table with 600 rows (likely more if you have categorical covariates).
Here are some steps you can take to reduce the size:
tbl_uvregression() is a helper that iterates over the columns in a data frame, builds a model for each column, calls tbl_regression() on each model, and combines all the tables with tbl_stack(). Rather than using the helper, follow these steps yourself to reduce the size. After you call tbl_regression(), you can further reduce the size of each object with tbl_butcher(); a sketch of this approach follows.
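A minimal sketch of that manual approach, reusing the names from the question (RAPOA_labelled with outcome GIR.2cat):
library(gtsummary)
predictors <- setdiff(names(RAPOA_labelled[, -1]), "GIR.2cat")
tbls <- lapply(predictors, function(v) {
  mod <- glm(reformulate(v, response = "GIR.2cat"),
             data = RAPOA_labelled, family = binomial)
  tbl <- tbl_regression(mod, exponentiate = TRUE)
  tbl_butcher(tbl) # drop heavy internals from each table before stacking
})
tbl_stack(tbls)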
Happy Programming!

Multiple imputation and mlogit for a multinomial regression

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".
So making the data
library(mlogit)
library(nnet)
library(tidyverse)
library(mice)
df <- data.frame(vax = sample(1:6, 500, replace = T),
                 age = runif(500, 12, 18),
                 var1 = sample(1:2, 500, replace = T),
                 var2 = sample(1:5, 500, replace = T))
# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp
df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)
# Impute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)
It works using nnet's multinom:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))
But it throws an error when I try to reshape the data into mlogit format:
test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))
Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?
You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2,
         data = data.idx,
         reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
How with.mids works
When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.
If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:
1. It loops over all imputed datasets.
2. It uses complete to get one dataset.
3. It runs the expression with the dataset as the environment.
The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.
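In sketch form, with.mids behaves roughly like this simplified reimplementation (for illustration only; the real function also records the call and returns a proper mira object):
with_mids_sketch <- function(data, expr) {
  expr <- substitute(expr)              # capture the expression unevaluated
  lapply(seq_len(data$m), function(i) {
    data.i <- mice::complete(data, i)   # step 2: extract one completed dataset
    eval(expr, envir = data.i)          # step 3: evaluate with the dataset as the environment
  })
}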
The problems
This is where both your problems derive from:
In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.
Fixing multinom
The solution for your multinom line is simple: just don't specify data:
multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))
As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.
Fixing dfidx (and mlogit)
We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:
analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1",
         nests = list(type1 = c("1", "2"), type2 = c("3", "4"), type3 = c("5", "6")))
})
From there, all we have to do is make something that looks like a mira object, so that we can still use pool:
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))
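As an aside, mice also exports as.mira(), which turns a list of fitted models into a mira object directly, so the hand-built list above can be replaced with:
test <- mice::as.mira(analyses)
summary(pool(test))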
Offering this as a way forward to circumvent the error with dfidx():
df5 <- df4$imp %>%
  # work with a list, where each top-level element is a different imputation run (imp_n)
  map(~as.list(.x)) %>%
  transpose() %>%
  # for each run, fill in the missing values and return the full (imputed) data set
  map(function(imp_n.x) {
    df_out <- df4$data
    df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
    df_out$age[is.na(df_out$age)] <- imp_n.x$age
    df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
    df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
    return(df_out)
  }) %>%
  # no errors with dfidx() now
  map(function(imp_n.x) {
    dfidx(data = imp_n.x, choice = "vax", shape = "wide")
  })
However, I'm not too familiar with mlogit(), so I can't help beyond this.
Update 8/2/21
As @slamballais mentioned in their answer, the issue is with the dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to each element of a list. So to get the code working properly, you would want to change mldata to imp_n.x:
# To fit mlogit() for each imputed data set
df5 %>%
  map(function(imp_n.x) {
    # formula as specified in the comments
    mlogit(vax ~ 1 | age + var1 + var2,
           data = imp_n.x,
           reflevel = "1",
           nests = list(type1 = c('1', '2'),
                        type2 = c('3', '4'),
                        type3 = c('5', '6')))
  })

R order lapply output from a function with multiple outputs by variable (column) rather than by function

I have a function in R that calls multiple other functions, including a custom one. I then use lapply to run the combined function across multiple variables. However, the output is produced in the order of
function1: variable a, variable b, variable c
function2: variable a, variable b, variable c
When what I would like is for it to be the other way around:
variable a: function 1, function 2...
variable b: function 1, function 2...
I have recreated an example below using the mtcars dataset, with number of cylinders as a predictor variable, and vs and am as outcome variables.
library(datasets)
library(tidyverse)
library(skimr)
library(car)
data(mtcars)
mtcars_binary <- mtcars %>%
dplyr::select(cyl, vs, am)
# logistic regression function
logistic.regression <- function(logmodel) {
  dev <- logmodel$deviance
  null.dev <- logmodel$null.deviance
  modelN <- length(logmodel$fitted.values)
  R.lemeshow <- 1 - dev / null.dev
  R.coxsnell <- 1 - exp(-(null.dev - dev) / modelN)
  R.nagelkerke <- R.coxsnell / (1 - (exp(-(null.dev / modelN))))
  cat("Logistic Regression\n")
  cat("Hosmer and Lemeshow R^2 ", round(R.lemeshow, 3), "\n")
  cat("Cox and Snell R^2 ", round(R.coxsnell, 3), "\n")
  cat("Nagelkerke R^2 ", round(R.nagelkerke, 3), "\n")
}
# all logistic regression results
log_regression_tests1 <- function(df_vars, df_data) {
  glm_summary <- glm(df_data[, df_vars] ~ df_data[, 1], data = df_data,
                     family = binomial, na.action = "na.omit")
  glm_print <- print(glm_summary)
  log_results <- logistic.regression(glm_summary)
  blr_coefficients <- exp(glm_summary$coefficients)
  blr_confint <- exp(confint(glm_summary))
  list(glm_summary = glm_summary, glm_print = glm_print, log_results = log_results,
       blr_coefficients = blr_coefficients, blr_confint = blr_confint)
}
log_regression_results1 <- sapply(colnames(mtcars_binary[, 2:3]), log_regression_tests1,
                                  mtcars_binary, simplify = FALSE)
log_regression_results1
log_regression_results1
When I do this, the output is being produced as:
glm_summary: vs, am
log_results: vs, am
etc. etc.
When what I would like for the output to be ordered is:
vs: all function outputs
am: all function outputs
In addition, when I run the line log_regression_results1 <- sapply(colnames(mtcars_binary[,2:3]), log_regression_tests1, mtcars_binary, simplify = FALSE), I get only the printed results of the logistic regression function; the remaining output only appears when I print log_regression_results1. Could anyone explain why?
Finally, the glm_summary function is not producing all of the output that it should. When I run the functions independently on a single variable, like so:
glm_vs <- glm(vs ~ cyl, data = mtcars_binary, family = binomial, na.action = "na.omit")
summary(glm_vs)
logistic.regression(glm_vs)
exp(glm_vs$coefficients)
exp(confint(glm_vs))
it also produces the standard error, z value, and p value for summary(glm_vs), which it does not do when embedded in the function, even though I have glm_print <- print(glm_summary) included. Is there a way to get the output of the full summary function within the log_regression_tests1 function?
When I run your code up to log_regression_results1, I get exactly what you asked for:
summary(log_regression_results1)
   Length Class  Mode
vs 5      -none- list
am 5      -none- list
Maybe you meant to ask the other way round?
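If you do need to flip the nesting of such a list (function-first to variable-first, or the reverse), purrr::transpose() does exactly that. A minimal sketch with made-up values:
library(purrr)
nested <- list(glm_summary = list(vs = 1, am = 2),
               log_results = list(vs = 3, am = 4))
flipped <- transpose(nested)
names(flipped) # "vs" "am" - each element now holds one entry per function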
