R gtsummary package: How to Manipulate / Hide Rows in Summary Table

I am working on a project with gtsummary. For one of the tables, I have to build a long table listing covariables before and after the matchit process.
My issue is that every covariable takes up three rows. For Obesity, for example, there is one row for the label, Obesity, then a row for Obese, and then a row for Not Obese. That is three rows where I only want to show one, e.g. Diabetes N (%).
I have tried editing dichotomous variables, introducing NULL, and looking for a row_hide function, but to no avail.
Here is my code:
# Creation of trial
trialCAS1 <- index_CAS %>%
  select(TopDecile, Gender, Obesity, Diabetes, Diabetes_Complex, etc)
# Tbl summary
CAStable1 <- tbl_summary(trialCAS1,
                         by = TopDecile,
                         missing = "no") %>%
  add_n() %>%
  modify_header(label = "**Variable**") %>%
  bold_labels()
I included the first table I get.

The tbl_summary() function tries its best to guess the type of data passed (categorical, dichotomous, and continuous). It doesn't always guess what we'd like to see, but the default can always be changed using arguments in tbl_summary()! I'll use the trial data set in the {gtsummary} package as an example.
Here is the default output:
library(gtsummary)
trial %>%
  select(trt, grade, stage) %>%
  tbl_summary(by = trt)
By default, the summary statistics for grade and stage are shown on multiple rows. Imagine, however, we are only interested in the rate of Grade I disease and the rate of Stage T1 cancer. We can use the tbl_summary(value=) argument to specify these are the only values we want displayed (which will then default to print these as dichotomous variables). In the example below, I have also updated the label displayed to indicate these are Grade I and Stage T1 rates only.
trial %>%
  select(trt, grade, stage) %>%
  tbl_summary(
    by = trt,
    value = list(grade ~ "I",
                 stage ~ "T1"),
    label = list(grade ~ "Grade I",
                 stage ~ "Stage T1")
  )
Based on what I see from your code and output, I think this code will work for you on your data set:
tbl_summary(
  trialCAS1,
  by = TopDecile,
  missing = "no",
  value = Obesity ~ "Obese",
  label = Obesity ~ "Obese"
)
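To make concrete what that single dichotomous row reports, here is a minimal base R sketch that does not depend on gtsummary. The `toy` data frame and its values are made up for illustration; only the column names TopDecile and Obesity come from the question.

```r
# Hypothetical toy data; only the column names mirror the question
toy <- data.frame(
  TopDecile = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Obesity   = c("Obese", "Not Obese", "Not Obese", "Obese",
                "Obese", "Obese", "Not Obese", "Obese")
)

# Count the level of interest and the group size within each `by` group
is_obese <- toy$Obesity == "Obese"
n_obese  <- tapply(is_obese, toy$TopDecile, sum)
n_total  <- tapply(is_obese, toy$TopDecile, length)

# Format as "n (%)", which is the content of the one remaining row
obese_row <- sprintf("%d (%.0f%%)",
                     as.integer(n_obese),
                     100 * as.vector(n_obese) / as.vector(n_total))
names(obese_row) <- names(n_obese)
obese_row
```

The result has one entry per TopDecile group, which is the kind of content a single "Obese" row carries once the other level is hidden.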

Related

lm function is giving a warning that it is dropping rows

This is my question
Do the developers that make more games charge higher prices?
my code:
dev_data <- steam_data_final %>%
  group_by(developer) %>%
  summarize(num_dev = n(), avg_price = mean(price, na.rm = TRUE)) %>%
  arrange(desc(num_dev))
dev_data
but this model isn't working, getting Warning: Dropping 3038 rows with missing values
mod_dev <- lm(num_dev ~ avg_price, data = dev_data)
Check if you have any NA using summary() or is.na() for each column. If you do have any NA, then it is the reason why the lm() gives you the warning message.
Also, it seems like you need lm(avg_price ~ num_dev, data = dev_data) instead of lm(num_dev ~ avg_price, data = dev_data). Since you are asking whether price depends on the number of games, the dependent variable should be avg_price, not num_dev. (This depends on your research question.)
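As a rough illustration of both points, the sketch below uses a small made-up data frame (`steam_toy` is hypothetical, standing in for steam_data_final) and base R only. One possible source of missing values in avg_price is a developer whose prices are all NA: mean(..., na.rm = TRUE) then returns NaN, which is.na() treats as missing.

```r
# Hypothetical stand-in for steam_data_final: one row per game
steam_toy <- data.frame(
  developer = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"),
  price     = c(10, 12, NA, 5, 7, NA, 20, 18, 22, 19)
)

# Base R equivalent of the group_by() + summarize() step
num_dev   <- tapply(steam_toy$price, steam_toy$developer, length)
avg_price <- tapply(steam_toy$price, steam_toy$developer, mean, na.rm = TRUE)
dev_toy <- data.frame(
  developer = names(num_dev),
  num_dev   = as.vector(num_dev),
  avg_price = as.vector(avg_price)
)

# Developer "C" has only missing prices, so its mean is NaN (counted as missing)
na_counts <- colSums(is.na(dev_toy))
na_counts

# avg_price as the response; with the default na.action,
# the model is fitted on complete rows only
mod_toy <- lm(avg_price ~ num_dev, data = dev_toy)
nobs(mod_toy)
```

Comparing nobs() with nrow() shows how many rows were left out of the fit.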

Tidymodels. step_impute_linear(), can it be used when every column contains NAs

My data contain >100 columns and every one of them contains NAs, and when I try to use step_impute_linear() it returns a warning:
Warning message:
There were missing values in the predictor(s) used to impute;
imputation did not occur.
Can I somehow make it work?
I think you'll need to use at least two steps of imputation.
First you will need to choose some variables to impute with something very simple, like the median or mode. I would choose the variables with lower rates of missingness for this.
Next you can choose some variables to impute with linear models, using only complete variables (the ones you imputed first with, say, the median). I would choose variables with higher rates of missingness for this, I think.
Here is an example analysis where I took this approach:
bb_rec <-
  recipe(is_home_run ~ launch_angle + launch_speed + plate_x + plate_z +
           bb_type + bearing + pitch_mph +
           is_pitcher_lefty + is_batter_lefty +
           inning + balls + strikes + game_date,
         data = bb_train
  ) %>%
  step_date(game_date, features = c("week"), keep_original_cols = FALSE) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_impute_median(all_numeric_predictors(), -launch_angle, -launch_speed) %>%
  step_impute_linear(launch_angle, launch_speed,
                     impute_with = imp_vars(plate_x, plate_z, pitch_mph)
  ) %>%
  step_nzv(all_predictors())
If you want to try out different imputation strategies, I suggest setting up workflowsets and testing them on resampling folds.
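The two-step idea can also be sketched in base R, outside of recipes, to show why the order matters. Everything here is simulated and the column names x1, x2 and y are hypothetical: x1 and x2 have little missingness and get a median fill, after which they are complete and can serve as predictors in a linear model that imputes the heavily missing y.

```r
set.seed(123)
n <- 200

# Simulated data: y depends linearly on x1 and x2
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5)

# Every column gets some NAs; y has the highest rate of missingness
toy$x1[sample(n, 10)] <- NA
toy$x2[sample(n, 10)] <- NA
toy$y[sample(n, 60)]  <- NA

# Step 1: simple median imputation for the low-missingness columns
for (v in c("x1", "x2")) {
  toy[[v]][is.na(toy[[v]])] <- median(toy[[v]], na.rm = TRUE)
}

# Step 2: linear imputation for y, using only the now-complete predictors
miss <- is.na(toy$y)
fit  <- lm(y ~ x1 + x2, data = toy)   # fitted on rows where y is observed
toy$y[miss] <- predict(fit, newdata = toy[miss, ])

anyNA(toy)
```

Skipping step 1 would leave NAs in the predictors, which is the situation the step_impute_linear() warning describes.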

Treatment effect table in R with horizontally-oriented variables

I subset a data frame to keep only my 4 columns of interest. I want to count the number of control (0) and treated (1) observations. I computed something with the gtsummary package, but the variables are vertically oriented (like here http://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html), one below the other, and this is not what I want. I searched on Google, but all the tables I saw have this orientation too.
I put here a picture of what I'd like to obtain, if any of you have an idea!
Code I use to obtain my initial table (same as in the link):
install.packages("gtsummary")
library(gtsummary)
trial <- finaldf %>% select(treatment, `2digID`, `4digID`, classificationsdescription)  # names starting with a digit need backticks
trial %>% tbl_summary()
t2 <- trial %>% tbl_summary(by = treatment)
I cannot post the real data, but I created an example that looks like my data:
# R object names cannot start with "_" or a digit, so the ID columns are named ID2 and ID4 here
ID2 <- c(38,38,38,38,38,38,38,38,38,38,80,80,80,80,80,80,80,80,80,80)
ID4 <- c(3837,3837,3837,3812,3812,3896,3894,3894,3877,3877, 8099,8099,8027,8027,8027,8033,8033,8064,8064,8022)
descriptions <- c('ILL1','ILL1','ILL1', 'ILL2','ILL2','ILL3','ILL4','ILL4','ILL5','ILL5','ILL1','ILL1','ILL2','ILL2','ILL2','ILL3','ILL3','ILL4','ILL4','ILL5')
trt <- c(0,0,0,1,1,1,0,0,1,1,0,0,1,1,1,0,0,1,1,0)
df.data <- data.frame(ID2, ID4, descriptions, trt)
UPDATE - SOLVED
I think I managed to solve this problem, even if my output is a data frame and not a "publication-ready" table:
install.packages("reshape2")
library(reshape2)
data_wide <- dcast(df, `_2digID` + `_4digID` + descriptions ~ treatment, value.var = "counts")
But I'm not sure yet that this gives the right numbers.
The example below gets you close, but not exactly what you're after. I like the idea of being able to support tables like this, and I'll add it to the list of features to implement!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.4.1'
tbl <-
  trial %>%
  mutate(
    grade = paste("Grade", as.character(grade)),
    stage = paste("Stage", as.character(stage))
  ) %>%
  tbl_strata(
    strata = c(stage, grade),
    ~ .x %>%
      tbl_summary(by = trt,
                  include = response,
                  type = response ~ "categorical",
                  missing = "no",
                  statistic = response ~ "{n}") %>%
      modify_header(all_stat_cols() ~ "**{level}**"),
    .combine_with = "tbl_stack"
  ) %>%
  as_flex_table()
Table truncated because it was very long!
Created on 2021-07-14 by the reprex package (v2.0.0)
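If only the counts are needed, a plain cross-tabulation in base R already gives the horizontal layout, with one row per ID and one column per treatment arm. This is a minimal sketch using the example values from the question; the names `id2`, `id4`, `df_toy`, and the column labels "control" and "treated" are assumptions chosen here for illustration.

```r
# Example values from the question, with syntactic column names
id2 <- rep(c(38, 80), each = 10)
id4 <- c(3837, 3837, 3837, 3812, 3812, 3896, 3894, 3894, 3877, 3877,
         8099, 8099, 8027, 8027, 8027, 8033, 8033, 8064, 8064, 8022)
descriptions <- c("ILL1", "ILL1", "ILL1", "ILL2", "ILL2", "ILL3", "ILL4",
                  "ILL4", "ILL5", "ILL5", "ILL1", "ILL1", "ILL2", "ILL2",
                  "ILL2", "ILL3", "ILL3", "ILL4", "ILL4", "ILL5")
trt <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0)
df_toy <- data.frame(id2, id4, descriptions, trt)

# Cross-tabulate the 4-digit ID against treatment, then turn the
# table into a wide data frame (rows = IDs, columns = arms)
wide <- as.data.frame.matrix(table(id4 = df_toy$id4, trt = df_toy$trt))
names(wide) <- c("control", "treated")
wide
```

Because table() counts rows directly, no precomputed counts column is required, which makes it a convenient check on the numbers from a reshaping approach.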

How do I set which level is the "event" in my outcome variable using tidymodels?

I am using tidymodels for machine learning and I want to predict a binary response/outcome. How do I specify which level of the outcome is the "event" or positive case?
Does this happen in the recipe, or somewhere else?
##split the data
anxiety_split <- initial_split(anxiety_df, strata = anxiety)
anxiety_train <- training(anxiety_split)
anxiety_test <- testing(anxiety_split)
set.seed(1222)
anxiety_cv <- vfold_cv(anxiety_train, strata = anxiety)
anxiety_rec <- recipe(anxiety ~ ., data = anxiety_train, positive = 'pos') %>%
  step_corr(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric())
You don't need to set which level of your outcome variable is the "event" until it is time to evaluate your model. You can do this using the event_level argument of most yardstick functions. For example, check out how to do this for yardstick::roc_curve():
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
library(tidyverse)
data(two_class_example)
## looks good!
two_class_example %>%
  roc_curve(truth, Class1, event_level = "first") %>%
  autoplot()
## YIKES!! we got this backwards
two_class_example %>%
  roc_curve(truth, Class1, event_level = "second") %>%
  autoplot()
Created on 2020-08-02 by the reprex package (v0.3.0.9001)
Notice the message on startup for yardstick; the first factor level is assumed to be the event. This is similar to how base R acts. You only need to worry about event_level if your "event" is not the first factor level.
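To check which level counts as "first" for a given outcome, it can help to inspect the factor directly. The sketch below uses a made-up `anxiety` vector with levels "neg" and "pos" (an assumption based on the `'pos'` label in the question) and base R only. Reordering the levels with relevel() is one alternative to passing event_level = "second".

```r
# Hypothetical outcome; factor() orders levels alphabetically by default
anxiety <- factor(c("neg", "pos", "pos", "neg", "pos"))
levels(anxiety)

# Option: make "pos" the first level so the default event level applies
anxiety_pos_first <- relevel(anxiety, ref = "pos")
levels(anxiety_pos_first)

# The observations themselves are unchanged; only the level order differs
table(anxiety_pos_first)
```

Either approach works; the important part is that the level order and the event_level used at evaluation time agree.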

Error in 'add_p()' for variable X and test 'fisher.test', p-value omitted

I get the error below when I try to use the add_p() function to get a p-value for differences between my by variable (with 10 levels) and a categorical variable with two levels (yes/no). I am not sure how to provide a reproducible example. From the trials data, I imagine my by variable would be the "T Stage" variable with 10 levels, and the categorical variables would be: (1) "Chemotherapy Treatment" with 2 levels, and (2) "Chemotherapy Treatment2" with 4 levels. But here is the code I ran.
library(gtsummary)
library(tidyverse)
library(kableExtra)  # provides kable_styling()
miro_def %>%
  select(mheim, age_dx, time_t1d_yrs, gender, collard, fhist_pandz) %>%
  tbl_summary(by = mheim, missing = "no",
              type = list(c(gender, collard, fhist_pandz, mheim) ~ "categorical"),
              label = list(gender ~ "Gender",
                           fhist_pandz ~ "Family history of PD",
                           age_dx ~ "Age at diagnosis",
                           time_t1d_yrs ~ "Follow-up(years)")) %>%
  add_p() %>%
  # style the output with custom header
  # modify_header(stat_by = "{level}") %>%
  # convert to kableExtra
  as_kable_extra(booktabs = TRUE) %>%
  # reduce font size to make table fit.
  # you may also use the `latex_options = "scale_down"` argument here.
  kable_styling(font_size = 7, latex_options = "scale_down")
However, I do get a p-value for this by variable (10 levels) with the other variables, which are continuous/numeric.
How can I fix this error?
In the case where I have the mentioned multilevel by variable and a multilevel (>2 levels) categorical variable, is there something special I should do to get a p-value?
There was an error in 'add_p()' for variable 'gender' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18540 is too small for this problem,
(pastp=51.2364, ipn_0:=ipoin[itp=150]=215, stp[ipn_0]=40.6787).
Increase workspace or consider using 'simulate.p.value=TRUE'
There was an error in 'add_p()' for variable 'collard' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18570 is too small for this problem,
(pastp=37.0199, ipn_0:=ipoin[itp=211]=823, stp[ipn_0]=23.0304).
Increase workspace or consider using 'simulate.p.value=TRUE'
There was an error in 'add_p()' for variable 'fhist_pandz' and test 'fisher.test', p-value omitted:
Error in stats::fisher.test(data[[variable]], as.factor(data[[by]])): FEXACT error 7(location). LDSTP=18570 is too small for this problem,
(pastp=36.4614, ipn_0:=ipoin[itp=58]=1, stp[ipn_0]=31.8106).
Increase workspace or consider using 'simulate.p.value=TRUE'
Since nobody posted an answer, here's what I used when coming across this. Following the examples given in the help file ?gtsummary::add_p.tbl_summary, I wrote a custom function that runs fisher.test with the simulate.p.value = TRUE option:
## define custom test
fisher.test.simulate.p.values <- function(data, variable, by, ...) {
  result <- list()
  test_results <- stats::fisher.test(data[[variable]], data[[by]], simulate.p.value = TRUE)
  result$p <- test_results$p.value
  result$test <- test_results$method
  result
}
## add p-values to your gtsummary table, using custom test defined above
summary_table %>%
  add_p(
    # this applies the custom test to all categorical variables
    test = list(all_categorical() ~ "fisher.test.simulate.p.values")
  )
You can also change the number of replicates used to compute the simulated p-values by passing a different B (the default is B = 2000) to fisher.test() above.
All this assumes, of course, that it's appropriate to use Fisher's test in the first place.
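To see the simulated p-value on its own, outside of gtsummary, here is a minimal sketch with made-up data: a two-level variable crossed with a ten-level grouping variable, similar in shape to the tables in the question. The names `by_var` and `gender` and the choice of B = 10000 are assumptions for illustration.

```r
set.seed(42)

# Hypothetical data: 10-level grouping variable and a 2-level categorical variable
by_var <- factor(rep(paste0("T", 1:10), each = 30))
gender <- factor(sample(c("Female", "Male"), size = 300, replace = TRUE))
tab <- table(gender, by_var)

# Monte Carlo p-value instead of the exact computation;
# B is the number of simulated tables (the default is 2000)
res <- stats::fisher.test(tab, simulate.p.value = TRUE, B = 10000)
res$p.value
res$method
```

The method string records that the p-value was simulated and how many replicates were used, which is worth carrying into the table footnote.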
Since it fixed the issue for me, I would like to indicate that since version 1.3.6 of gtsummary there is an option in add_p() with which you can specify arguments to the test functions (i.e. test.args). Thank you to the developers for this!
From the NEWS:
Each add_p() method now has the test.args = argument. Use this argument to pass
additional arguments to the statistical method, e.g.
add_p(test = c(age, marker) ~ "t.test",
test.args = c(age, marker) ~ list(var.equal = TRUE))
It is also explained in the add_p() help (i.e. ?add_p).
I had a similar problem. You have to increase your workspace with test.args within add_p().
miro_def %>%
  select(mheim, age_dx, time_t1d_yrs, gender, collard, fhist_pandz) %>%
  tbl_summary(by = mheim, missing = "no",
              type = list(c(gender, collard, fhist_pandz, mheim) ~ "categorical"),
              label = list(gender ~ "Gender",
                           fhist_pandz ~ "Family history of PD",
                           age_dx ~ "Age at diagnosis",
                           time_t1d_yrs ~ "Follow-up(years)")) %>%
  add_p(test.args = variable_with_no_pval ~ list(workspace = 2e9))
or, as the last step of the same pipeline:
  add_p(test.args = all_test("fisher.test") ~ list(workspace = 2e9))
