I'm trying to create a Table 1 for NHANES survey data, first stratified by a binary variable for obese vs non-obese, then stratified again by a binary variable for control/trt group status ("wlp_yn"). I want to get counts (%) for categorical characteristics and means (SE) for continuous baseline characteristics. For these counts and means, I am trying to get p-values as well.
I've tried using tbl_svysummary(), svyby(), tbl_strata(), and CreateTableOne() without any success.
In the code below, I subset the full dataset into a smaller dataset of only control group data ("obese_adults") to divide up the table first. I am also starting out with age for the characteristics ("age_group" is categorical version of "RIDAGEYR" continuous variable). I couldn't figure it out, but I'm curious if there's another way to code this?
add_p_svysummary_ex1 <-
obese_adults %>%
tbl_svysummary(by = wlp_yn, percent = "row", include = c(age_group, RIDAGEYR),
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
add_p()
add_p_svysummary_ex1
svyby(~RIDAGEYR, ~age_group+wlp_yn, obese_adults, svymean) # avg age of each age group
Thanks in advance! Would really appreciate any help.
Edit: This is a simplified version of the code for reproducibility
# DEMO
demo <- nhanes('DEMO')
demo_vars <- names(demo)
demo2 <- nhanesTranslate('DEMO', demo_vars, data = demo)
# PRESCRIPTION MEDICATIONS
rxq_rx <- nhanes('RXQ_RX')
rxq_rx_vars <- names(rxq_rx)
rxq_rx2 <- nhanesTranslate('RXQ_RX', rxq_rx_vars, data = rxq_rx)
rxq_rx2 <- rxq_rx2 %>% select("SEQN", "RXD240B") %>% filter(!is.na(RXD240B)) %>% group_by(SEQN) %>% dplyr::summarise(across(everything(), ~toString(na.omit(.))))
nhanesAnalysis = join_all(list(demo2, rxq_rx2), by = "SEQN", type = "full")
# Reconstructing survey weights for combining 1999-2018 - Combining ten survey cycles (twenty years)
nhanesAnalysis$wtint20yr <- ifelse(nhanesAnalysis$SDDSRVYR %in% c(1,2), (2/10 * nhanesAnalysis$WTINT4YR), # for 1999-2002
(1/10 * nhanesAnalysis$WTINT2YR)) # for 2003-2018
# sample weights
nhanesDesign <- svydesign(id = ~SDMVPSU,
strata = ~SDMVSTRA,
weights = ~wtint20yr,
nest = TRUE,
data = nhanesAnalysis)
# subset
obese_adults <- subset(nhanesDesign, (obesity == 1 & !is.na(BMXBMI) & RIDAGEYR >= 60))
Is this what you are looking for. A double dummy split:
library(gtsummary)
library(tidyverse)
data(mtcars)
mtcars %>%
select(am, cyl, hp, vs) %>%
dplyr::mutate(
vs = factor(vs, labels = c("Obese", "Non-Obese")),
am = factor(am, labels = c("Control", "Treatment")),
cyl = paste(cyl, "Cylinder")
) %>%
tbl_strata(
strata = vs,
~.x %>%
tbl_summary(
by = am,
type = where(is.numeric) ~ "continuous"
) %>%
modify_header(all_stat_cols() ~ "**{level}**")
)
I'm not sure why you like to use tbl_svysummary() here, it's made for survey weights.
Related
I'm trying to create a stratified table using a categorial variable ("DOMINIO") in the argument "strata" and another categorical ("P601A") in the argument "by", then I use three numerical variables ("I601B2", "I601D2" and "I601Z2") in argument "include". I want to use a svydesign object since I want the output to be weighted. Nontheless, it just won't work and I'm not really sure why.
#Reading the data
load(url("https://github.com/cesarpoggi/PRUEBA/blob/main/PRUEBA_STACKOVERFLOW.rda?raw=true"))
#Setting the survey object
dessin2<- svydesign(id = ~1,
data = PANxFAM,
weight = ~FACTOR07)
#Creating stratified table
tbl <- dessin2 %>% tbl_strata(
strata = DOMINIO,
.tbl_fun = ~ .x %>% tbl_svysummary(
by = P601A,
include = c(I601B2, I601D2, I601Z2)))
For some reason it just won't work. I know there's something wrong. I'd be very thankful if someone could tell me what's it and provide a solution.
PD: When I use the tbl_svysummary alone (as in below) it totally works. But I need to do an stratified table with the variable DOMINIO :(
tbl1 = tbl_svysummary(data= dessin2, by= P601A,
include = c(I601B2, I601D2, I601Z2),
statistic = list(all_continuous() ~ "{mean} ({sd})"),
digits = list(all_continuous() ~ c(2, 2)))
tbl1
It looks like the issue is that some combinations of DOMINO and P601A have too few observations to calculate the summary statistics. In the example below, I removed groups with too few observations and it runs without error.
library(gtsummary)
library(survey)
packageVersion("gtsummary")
#> [1] '1.6.2.9001'
#Reading the data
load(url("https://github.com/cesarpoggi/PRUEBA/blob/main/PRUEBA_STACKOVERFLOW.rda?raw=true"))
df <-
PANxFAM[, c("I601B2", "P601A", "DOMINIO", "FACTOR07")] |>
tibble::as_tibble() %>%
dplyr::group_by(DOMINIO, P601A) |>
dplyr::filter(dplyr::n() > 5) |> # remove groups with too few obs
dplyr::ungroup()
#Setting the survey object
dessin2<- svydesign(id = ~1,
data = df,
weight = ~FACTOR07)
#Creating stratified table
tbl <-
dessin2 %>%
tbl_strata(
strata = DOMINIO,
.tbl_fun =
~ .x %>%
tbl_svysummary(
by = P601A,
include = c(I601B2)
)
)
Created on 2022-10-24 with reprex v2.0.2
Labels of variables (using Labelled package) do not carry over when subsetting the survey using (subset of Survey package), and I end up having to manually insert labels into the gtsummary function.
library(dplyr)
library(survey)
library(gtsummary)
library(labelled)
#Reading the CSV file (please download a sample dataframe from link below)
df <- read.csv("nis_2.csv")
#Change to factors
names <- c("htn", "dm", "FEMALE")
df[, names] <- lapply(included_df[, names], factor)
#Changing labels
var_label(df$AGE) <- "Age"
var_label(df$FEMALE) <- "Gender (Female)"
var_label(df$dm) <- "Diabetes"
var_label(df$htn) <- "Hypertension"
#declare survey design
dstr <- svydesign(
id = ~HOSP_NIS,
strata = ~NIS_STRATUM,
weights = ~DISCWT,
nest=TRUE,
survey.lonely.psu = "adjust",
data = df)
#subset the data to include our UGIB cirrhotics
small_set <- subset(dstr, (htn == 1))
summary(small_set)
small_set %>%
tbl_svysummary(
by=dm,
include = c(AGE, FEMALE),
missing = "no",
statistic = all_continuous() ~ "{mean} ({sd})"
) %>%
add_p() %>%
add_overall() %>%
modify_caption("**Table 1. Patient Characteristics**") %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**History of Diabetes**")
Sample database at: https://github.com/Dr-Kaboum/nis_gt_summary/blob/16909872624714d1feb30bd501a6204aba947de7/nis_2.csv
subset removes the labelled attributes, subset your data first, label it, and then pass to gtsummary
#example of the label being removed.
library(labelled)
var_label(mtcars$mpg) <- "Mile per gallon"
mt2 <- subset(mtcars, cyl == 4)
var_label(mt2$mpg) <- "Mile per gallon" #need to relabel
edit using var_label() on subset of survey object
I can't access your data but using an example set, I show that you can relabel the data when you access the variables part of the survey object list. If you label there it will show up in the table.
# A dataset with a complex design
library(gtsummary)
data(api, package = "survey")
labelled::var_label(apiclus1$api99) <- "API 99"
survdat <-
survey::svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
#this subset will remove labels
survdat2 <- subset(survdat, (cname == "Fresno"))
#relabel here after subset within survey object
labelled::var_label(survdat2$variables$api99) <- "API 99"
#make table with label
ex<- tbl_svysummary(data = survdat2,by = "both", include = c(cname, api00, api99, both))
I am using survey data from the National Electronic Injury Surveillance System (https://www.cpsc.gov/Research--Statistics/NEISS-Injury-Data) to research trends in consumer product injuries.
Using gtsummary and tbl_svysummary(), my goal is to create a descriptive table of summary measures of injuries. Since this is survey data, I want to display the 95% confidence interval associated with each summary measure.
This previous post provides a solution to generating confidence intervals for two level factor variables (Using (gtsummary) tbl_svysummaary function to display confidence intervals for survey.design object?), however, I am looking for a solution to produce confidence intervals for factor variables with >=2 levels.
I am borrowing a reproducible example from the previous post:
library(gtsummary)
library(survey)
svy_trial <-
svydesign(~1, data = trial %>% select(trt, response, death), weights = ~1)
ci <- function(variable, by, data, ...) {
svyby(as.formula( paste0( "~" , variable)) , by = as.formula( paste0( "~" , by)), data, svyciprop, vartype="ci") %>%
tibble::as_tibble() %>%
dplyr::mutate_at(vars(ci_l, ci_u), ~style_number(., scale = 100) %>% paste0("%")) %>%
dplyr::mutate(ci = stringr::str_glue("{ci_l}, {ci_u}")) %>%
dplyr::select(all_of(c(by, "ci"))) %>%
tidyr::pivot_wider(names_from = all_of(by), values_from = ci) %>%
set_names(paste0("add_stat_", seq_len(ncol(.))))
}
ci("response", "trt", svy_trial)
#> # A tibble: 1 x 2
#> add_stat_1 add_stat_2
#> <glue> <glue>
#> 1 21%, 40% 25%, 44%
svy_trial %>%
tbl_svysummary(by = "trt", missing = "no") %>%
add_stat(everything() ~ "ci") %>%
modify_table_body(
dplyr::relocate, add_stat_1, .after = stat_1
) %>%
modify_header(starts_with("add_stat_") ~ "**95% CI**") %>%
modify_footnote(everything() ~ NA)
Table screenshot from previous post 1
In the above example, the factor variables have two levels and summary data from 1 level is shown.
How can I tweak the above approach so that both levels of factor variables are displayed with their respective confidence intervals?
How can this solution be generalized to factor variables with >2 levels (e.g., an age variable binned as follows: <18 years, 18-25 years, 26-50 years, etc)?
Lastly, how could this desired solution also accommodate generating confidence intervals for continuous variables in the same column as the confidence intervals for factor variables?
Here is an example of the table I am trying to produce:
Screenshot of desired table output2
Apologies if this request for help doesn't follow good stack overflow etiquette (I'm fairly new to this community) and your time and assistance is much appreciated!
I have a prepared example for factors with >=2 levels, but not with a by= variable (although the approach is similar). FYI, we have an open issue to support survey objects more thoroughly with a new function add_ci.tbl_svysummary() that will calculate CIs for both categorical and continuous variables. You can click the "subscribe" link here to be alerted when this feature is implemented https://github.com/ddsjoberg/gtsummary/issues/965
In the meantime, here is a code example:
library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.5.0'
svy <- survey::svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq)
# put the CI in a tibble with the variable name
# first create a data frame with each variable and it's values
df_result <-
tibble(variable = c("Class", "Sex", "Age", "Survived")) %>%
# get the levels of each variable in a new column
# adding them as a list to allow for different variable classes
rowwise() %>%
mutate(
# level to be used to construct call
level = unique(svy$variables[[variable]]) %>% as.list() %>% list(),
# character version to be merged into table
label = unique(svy$variables[[variable]]) %>% as.character() %>% as.list() %>% list()
) %>%
unnest(c(level, label)) %>%
mutate(
label = unlist(label)
)
# construct call to svyciprop
df_result$svyciprop <-
map2(
df_result$variable, df_result$label,
function(variable, level) rlang::inject(survey::svyciprop(~I(!!rlang::sym(variable) == !!level), svy))
)
# round/format the 95% CI
df_result <-
df_result %>%
rowwise() %>%
mutate(
ci =
svyciprop %>%
attr("ci") %>%
style_sigfig(scale = 100) %>%
paste0("%", collapse = ", ")
) %>%
ungroup() %>%
# keep variables needed in tbl
select(variable, label, ci)
# construct gtsummary table with CI
tbl <-
svy %>%
tbl_svysummary() %>%
# merge in CI
modify_table_body(
~.x %>%
left_join(
df_result,
by = c("variable", "label")
)
) %>%
# add a header
modify_header(ci = "**95% CI**")
Created on 2021-12-04 by the reprex package (v2.0.1)
I am trying to summarize three continuous variable by one categorical variable.
Here is some dummy data :
test <-
data.frame(
score_1= sample(c("low","medium","high"),50, replace = T),
land=rnorm(50,5,1),
water=rnorm(50,300,1),
fire=rnorm(50,3,1)
)
I can easily stratify the data by tertile:
table<- test %>%
tbl_summary(
by=score_1,
statistic = all_continuous()~ "{mean} ({sd})"
) %>%
print()
Which will make this table:
However I need to transpose this table: the continuous variables need to be the columns.
The reason for that is that I actually have two more score to add, so data is actually looks like this:
test2 <-
data.frame(
score_1= sample(c("low","medium","high"),50, replace = T),
score_2= sample(c("low","medium","high"),50, replace = T),
score_3= sample(c("low","medium","high"),50, replace = T),
land=rnorm(50,5,1),
water=rnorm(50,300,1),
fire=rnorm(50,3,1)
)
I thought of creating three tables, one for each score (with the continuous variables as columns), and then merging the three using tbl_stack. But I don't know how to make the first table (and if that even possible with gtsummary).
Hope that makes sense.
In the next release of gtsummary (v.1.5.0) the package will have a function designed to create tables just like the one you're requesting. While that new function is being vetted, you can use a similar (but not as easy to use) function from the bstfun package (on GitHub). bstfun is a package where some gtsummary functions are born, and when they mature they are migrated to gtsummary. Example code below!
# remotes::install_github("ddsjoberg/bstfun")
library(gtsummary)
test <-
data.frame(
score_1= sample(c("low","medium","high"),50, replace = T),
land=rnorm(50,5,1),
water=rnorm(50,300,1),
fire=rnorm(50,3,1),
all_one = 1L
)
df_pvalues <-
c("land", "water", "fire") %>%
purrr::imap_dfc(
~aov(
formula = glue::glue("{.x} ~ score_1") %>% as.formula(),
data = test
) %>%
broom::tidy() %>%
dplyr::slice(1) %>%
dplyr::select(p.value) %>%
dplyr::mutate_all(style_pvalue) %>%
setNames(glue::glue("stat_1_{.y}"))
) %>%
mutate(label = "ANOVA p-value")
df_pvalues
#> # A tibble: 1 x 4
#> stat_1_1 stat_1_2 stat_1_3 label
#> <chr> <chr> <chr> <chr>
#> 1 0.12 0.6 >0.9 ANOVA p-value
tbl <-
c("land", "water", "fire") %>%
purrr::map(
~test %>%
bstfun::tbl_2way_summary(score_1, all_one, con = all_of(.x),
statistic = "{mean} ({sd})") %>%
modify_header(all_stat_cols() ~ paste0("**", .x, "**"))
) %>%
tbl_merge() %>%
modify_spanning_header(everything() ~ NA) %>%
modify_table_body(
~.x %>%
dplyr::bind_rows(df_pvalues)
)
Created on 2021-08-25 by the reprex package (v2.0.1)
I need to fit many loess splines by the grouping variable (Animal) across multiple numeric columns (Var1, Var2), and extract these values.
I found code to do this task one variable at a time;
# Create dataframe 1
OneVarDF <- data.frame(Day = c(replicate(1,sample(1:50,200,rep=TRUE))),
Animal = c(c(replicate(100,"Greyhound"), c(replicate(100,"Horse")))),
Var1 = c(c(replicate(1,sample(2:10,100,rep=TRUE))), c(replicate(1,sample(15:20,100,rep=TRUE)))))
library(dplyr)
library(tidyr)
library(purrr)
# Get fitted values from each model
Models <- OneVarDF %>%
tidyr::nest(-Animal) %>%
dplyr::mutate(m = purrr::map(data, loess, formula = Var1 ~ Day, span = 0.30),
fitted = purrr::map(m, `[[`, "fitted")
)
# Create prediction column
Results <- Models %>%
dplyr::select(-m) %>%
tidyr::unnest()
This "Results" dataframe is essential for downstream tasks (detrending many non-parametric distributions).
How can we achieve this with a dataframe with multiple numeric columns (code below), and extract a "Results" dataframe? Thank you.
# Create dataframe 2
TwoVarDF <- data.frame(Day = c(replicate(1,sample(1:50,200,rep=TRUE))),
Animal = c(c(replicate(100,"Greyhound"), c(replicate(100,"Horse")))),
Var1 = c(c(replicate(1,sample(2:10,100,rep=TRUE))), c(replicate(1,sample(15:20,100,rep=TRUE)))),
Var2 = c(c(replicate(1,sample(22:27,100,rep=TRUE))), c(replicate(1,sample(29:35,100,rep=TRUE)))))
We can get the data in long format using. pivot_longer, group_by Animal and column name and apply loess to each combinaton.
library(dplyr)
library(tidyr)
TwoVarDF %>%
pivot_longer(cols = starts_with('Var')) %>%
group_by(Animal, name) %>%
mutate(model = loess(value~Day, span = 0.3)$fitted)
Include a gather() function to proceed as similar to your previous code.
Models2 <- TwoVarDF %>%
gather(varName, varVal, 3:4) %>%
tidyr::nest(-Animal, -varName) %>%
dplyr::mutate(m = purrr::map(data, loess, formula = varVal ~ Day, span = 0.30),
fitted = purrr::map(m, `[[`, "fitted")
)