Remove p-value column of gtsummary linear regression table - r

I am using gtsummary to summarise my linear regression results. I am trying to omit out the p-value for each sex (column.
Any support on this would be greatly appreciated. I have included dummy data to reproduce what I am trying to do, as well as an image of my linear reg table as it stand.
# install dev versions
remotes::install_github("ddsjoberg/gtsummary#mice_nnet")
remotes::install_github("larmarange/broom.helpers")
# load packages
library(gtsummary)
library(nnet)
theme_gtsummary_compact()
# dummy data
crime <-data.frame(city = sample(c("SF", "AR", "NYC","MN"),13000,replace = TRUE),
sex = sample(c("Male", "Female"),13000,replace = TRUE),
year = sample(as.numeric(sample(10:20, 13000, replace = TRUE)))
)
# serperate data sets by sex
crime_f <- crime %>%
filter(sex == "Female")
crime_m <- crime %>%
filter(sex == "Male")
# build model for females
mod_f <- lm(year ~ city, data = crime_f) %>%
tbl_regression(exponentiate = TRUE) %>%
modify_header(estimate ~ "**OR**")
# build model for males
mod_m <- lm(year ~ city, data = crime_m) %>%
tbl_regression(exponentiate = TRUE) %>%
modify_header(estimate ~ "**OR**")
# lm model tabulated with gtsummary
tbl <- tbl_merge(
tbls = list(mod_f, mod_m),
tab_spanner = c("**Female**", "**Male**")
)
tbl # check table

With the modify_table_header() function you can chose to hide columns in your output, including p-values:
tbl %>%
modify_table_header(
column = c(p.value_1, p.value_2),
hide = TRUE
)
Good luck!

Related

I need assistance with a nested for-loop function which outputs multiple tables

I'm using the diamonds data set as hypothetical data.
I'm trying to compare the performance of two separate models in predicting a binary variable via misclassification error. The binary variable is decided upon arbitrary thresholds, which would be set at the beginning of the code (prior to making the models and both models are run with different thresholds each time). I want to do a for loop so that it sets different thresholds each time the loop runs. My goal is to obtain tables for the misclassification error for each threshold. Please bear with my inelegant codes. Any help is so greatly appreciated.
library(tidyverse)
library(rpart)
library(rpart.plot)
data = diamonds
threshold = c(0.20, 0.70, 2.00)
for (i in 1:length(threshold)) {
data_filtered = data %>%
select(-c(cut, color, clarity)) %>%
mutate(carat_dummy = ifelse(carat > threshold, 1, 0)) %>%
drop_na()
set.seed(14643)
data_filtered = data_filtered %>%
mutate(train_test = sample(c("train", "test"), n(), replace = TRUE, prob = c(0.75, 0.25)))
data_train = data_filtered %>%
filter(train_test == "train")
data_test = data_filtered %>%
filter(train_test == "test")
#regression
reg = lm(carat_dummy ~ depth + table + price, data = data_train)
#decision Trees
data_train$carat_dummy = factor(data_train$carat_dummy)
trees = rpart(carat_dummy ~ depth + table + price, data = data_train)
tree_predict = predict(trees, newdata = data_test, type = "class")
#Linear regression error
predict = predict(reg, data_test, type="response")
prediction_reg = data_test %>%
mutate(prediction = ifelse(predict >= 0.5,1,0)) %>%
select(carat_dummy, prediction)
error_reg = prediction_reg %>%
summarise(misclassification = mean(carat_dummy != prediction)) %>%
add_column("Method" = "Linear Regression") %>%
select(Method, everything())
#Decision tree error
prediction_tree = data_test %>%
mutate(prediction = as.numeric(as.character(tree_predict))) %>%
select(carat_dummy, prediction)
error_tree = prediction_tree %>%
summarise(misclassification = mean(carat_dummy != prediction)) %>%
add_column("Method" = "Decision Trees") %>%
select(Method, everything())
errors_t = rbind(error_reg, error_tree)
}

Multivariate time series - is there notation to select all the variables, or do they all have to be written out?

I'm working to build a multivariate time series to make predictions about labor in the United States. The fpp3 package is excellent, but I don't see a notation to model all the variables.
For example, in linear regression, it's possible to do this:
library(tidyverse)
mtcars.lm <- lm(mpg ~ ., data = mtcars)
summary(mtcars.lm)
to model mpg on all the remaining variables, without having to write all the variables out explicity. Is there something similar in time series using the fpp3 package?
For example, this returns an error:
library(tidyverse)
library(fpp3)
library(clock)
# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0000000001
All_Employees <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/All_Employees.csv', col_select = c(Label, Value), show_col_types = FALSE)
All_Employees <- All_Employees %>%
rename(Month = Label, Total_Employees = Value)
All_Employees <- All_Employees %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month) %>%
mutate(Total_Employees_Diff = difference(Total_Employees))
index = All_Employees$Month
All_Employees <- All_Employees %>%
filter((Month >= start_month), (Month <= end_month))
# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0500000003
Average_Hourly_Earnings <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/Average_Hourly_Earnings.csv', col_select = c(Label, Value), show_col_types = FALSE)
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
rename(Month = Label, Avg_Hourly_Earnings = Value)
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month) %>%
mutate(Avg_Hourly_Earnings_Diff = difference(Avg_Hourly_Earnings))
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
filter((Month >= start_month), (Month <= end_month))
Monthly_labor_data_small <-
tsibble(
Month = All_Employees$Month,
index = Month,
'Total_Employees' = All_Employees$Total_Employees,
'Avg_Earnings' = Average_Hourly_Earnings$Avg_Hourly_Earnings
)
start_month_small = yearmonth("2020 Mar")
end_month_small = yearmonth("2022 Jan")
Monthly_labor_data_small <- Monthly_labor_data_small %>%
filter((Month >= start_month_small), (Month <= end_month_small))
Monthly_labor_data_small %>%
model(
linear = TSLM(Total_Employees ~ .,))
The error is: Error in TSLM(Total_Employees ~ ., ) : unused argument (alist())
But this runs fine if I list everything out:
fit <- Monthly_labor_data_small %>%
model(
linear = TSLM(Total_Employees ~ Avg_Earnings + season() + trend()))
report(fit)
The full tsibble will have a large number of columns, is there a short way to list all of them, similar to what can be done in linear regression?
You should be able to do something like
resp <- "Total_Employees"
form <- reformulate(response = resp,
c(setdiff(names(Monthly_labor_data_small), resp),
"season()", "trend()"))
And then use form in your model. I haven't tried your examples -- if there are other variables (like a time index) that should not be explicitly included in the model then the second argument to setdiff() should be c(resp, "excluded_var2", "excluded_var3")

How to deal with a column with only one value?

How to add a step to remove a column with constant value?
I am facing a related problem so referencing the previous article above. I used step_zv() in my recipe but I still get the following error- Error in bake(), Only one factor in Column 'X33': "TRUE"
library(tidymodels)
library(readr)
library(broom.mixed)
library(dotwhisker)
library(skimr)
library(rpart.plot)
library(vip)
library(glmnet)
library(naniar)
library(tidyr)
library(dplyr)
library(textrecipes)
# Data cleaning
skool <-
read_csv("/Users/riddhimaagupta/Desktop/log1.csv")
skool_v1 <-
select (skool, -c(...1, id, npsn, public, cert_est, cert_ops, name_clean, name, muh1, muh2, muh, chr1, chr2, chr3, chr, hindu, nu1, nu2, nu_klaten, nu_sby, nu, it1, it, other_swas_international))
skool_v2 <-
filter(skool_v1, afiliasi != 99)
skool_v2.1 <- replace_with_na(skool_v2,
replace = list(village = c("-")))
skool_v2.2 <- replace_with_na(skool_v2.1,
replace = list(area = c("0")))
skool_v2.3 <- replace_with_na(skool_v2.2,
replace = list(date_est = c("-")))
skool_v2.3$date_est <- as.Date(skool_v2.3$date_est, format = '%Y-%m-%d')
skool_v2.3$date_ops <- as.Date(skool_v2.3$date_ops, format = '%Y-%m-%d')
skool_v2.3$latlon <- gsub(".*\\[", "", skool_v2.3$latlon)
skool_v2.3$latlon <- gsub("\\].*", "", skool_v2.3$latlon)
skool_v2.4 <- skool_v2.3 %>%
separate(latlon, c("latitude", "longitude"), ",")
skool_v2.4$latitude <- as.numeric(skool_v2.4$latitude)
skool_v2.4$longitude <- as.numeric(skool_v2.4$longitude)
skool_v3 <- skool_v2.4 %>%
mutate_if(is.character, tolower) %>%
mutate_if(is.character, as.factor)
skool_v4 <- skool_v3 %>%
mutate_if(is.logical, as.factor)
skool_v4$afiliasi <- as.factor(skool_v4$afiliasi)
glimpse(skool_v4)
# Data splitting
set.seed(123)
splits <- initial_split(skool_v4 , strata = afiliasi)
school_train <- training(splits)
school_test <- testing(splits)
set.seed(234)
val_set <- validation_split(skool_v4,
strata = afiliasi,
prop = 0.80)
# Penalised multinomial regression
lr_mod <-
logistic_reg(penalty = tune(), mixture = 0.5) %>%
set_engine("glmnet")
lr_recipe <-
recipe(afiliasi ~ ., data = school_train) %>%
step_date(date_est, date_ops) %>%
step_rm(date_est, date_ops) %>%
textrecipes::step_clean_levels(village) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe)
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5)
lr_reg_grid %>% top_n(5)
lr_res <-
lr_workflow %>%
tune_grid(val_set,
grid = lr_reg_grid,
control = control_grid(save_pred = TRUE, verbose = TRUE),
metrics = metric_set(roc_auc))
The console says
x validation: preprocessor 1/1: Error in `bake()`:
! Only one factor...
Warning message:
All models failed. See the `.notes` column.
This error comes from step_dummy() because the variable X33 only has one factor "TRUE". The easiest way to deal with this in your problem is to use step_zv() on the nominal predictors before step_dummy().
This would make your recipe look like
lr_recipe <-
recipe(afiliasi ~ ., data = school_train) %>%
step_date(date_est, date_ops) %>%
step_rm(date_est, date_ops) %>%
textrecipes::step_clean_levels(village) %>%
step_zv(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
Reprex showing what is happening:
library(recipes)
mtcars$fac1 <- "h"
mtcars$fac2 <- rep(c("a", "b"), length.out = nrow(mtcars))
recipe(mpg ~ ., data = mtcars) %>%
step_dummy(all_nominal_predictors()) %>%
prep()
#> Error in `bake()`:
#> ! Only one factor level in fac1: h
recipe(mpg ~ ., data = mtcars) %>%
step_zv(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
prep()
#> Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 12
#>
#> Training data contained 32 data points and no missing data.
#>
#> Operations:
#>
#> Zero variance filter removed fac1 [trained]
#> Dummy variables from fac2 [trained]
Here's an example with mtcars:
# Add a column with only one value
mtcars$constant_col <- 1
# Remove any columns with only one value
mtcars[sapply(mtcars, function(x) length(unique(x)) == 1)] <- NULL

Creating a tbl_continuous with weight data using gtsummary

Problem:
Can't find a way to create a tbl_continuous of a weighted numerical variable. I'm using tbl_svysummary to create my categorical variable tables but it's not useful when i try to do the same with numerical variable tables. Note: i'm not looking for a general mean of my numeric variable but separated by or groupped by a categorical variable.
Attempt:
For example, i've created this table with the help of the function tbl_continuous which does exactly what i want: the mean of my numeric variable but by the levels of my categorical variable. The only problem is that i can't insert a weight variable into this function.
```{r}
base2 %>%
as_label() %>%
select(ing_cap, ano, nacional, dominio) %>%
tbl_continuous(variable = ing_cap,
by = ano,
statistic = list(everything() ~ "{median}"))
```
Also, i have been creating weighted data with the srvyr package in the following way:
base2 %>%
labelled::drop_unused_value_labels() %>%
as_label() %>%
as_survey_design(weight = fac500a)
Could add up to the solution.
Request:
Create this same table (shown in the image) but with a weight variable. My weight variable in my data is called fac500a.
My data:
My data can be dowloaded from my github repo and has the following dimensions:
> dim(base2)
[1] 108103 44
https://github.com/aito123/quarto_blog/raw/master/posts/tablas_tesis/base2.sav
(dput output is long)
My current packages:
I'm using this r packages so far: tidyverse, srvyr, gtsummary, sjlabelled, haven
Conclusion:
Let me know if it's neccesary to provide more information.
The gtsummary package does not export an analogous function of tbl_continuous() for survey data. But you can construct the table. Example below!
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.0'
svy_trial <- survey::svydesign(ids = ~1, data = trial, weights = ~1)
tbl <-
svy_trial %>%
tbl_strata2(
strata = grade,
~ .x %>%
tbl_svysummary(
by = trt,
include = age,
missing = "no",
label = list(age = .y)
) %>%
modify_header(all_stat_cols() ~ "**{level}**"),
.combine_with = "tbl_stack",
.combine_args = list(group_header = NULL)
) %>%
modify_table_body(
~ .x %>%
mutate(variable = "grade", row_type = "level") %>%
tibble::add_row(
row_type = "label",
variable = "grade",
label = "Grade",
.before = 1L
)
) %>%
modify_column_indent(columns = label, rows = row_type == "level") %>%
bold_labels() %>%
modify_spanning_header(all_stat_cols() ~ "**Treatment**") %>%
modify_footnote(all_stat_cols() ~ "Age: Median (IQR)")
Created on 2022-05-16 by the reprex package (v2.0.1)

Grouping Rows in GTSummary

I am trying to group some rows/variables (both categorical and continuous) to help with the table readability in a large dataset.
Here is the dummy dataset:
library(gtsummary)
library(tidyverse)
library(gt)
set.seed(11012021)
# Create Dataset
PIR <-
tibble(
siteidn = sample(c("1324", "1329", "1333", "1334"), 5000, replace = TRUE, prob = c(0.2, 0.45, 0.15, 0.2)) %>% factor(),
countryname = sample(c("NZ", "Australia"), 5000, replace = TRUE, prob = c(0.3, 0.7)) %>% factor(),
hospt = sample(c("Metropolitan", "Rural"), 5000, replace = TRUE, prob = c(0.65, 0.35)) %>% factor(),
age = rnorm(5000, mean = 60, sd = 20),
apache2 = rnorm(5000, mean = 18.5, sd=10),
apache3 = rnorm(5000, mean = 55, sd=20),
mechvent = sample(c("Yes", "No"), 5000, replace = TRUE, prob = c(0.4, 0.6)) %>% factor(),
sex = sample(c("Female", "Male"), 5000, replace = TRUE) %>% factor(),
patient = TRUE
) %>%
mutate(patient_id = row_number())%>%
group_by(
siteidn) %>% mutate(
count_site = row_number() == 1L) %>%
ungroup()%>%
group_by(
patient_id) %>% mutate(
count_pt = row_number() == 1L) %>%
ungroup()
Then I use the following code to generate my table:
t1 <- PIR %>%
select(patientn = count_pt, siten = count_site, age, sex, apache2, apache3, apache2, mechvent, countryname) %>%
tbl_summary(
by = countryname,
missing = "no",
statistic = list(
patientn ~ "{n}",
siten ~ "{n}",
age ~ "{mean} ({sd})",
apache2 ~ "{mean} ({sd})",
mechvent ~ "{n} ({p}%)",
sex ~ "{n} ({p}%)",
apache3 ~ "{mean} ({sd})"),
label = list(
siten = "Number of ICUs",
patientn = "Number of Patients",
age = "Age",
apache2 = "APACHE II Score",
mechvent = "Mechanical Ventilation",
sex = "Sex",
apache3 = "APACHE III Score")) %>%
modify_header(stat_by = "**{level}**") %>%
add_overall(col_label = "**Overall**")
t2 <- PIR %>%
select(patientn = count_pt, siten = count_site, age, sex, apache2, apache3, apache2, mechvent, hospt) %>%
tbl_summary(
by = hospt,
missing = "no",
statistic = list(
patientn ~ "{n}",
siten ~ "{n}",
age ~ "{mean} ({sd})",
apache2 ~ "{mean} ({sd})",
mechvent ~ "{n} ({p}%)",
sex ~ "{n} ({p}%)",
apache3 ~ "{mean} ({sd})"),
label = list(
siten = "Number of ICUs",
patientn = "Number of Patients",
age = "Age",
apache2 = "APACHE II Score",
mechvent = "Mechanical Ventilation",
sex = "Sex",
apache3 = "APACHE III Score")) %>%
modify_header(stat_by = "**{level}**")
tbl <-
tbl_merge(
tbls = list(t1, t2),
tab_spanner = c("**Country**", "**Hospital Type**")
) %>%
modify_spanning_header(stat_0_1 ~ NA) %>%
modify_footnote(everything() ~ NA)
This produces the following table:
I would like to group certain rows together for ease of reading. Ideally, I would like the table to look like this:
I have attempted using the gt package, with the following code:
tbl <-
tbl_merge(
tbls = list(t1, t2),
tab_spanner = c("**Country**", "**Hospital Type**")
) %>%
modify_spanning_header(stat_0_1 ~ NA) %>%
modify_footnote(everything() ~ NA) %>%
as_gt() %>%
gt::tab_row_group(
group = "Severity of Illness Scores",
rows = 7:8) %>%
gt::tab_row_group(
group = "Patient Demographics",
rows = 3:6) %>%
gt::tab_row_group(
group = "Numbers",
rows = 1:2)
This produces the desired table:
There are a couple of issues I'm having with the way that I'm doing this.
When I try to use the row names (variables), an error message comes up (Can't subset columns that don't exist...). Is there a way to do this by using the variable names? With larger tables, I am getting into some trouble with using the row numbers method of assigning row names. This is particularly true when there is a single variable that loses its place as it's moved to the end to account for the grouped rows.
Is there a way to do this prior to piping into tbl_summary? Although I like the output of this table, I use Word as my output document for statistical reports and would like the ability to be able to format the tables in Word if need be (or by my collaborators). I usually use gtsummary::as_flextable for table output.
Thanks again,
Ben
When I try to use the row names (variables), an error message comes up (Can't subset columns that don't exist...). Is there a way to do this by using the variable names? With larger tables, I am getting into some trouble with using the row numbers method of assigning row names. This is particularly true when there is a single variable that loses its place as it's moved to the end to account for the grouped rows.
There are two ways to go about this, 1. build separate tables for each group, then stack them, and 2. add a grouping column to .$table_body then group the tibble by the new variable.
library(gtsummary)
library(dplyr)
packageVersion("gtsummary")
#> '1.3.6'
# Method 1 - Stack separate tables
t1 <- trial %>% select(age) %>% tbl_summary()
t2 <- trial %>% select(grade) %>% tbl_summary()
tbl1 <-
tbl_stack(
list(t1, t2),
group_header = c("Demographics", "Tumor Characteristics")
) %>%
modify_footnote(all_stat_cols() ~ NA)
# Method 2 - build a grouping variable
tbl2 <-
trial %>%
select(age, grade) %>%
tbl_summary() %>%
modify_table_body(
mutate,
groupname_col = case_when(variable == "age" ~ "Deomgraphics",
variable == "grade" ~ "Tumor Characteristics")
)
2.Is there a way to do this prior to piping into tbl_summary? Although I like the output of this table, I use Word as my output document for statistical reports and would like the ability to be able to format the tables in Word if need be (or by my collaborators). I usually use gtsummary::as_flextable for table output.
The examples above modify the table before exporting to gt format, so you can export these example to flextable. However, flextable does not have the same built-in header row functionality (or at least I am unaware of it, and don't use it in as_flex_table()), and the output would look like the table below. I recommend installing the dev version of gt from GitHub and export to RTF (supported by Word)--they've made many updates to RTF output in the last months, and it may work for you.
I think I might have a solution for this (thanks, obviously, to Daniel Sjoberg and team providing us with the modify_table_body function)
All you need to do is edit the underlying data frame to add a variable with your desired grouping row using modify_table_body, and then put it in the position you want it to be in, like this:
library(gtsummary)
library(dplyr)
packageVersion("gtsummary")
trial%>%
select(age, stage, grade)%>%
tbl_summary()%>%
modify_table_body(
~.x %>%
# add your variable
rbind(
tibble(
variable="Demographics",
var_type=NA,
var_label = "Demographics",
row_type="label",
label="Demographics",
stat_0= NA))%>% # expand the components of the tibble as needed if you have more columns
# can add another one
rbind(
tibble(
variable="Tumor characteristics",
var_type=NA,
var_label = "Tumor characteristics",
row_type="label",
label="Tumor characteristics",
stat_0= NA))%>%
# specify the position you want these in
arrange(factor(variable, levels=c("Demographics",
"age",
"Tumor characteristics",
"stage",
"grade"))))%>%
# and you can then indent the actual variables
modify_column_indent(columns=label, rows=variable%in%c("age",
"stage",
"grade"))%>%
# and double indent their levels
modify_column_indent(columns=label, rows= (variable%in%c("stage",
"grade")
& row_type=="level"),
double_indent=T)

Resources