Treatment effect table in R with horizontally-oriented variables - r

I subsetted a data frame to keep only my 4 columns of interest. I want to count the number of control (0) and treated (1) observations. I built a table with the gtsummary package, but the variables are oriented vertically, one below the other (like here: http://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html), and this is not what I want. I searched on Google, but all the tables I saw have this orientation too.
I have attached a picture of what I'd like to obtain, in case anyone has an idea.
Code I use to obtain my initial table (same as in the link):
install.packages("gtsummary")
library(gtsummary)
trial <- finaldf %>% select(treatment, `2digID`, `4digID`, classificationsdescription)
trial %>% tbl_summary()
t2 <- trial %>% tbl_summary(by = treatment)
I cannot share the real data, but I created an example that looks like my data:
# Names that start with an underscore or a digit must be wrapped in backticks
`_2ID` <- c(38,38,38,38,38,38,38,38,38,38,80,80,80,80,80,80,80,80,80,80)
`_4ID` <- c(3837,3837,3837,3812,3812,3896,3894,3894,3877,3877, 8099,8099,8027,8027,8027,8033,8033,8064,8064,8022)
descriptions <- c('ILL1','ILL1','ILL1', 'ILL2','ILL2','ILL3','ILL4','ILL4','ILL5','ILL5','ILL1','ILL1','ILL2','ILL2','ILL2','ILL3','ILL3','ILL4','ILL4','ILL5')
trt <- c(0,0,0,1,1,1,0,0,1,1,0,0,1,1,1,0,0,1,1,0)
# check.names = FALSE keeps the column names exactly as written
df.data <- data.frame(`_2ID` = `_2ID`, `_4ID` = `_4ID`, descriptions, trt, check.names = FALSE)
UPDATE - SOLVED
I think I managed to solve this problem, even though my output is a data frame and not a "publication-ready" table:
install.packages("reshape2")
library(reshape2)
data_wide <- dcast(df, `_2digID` + `_4digID` + descriptions ~ treatment, value.var = "counts")
But I'm not sure yet that this gives the right numbers.
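One way to check the numbers is to count the control and treated observations directly for each ID/description combination. Below is a minimal sketch in base R, assuming the example data above with the two ID columns renamed to id2 and id4 (hypothetical names, chosen so that no backticks are needed):

```r
# Example data from the question; id2 and id4 are renamed columns (assumption)
df.data <- data.frame(
  id2 = rep(c(38, 80), each = 10),
  id4 = c(3837, 3837, 3837, 3812, 3812, 3896, 3894, 3894, 3877, 3877,
          8099, 8099, 8027, 8027, 8027, 8033, 8033, 8064, 8064, 8022),
  descriptions = c("ILL1", "ILL1", "ILL1", "ILL2", "ILL2", "ILL3", "ILL4",
                   "ILL4", "ILL5", "ILL5", "ILL1", "ILL1", "ILL2", "ILL2",
                   "ILL2", "ILL3", "ILL3", "ILL4", "ILL4", "ILL5"),
  trt = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0)
)

# One row per observed combination, with control and treated counts side by side
counts <- aggregate(
  cbind(control = as.integer(trt == 0),
        treated = as.integer(trt == 1)) ~ id2 + id4 + descriptions,
  data = df.data,
  FUN = sum
)

# Sort by the ID columns for readability
counts <- counts[order(counts$id2, counts$id4), ]
counts
```

The control and treated columns should sum to the number of 0s and 1s in trt, which gives a quick sanity check against the dcast() result.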

The example below gets you close, but not exactly what you're after. I like the idea of being able to support tables like this, and I'll add it to the list of features to implement!
library(gtsummary)
#> #Uighur
packageVersion("gtsummary")
#> [1] '1.4.1'
tbl <-
  trial %>%
  mutate(
    grade = paste("Grade", as.character(grade)),
    stage = paste("Stage", as.character(stage))
  ) %>%
  tbl_strata(
    strata = c(stage, grade),
    ~ .x %>%
      tbl_summary(by = trt,
                  include = response,
                  type = response ~ "categorical",
                  missing = "no",
                  statistic = response ~ "{n}") %>%
      modify_header(all_stat_cols() ~ "**{level}**"),
    .combine_with = "tbl_stack"
  ) %>%
  as_flex_table()
Table truncated because it was very long!
Created on 2021-07-14 by the reprex package (v2.0.0)

Related

how to use R package `caret` to run `pls::plsr( )` with multiple responses

caret::train() does not seem to accept y if y is a matrix with multiple columns.
Thanks for any help!
That's correct. Perhaps you want the tidymodels package? Kuhn has said there would be support for multivariate response in it. Here's evidence in favor of my suggestion: https://www.tidymodels.org/learn/models/pls/
Do a search of that document for plsr:
library(tidymodels)
library(pls)
get_var_explained <- function(recipe, ...) {
  # Extract the predictors and outcomes into their own matrices
  y_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_outcomes())
  x_mat <- bake(recipe, new_data = NULL, composition = "matrix", all_predictors())
  # The pls package prefers the data in a data frame where the outcome
  # and predictors are in _matrices_. To make sure this is formatted
  # properly, use the `I()` function to inhibit `data.frame()` from making
  # all the individual columns. `pls_format` should have two columns.
  pls_format <- data.frame(
    endpoints = I(y_mat),
    measurements = I(x_mat)
  )
  # Fit the model
  mod <- plsr(endpoints ~ measurements, data = pls_format)
  # Get the proportion of the predictor variance that is explained
  # by the model for different number of components.
  xve <- explvar(mod)/100
  # To do the same for the outcome, it is more complex. This code
  # was extracted from pls:::summary.mvr.
  explained <-
    drop(pls::R2(mod, estimate = "train", intercept = FALSE)$val) %>%
    # transpose so that components are in rows
    t() %>%
    as_tibble() %>%
    # Add the predictor proportions
    mutate(predictors = cumsum(xve) %>% as.vector(),
           components = seq_along(xve)) %>%
    # Put into a tidy format that is tall
    pivot_longer(
      cols = c(-components),
      names_to = "source",
      values_to = "proportion"
    )
}
# We compute this data frame for each resample and save the results in the different columns.
folds <-
  folds %>%
  mutate(var = map(recipes, get_var_explained),
         var = unname(var))
# To extract and aggregate these data, simple row binding can be used to stack the data vertically. Most of the action happens in the first 15 components so let’s filter the data and compute the average proportion.
variance_data <-
  bind_rows(folds[["var"]]) %>%
  filter(components <= 15) %>%
  group_by(components, source) %>%
  summarize(proportion = mean(proportion))
This might not be a reproducible code block. May need additional data or packages.

Problem `.x` is empty in pammtools packages

I am trying to replicate the example code in Bender and Scheipl for Piece-wise exponential Additive Mixed Modeling tools, specifically a survival exercise with time-varying effects.
https://arxiv.org/pdf/1806.01042.pdf
library(dplyr); library(tidyr); library(purrr); library(ggplot2)
library(survival); library(mgcv); library(pammtools)
data("pbc", package="survival")
# event time information
pbc <- pbc %>%
  filter(id <= 312) %>%
  mutate(status = ifelse(status == 0, 0, 1)) %>%
  select(id:status, trt:sex, bili, protime)
pbc %>% slice(1:6)
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(pbc$time, pbc$status)~sex|concurrent(bili, protime, tz_var = "day"),
  id = "id")
I always get the error
Error: .x is empty, and no .init supplied
I installed and checked Rtools, and I tried different (older) versions of purrr, since this error is sometimes related to the purrr version. I also tried running the code on https://rdrr.io/snippets/.
Any ideas? Thank you very much.
You have not used the code in that vignette. You also added pbc$ to the arguments in Surv(), which is a common mistake but generally not a productive strategy.
# Need to narrow the material from pbcseq
pbcseq <- pbcseq %>% select(id, day, bili, protime)
# I would have given it a different name
#------ Error when using "|" rather than "+"
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(time, status)~sex|concurrent(bili, protime, tz_var = "day"),
  id = "id")
#Error: `.x` is empty, and no `.init` supplied
#________________
pbc_ped <- as_ped(
  data = list(pbc, pbcseq),
  formula = Surv(time, status)~sex + concurrent(bili, protime, tz_var = "day"),
  id = "id") # No error
I think there may be an error in the vignette. I don't see any examples using the construct ...
Surv(time,status)~ variates | special(.)
They all use a "+" sign for adding the time-dependent covariates. If you go to https://adibender.github.io/pammtools//articles/data-transformation.html you see them using a "+" rather than a "|". I think there is some sloppiness in that package's documentation. But your additions only made the problem worse.

lm function is giving a warning that it is dropping rows

This is my question:
Do the developers that make more games charge higher prices?
My code:
dev_data <- steam_data_final %>%
  group_by(developer) %>%
  summarize(num_dev = n(), avg_price = mean(price, na.rm = TRUE)) %>%
  arrange(desc(num_dev))
dev_data
But this model isn't working. I get "Warning: Dropping 3038 rows with missing values":
mod_dev <- lm(num_dev ~ avg_price, data = dev_data)
Check whether any column contains NA values, using summary() or is.na() on each column. If one does, that is why lm() gives you the warning message.
Also, it seems you need lm(avg_price ~ num_dev, data = dev_data) instead of lm(num_dev ~ avg_price, data = dev_data): the dependent variable should probably be avg_price, not num_dev. (It depends on your research question.)
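To make the check concrete, here is a minimal sketch on a toy data frame. The values are made up for illustration; only the column names follow the question. Note that mean(price, na.rm = TRUE) returns NaN for a developer whose prices are all missing, which is one possible source of missing values in avg_price.

```r
# Toy stand-in for dev_data; the numbers are invented for illustration
dev_data <- data.frame(
  developer = c("A", "B", "C", "D"),
  num_dev   = c(5, 3, 2, 1),
  avg_price = c(9.99, 4.99, NaN, 19.99)
)

# A group with only missing prices yields NaN, which is.na() also flags
all_missing_mean <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
all_missing_mean

# Count missing values per column
na_counts <- colSums(is.na(dev_data))
na_counts

# With the default na.action option, lm() drops incomplete rows before fitting
mod_dev <- lm(avg_price ~ num_dev, data = dev_data)

# Number of rows actually used in the fit
nobs(mod_dev)
```

Comparing nobs(mod_dev) with nrow(dev_data) shows how many rows were dropped.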

How do I set which level is the "event" in my outcome variable using tidymodels?

I am using tidymodels for machine learning and I want to predict a binary response/outcome. How do I specify which level of the outcome is the "event" or positive case?
Does this happen in the recipe, or somewhere else?
##split the data
anxiety_split <- initial_split(anxiety_df, strata = anxiety)
anxiety_train <- training(anxiety_split)
anxiety_test <- testing(anxiety_split)
set.seed(1222)
anxiety_cv <- vfold_cv(anxiety_train, strata = anxiety)
anxiety_rec <- recipe(anxiety ~ ., data = anxiety_train, positive = 'pos') %>%
  step_corr(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric())
You don't need to set which level of your outcome variable is the "event" until it is time to evaluate your model. You can do this using the event_level argument of most yardstick functions. For example, check out how to do this for yardstick::roc_curve():
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
library(tidyverse)
data(two_class_example)
## looks good!
two_class_example %>%
  roc_curve(truth, Class1, event_level = "first") %>%
  autoplot()
## YIKES!! we got this backwards
two_class_example %>%
  roc_curve(truth, Class1, event_level = "second") %>%
  autoplot()
Created on 2020-08-02 by the reprex package (v0.3.0.9001)
Notice the message on startup for yardstick; the first factor level is assumed to be the event. This is similar to how base R acts. You only need to worry about event_level if your "event" is not the first factor level.
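If you would rather make the event the first level than pass event_level to each metric, the factor levels can be reordered before modeling. Here is a small base R sketch, assuming the outcome has the levels "neg" and "pos" (hypothetical labels, suggested by the positive = 'pos' attempt in the question):

```r
# Hypothetical outcome; factor() sorts levels alphabetically by default
anxiety <- factor(c("neg", "pos", "pos", "neg"))
levels(anxiety)

# Move "pos" to the front so it becomes the first factor level
anxiety <- relevel(anxiety, ref = "pos")
levels(anxiety)
```

Checking levels() on the outcome column before fitting is a quick way to confirm which level yardstick will treat as the event.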

R gtsummary package: How to Manipulate / Hide Rows in Summary Table

I am working on a project with gtsummary. For one of the tables, I have to build a long table listing covariables before and after the matchit process.
My issue is that each covariable (Obesity, for example) takes up three rows: one row reads Obesity, the next Obese, and the next Not Obese. I wish to show only one row per covariable, for example: Diabetes N (%).
I have tried editing dichotomous variables, introducing NULL, and looking for a row_hide function, but to no avail.
Here is my code:
Creation of trial
trialCAS1 <- index_CAS %>%
  select(TopDecile, Gender, Obesity, Diabetes, Diabetes_Complex, etc)
Tbl summary
CAStable1 <- tbl_summary(trialCAS1,
                         by = TopDecile,
                         missing = "no") %>%
  add_n() %>%
  modify_header(label = "**Variable**") %>%
  bold_labels()
I included the first table I get.
The tbl_summary() function tries its best to guess the type of data passed (categorical, dichotomous, and continuous). It doesn't always guess what we'd like to see, but the default can always be changed using arguments in tbl_summary()! I'll use the trial data set in the {gtsummary} package as an example.
Here is the default output:
library(gtsummary)
trial %>%
  select(trt, grade, stage) %>%
  tbl_summary(by = trt)
By default, the summary statistics for grade and stage are shown on multiple rows. Imagine, however, we are only interested in the rate of Grade I disease and the rate of Stage T1 cancer. We can use the tbl_summary(value=) argument to specify these are the only values we want displayed (which will then default to print these as dichotomous variables). In the example below, I have also updated the label displayed to indicate these are Grade I and Stage T1 rates only.
trial %>%
  select(trt, grade, stage) %>%
  tbl_summary(
    by = trt,
    value = list(grade ~ "I",
                 stage ~ "T1"),
    label = list(grade ~ "Grade I",
                 stage ~ "Stage T1")
  )
Based on what I see from your code and output, I think this code will work for you on your data set:
tbl_summary(
  trialCAS1,
  by = TopDecile,
  missing = "no",
  value = Obesity ~ "Obese",
  label = Obesity ~ "Obese"
)
