How to mask certain cell's info from a gtsummary table - r

I have some sensitive info in my data that I need to hide below a certain threshold (to comply with DUA and prevent reidentifying data). I'm using tbl_svysmmary() from gtsummary. In my example, I like to filter "cell size ≤ 100":
library(gtsummary)
library(survey)
tbl_svysummary <-
svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq) %>%
tbl_svysummary(by = Survived, percent = "row", include = c(Class, Age))
tbl_svysummary
I want to show child info as:

You can do more tidyverse manipulation on the output object like this (using the table_body):
library(tidyverse)
data(mtcars)
library(gtsummary)
output <- mtcars[,1:2] %>% tbl_summary()
output$table_body
# A tibble: 5 × 6
variable var_type var_label row_type label stat_0
<chr> <chr> <chr> <chr> <chr> <chr>
1 mpg continuous mpg label mpg 19.2 (15.4, 22.8)
2 cyl categorical cyl label cyl NA
3 cyl categorical cyl level 4 11 (34%)
4 cyl categorical cyl level 6 7 (22%)
5 cyl categorical cyl level 8 14 (44%)
# Don't show cell information when N of cyl is less than 10
output$table_body <- output$table_body %>%
mutate(extra = stat_0) %>%
separate(extra, c("number"), sep = ' \\(') %>%
mutate(number = as.numeric(number)) %>%
mutate(stat_0 = case_when(number < 10 & var_type == "categorical" ~ "TOO FEW",
TRUE ~ stat_0)) %>%
select(!number)
output
The filter decision in case_when is based on categorical variables, where you like to control the cell information, when it is less than a threshold value. You can adjust this for any other variable or condition.

Related

Running Levene's test for each column of a df in R

I have a data frame containing scores of several sub-scales of the same test (columns: participant, session, group, total score, one column per sub-scale). I am trying to run assumption checks for a two-way mixed ANOVA for each sub-scale. For convenience, I would like to write one loop per assumption check, that gives me the output for all sub-scales. This worked well for checking outliers, running Box's M test and for generating the actual ANOVA output. However, I get an error when trying the same thing with Levene's test. See code and errors below:
subscales <- c("awareness", "clarity", "impulse", "goals", "nonacceptance",
"strategies") # these correspond to the column names in the df
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(scale ~ group) %>%
kable(caption = scale) %>% print()
}
Error in mutate(., data = map(.data$data, .f, ...)) :
Caused by error in model.frame.default():
! variable lengths differ (found for 'group')
How can I run Levene's test for all columns in my df without just repeating the same code over and over? I'm new to R, so maybe I'm trying in a too pythonist kind of way and should use something like lapply() instead?
Create the formula with reformulate as the scale will be quoted string and thus, it needs the formula to be constructed either with reformulate or paste
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(reformulate('group', response = scale)) %>%
kable(caption = scale) %>% print()
}
This maybe also done with across
library(dplyr)
library(stringr)
library(tidyr)
library(rstatix)
data(mtcars)
mtcars %>%
mutate(carb = factor(carb)) %>%
group_by(cyl) %>%
summarise(across(c(mpg, disp),
~ levene_test(cur_data(),
reformulate('carb', response = cur_column())) %>%
rename_with(~ str_c(cur_column(), .x), everything()) )) %>%
unpack(where(is.tibble))
-output
# A tibble: 3 × 9
cyl mpgdf1 mpgdf2 mpgstatistic mpgp dispdf1 dispdf2 dispstatistic dispp
<dbl> <int> <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
1 4 1 9 0.975 0.349 1 9 1.32e- 1 7.24e- 1
2 6 2 4 2.52 0.196 2 4 7.44e+29 7.23e-60
3 8 3 10 1.60 0.251 3 10 1.18e+ 1 1.27e- 3

Extract restricted regression coefficients without using tidy

I'm using the restriktor package to perform restricted regressions; however, at the same time I'm doing the restricted regressions by group using the dplyr. In order to extract the coefficients and have them formatted into a nice panel format, I use tidy and broom but the tidy packaged doesn't work on the restriktor so I'm not sure how to go about extracting the coefficients:
library(restriktor)
library(dplyr)
reg =
mtcars %>%
group_by(cyl) %>%
do(model = restriktor(lm(mpg ~ wt + hp, data =.), constraints = ' wt < -4 '))
I would like to have the b.restr which is the restricted model coefficients to be extracted for each group and formatted together into a panel normally I would use the following:
reg =
mtcars %>%
group_by(cyl) %>%
do({model = restriktor(lm(mpg ~ wt + hp, data =.), constraints = ' wt < -4 ') # create your model
data.frame(tidy(model), # get coefficient info
glance(model))})
But I get the following error:
Error: No tidy method for objects of class restriktor
All I want is to extract the following elements from the lists and put them altogether with their group identifier in one panel format:
reg[[2]][[1]][["b.restr"]]
Use group_modify (which is preferred over do now) with coef/as.list/as_tibble.
library(dplyr)
library(restriktor)
# coefficients and R2's or NAs if too few rows for restriktor
co <- function(fo, data) {
fm <- lm(fo, data)
coef_lm <- coef(fm)
min_rows <- length(coef_lm)
if (nrow(data) <= min_rows) NA * c(coef_lm, R2.org = NA, R2.reduced = NA)
else {
r <- restriktor(fm, constraints = ' wt < -4 ')
c(coef(r), R2.org = r$R2.org, R2.reduced = r$R2.reduced)
}
}
mtcars %>%
group_by(cyl) %>%
group_modify(~ {
.x %>%
co(mpg ~ wt + hp, .) %>%
as.list %>%
as_tibble
}) %>%
ungroup
giving:
tibble: 3 x 6
cyl `(Intercept)` wt hp R2.org R2.reduced
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 45.8 -5.12 -0.0905 0.681 0.681
2 6 35.3 -4 -0.0256 0.589 0.667
3 8 33.9 -4 -0.0132 0.497 0.652

Keep identifying features when using map() to "spread_predictions" in R?

I am trying to build a linear model, and then make predictions with new data based on that linear model. The following chunk of code takes a given set of data (data1), and produces 20 models based on the fact that when I group by ID and plot, there are 20 groups:
modelobject <- data_1 %>%
group_by(ID, plot) %>%
do(model = lm(air_temp ~ water_temp, data = .)) %>%
ungroup()
Now that the model is designed, I want to use the map() function to make predictions across a new set of data (data_2) for each of those models:
modelled_values <- map(modelobject$model, ~ spread_predictions(data = data_2, models = .x))
This works great, except for the fact that the subsequent object modelled_values doesn't have the identifying features of the original models (i.e. their given ID and plot) as can be seen in the following output for the Value column (it produces 11 columns, none of which are identifying features):
Value
List of length 20
A data.frame with 52606 rows and 11 columns
....
I have ended up having to assume that they are just in the order I produced them in and manually label each model object with the following style of code:
modelled_values[[1]]$ID <- "ID1"
modelled_values[[2]]$ID <- "ID1"
modelled_values[[3]]$ID <- "ID2"
modelled_values[[4]]$ID <- "ID2"
...
Is there any way I can carry the identifying features of the original models over to these predicted data?
What about something like this:
modelobject <- mtcars %>%
group_by(vs, am) %>%
do(model = lm(mpg ~ hp, data = .))
preds <- modelobject %>%
group_by(vs, am) %>%
rowwise %>%
summarise(preds = list(predict(model, newdata=mtcars)))
preds
# # A tibble: 4 x 3
# # Groups: vs, am [4]
# vs am preds
# <dbl> <dbl> <list>
# 1 0 0 <dbl [32]>
# 2 0 1 <dbl [32]>
# 3 1 0 <dbl [32]>
# 4 1 1 <dbl [32]>
In the code above, preds is now a tibble with a column called preds where each element is as vector of predictions from the model for the relevant vs and am values in the row.

Executing a statistical test across multiple subsets using purrr map

I'm trying to use purr map functionality to create a number of sub-groups from a dataframe so that I can run a statistical test on each sub-group. So using mtcars as a sample data set, I can determine the set of unique carb values from:
mtcars %>% {unique(.$carb)}
gives [1] 4 1 2 3 6 8
Similarly, the set of unique gear values:
mtcars %>% {unique(.$gear)}
gives [1] 4 3 5
I'd like to iterate through the unique combinations of carb and gear and use this as a way to subset values within mtcars, so that I can perform a statistical test on each subset (as indexed by gear and carb). So the test would be:
data_subset %>% kruskal.test(.$mpg, .$am, data = .)
I've tried to do this using map from purrr. Something along the lines of:
library(purrr)
mtcars %>%
{unique(.$carb)} %>%
map2(mtcars, ~filter(.y, am == .x))
For most combinations of carb/gear in mtcars, there is only one value of am. From my limited understanding of the help & error messages, you need multiple groups (am in you example) to run the test.
library(tidyverse)
# Step 1 - limit to testable data
count(mtcars, carb, gear, am) %>%
count(carb, gear) %>% # Count am possibilities w/in each carb/gear group
filter(n > 1) %>%
left_join(mtcars) -> mtcars_mult_am
# Step 2 - nest, map each group to test, unnest
mtcars_mult_am %>%
nest(data = -c(carb, gear)) %>%
mutate(kruskal_raw = map(data, ~ kruskal.test(.x$mpg, .x$am)),
kruskal = map(kruskal_raw, broom::tidy)) %>%
select(-data) %>%
unnest(kruskal)
# A tibble: 2 x 7
carb gear kruskal_raw statistic p.value parameter method
<dbl> <dbl> <list> <dbl> <dbl> <int> <chr>
1 2 4 <S3: htest> 0 1 1 Kruskal-Wallis rank sum test
2 4 4 <S3: htest> 2.67 0.102 1 Kruskal-Wallis rank sum test

ANOVA F statistic value in dplyr group by

I can summarize the mean by groups using
t(mtcars %>%
group_by(gear) %>%
dplyr::summarize(Mean_Mpg = mean(mpg, na.rm=TRUE),
StdD_Mpg = sd(mpg, na.rm=TRUE)
))
gear 3 4 5
Mean_Mpg 16.106667 24.533333 21.380000
StdD_Mpg 3.371618 5.276764 6.658979
I know summary(aov(gear ~ mpg , mtcars)) will output the results from ANOVA test includign the F Statistic.
Df Sum Sq Mean Sq F value Pr(>F)
mpg 1 3.893 3.893 8.995 0.0054 **
Residuals 30 12.982 0.433
Also chisq.test(table(mtcars$gear,mtcars$carb)) will output the results from Chi.Square test.
Pearson's Chi-squared test
X-squared = 16.518, df = 10, p-value = 0.08573
What I am trying to do is produce an output like this below, where I am combining the mean, standard deviation and F Statistic value from ANOVA, X-Squared test statistic.
gear 3 4 5 Test-Statistic Test
Mpg (Mean) 16.106667 24.533333 21.380000 8.995 ANOVA
(StdD) 3.371618 5.276764 6.658979
Carb(N) 16.518 Chi.Square
3 4 0
4 4 2
3 0 0
5 4 1
0 0 1
0 0 1
I am not sure how to do put together a table like this this by combining the mean,standard deviation, F Statistic, Chiq.Square statistic values etc. I would welcome any help from the community on formatting the results like this.
One option is to think about all the results you want, and how to manipulate them in order to have a same structure. Then, use bind_rows() for instance, to gather all results in a same table.
The functions group_by() and summarise() able to calculate mean (and others) for severals variables (and the result is a data.frame), whereas the function apply() allow to apply a same function, or a combinaison of functions (like summary(aov(...))) to several variables. The result of the second is a vector.
library(tidyverse)
# mean (± sd) of x per group
mtcars %>%
group_by(gear) %>%
summarise_at(
vars(mpg, carb),
funs(paste0(round(mean(.), 2), '(±', round(sd(.) / sqrt(n()), 1), ')'))
) %>%
mutate(gear = as.character(gear)) %>%
# add ANOVA: gear ~ x
bind_rows(
c(gear = 'ANOVA',
apply(mtcars %>% select(mpg, carb), 2,
function(x) summary(aov(mtcars$gear ~ x))[[1]]$`F value`[1] %>% round(3) %>% as.character()
))
) %>%
# add Chi-Square: gear ~ x
bind_rows(
c(gear = 'CHI-SQUARE',
apply(mtcars %>% select(mpg, carb), 2,
function(x) chisq.test(table(mtcars$gear, x))$statistic %>% round(3) %>% as.character()
))
)
# # A tibble: 5 x 3
# gear mpg carb
# <chr> <chr> <chr>
# 1 3 16.11(±0.9) 2.67(±0.3)
# 2 4 24.53(±1.5) 2.33(±0.4)
# 3 5 21.38(±3) 4.4(±1.2)
# 4 ANOVA 8.995 2.436
# 5 CHI-SQUARE 54.667 16.518

Resources