Executing a statistical test across multiple subsets using purrr map - r

I'm trying to use purr map functionality to create a number of sub-groups from a dataframe so that I can run a statistical test on each sub-group. So using mtcars as a sample data set, I can determine the set of unique carb values from:
mtcars %>% {unique(.$carb)}
gives [1] 4 1 2 3 6 8
Similarly, the set of unique gear values:
mtcars %>% {unique(.$gear)}
gives [1] 4 3 5
I'd like to iterate through the unique combinations of carb and gear and use this as a way to subset values within mtcars, so that I can perform a statistical test on each subset (as indexed by gear and carb). So the test would be:
data_subset %>% kruskal.test(.$mpg, .$am, data = .)
I've tried to do this using map from purrr. Something along the lines of:
library(purrr)
mtcars %>%
{unique(.$carb)} %>%
map2(mtcars, ~filter(.y, am == .x))

For most combinations of carb/gear in mtcars, there is only one value of am. From my limited understanding of the help & error messages, you need multiple groups (am in you example) to run the test.
library(tidyverse)
# Step 1 - limit to testable data
count(mtcars, carb, gear, am) %>%
count(carb, gear) %>% # Count am possibilities w/in each carb/gear group
filter(n > 1) %>%
left_join(mtcars) -> mtcars_mult_am
# Step 2 - nest, map each group to test, unnest
mtcars_mult_am %>%
nest(data = -c(carb, gear)) %>%
mutate(kruskal_raw = map(data, ~ kruskal.test(.x$mpg, .x$am)),
kruskal = map(kruskal_raw, broom::tidy)) %>%
select(-data) %>%
unnest(kruskal)
# A tibble: 2 x 7
carb gear kruskal_raw statistic p.value parameter method
<dbl> <dbl> <list> <dbl> <dbl> <int> <chr>
1 2 4 <S3: htest> 0 1 1 Kruskal-Wallis rank sum test
2 4 4 <S3: htest> 2.67 0.102 1 Kruskal-Wallis rank sum test

Related

How to mask certain cell's info from a gtsummary table

I have some sensitive info in my data that I need to hide below a certain threshold (to comply with DUA and prevent reidentifying data). I'm using tbl_svysmmary() from gtsummary. In my example, I like to filter "cell size ≤ 100":
library(gtsummary)
library(survey)
tbl_svysummary <-
svydesign(~1, data = as.data.frame(Titanic), weights = ~Freq) %>%
tbl_svysummary(by = Survived, percent = "row", include = c(Class, Age))
tbl_svysummary
I want to show child info as:
You can do more tidyverse manipulation on the output object like this (using the table_body):
library(tidyverse)
data(mtcars)
library(gtsummary)
output <- mtcars[,1:2] %>% tbl_summary()
output$table_body
# A tibble: 5 × 6
variable var_type var_label row_type label stat_0
<chr> <chr> <chr> <chr> <chr> <chr>
1 mpg continuous mpg label mpg 19.2 (15.4, 22.8)
2 cyl categorical cyl label cyl NA
3 cyl categorical cyl level 4 11 (34%)
4 cyl categorical cyl level 6 7 (22%)
5 cyl categorical cyl level 8 14 (44%)
# Don't show cell information when N of cyl is less than 10
output$table_body <- output$table_body %>%
mutate(extra = stat_0) %>%
separate(extra, c("number"), sep = ' \\(') %>%
mutate(number = as.numeric(number)) %>%
mutate(stat_0 = case_when(number < 10 & var_type == "categorical" ~ "TOO FEW",
TRUE ~ stat_0)) %>%
select(!number)
output
The filter decision in case_when is based on categorical variables, where you like to control the cell information, when it is less than a threshold value. You can adjust this for any other variable or condition.

Running Levene's test for each column of a df in R

I have a data frame containing scores of several sub-scales of the same test (columns: participant, session, group, total score, one column per sub-scale). I am trying to run assumption checks for a two-way mixed ANOVA for each sub-scale. For convenience, I would like to write one loop per assumption check, that gives me the output for all sub-scales. This worked well for checking outliers, running Box's M test and for generating the actual ANOVA output. However, I get an error when trying the same thing with Levene's test. See code and errors below:
subscales <- c("awareness", "clarity", "impulse", "goals", "nonacceptance",
"strategies") # these correspond to the column names in the df
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(scale ~ group) %>%
kable(caption = scale) %>% print()
}
Error in mutate(., data = map(.data$data, .f, ...)) :
Caused by error in model.frame.default():
! variable lengths differ (found for 'group')
How can I run Levene's test for all columns in my df without just repeating the same code over and over? I'm new to R, so maybe I'm trying in a too pythonist kind of way and should use something like lapply() instead?
Create the formula with reformulate as the scale will be quoted string and thus, it needs the formula to be constructed either with reformulate or paste
for (scale in subscales) {
ders %>%
group_by(session) %>%
levene_test(reformulate('group', response = scale)) %>%
kable(caption = scale) %>% print()
}
This maybe also done with across
library(dplyr)
library(stringr)
library(tidyr)
library(rstatix)
data(mtcars)
mtcars %>%
mutate(carb = factor(carb)) %>%
group_by(cyl) %>%
summarise(across(c(mpg, disp),
~ levene_test(cur_data(),
reformulate('carb', response = cur_column())) %>%
rename_with(~ str_c(cur_column(), .x), everything()) )) %>%
unpack(where(is.tibble))
-output
# A tibble: 3 × 9
cyl mpgdf1 mpgdf2 mpgstatistic mpgp dispdf1 dispdf2 dispstatistic dispp
<dbl> <int> <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
1 4 1 9 0.975 0.349 1 9 1.32e- 1 7.24e- 1
2 6 2 4 2.52 0.196 2 4 7.44e+29 7.23e-60
3 8 3 10 1.60 0.251 3 10 1.18e+ 1 1.27e- 3

Keep identifying features when using map() to "spread_predictions" in R?

I am trying to build a linear model, and then make predictions with new data based on that linear model. The following chunk of code takes a given set of data (data1), and produces 20 models based on the fact that when I group by ID and plot, there are 20 groups:
modelobject <- data_1 %>%
group_by(ID, plot) %>%
do(model = lm(air_temp ~ water_temp, data = .)) %>%
ungroup()
Now that the model is designed, I want to use the map() function to make predictions across a new set of data (data_2) for each of those models:
modelled_values <- map(modelobject$model, ~ spread_predictions(data = data_2, models = .x))
This works great, except for the fact that the subsequent object modelled_values doesn't have the identifying features of the original models (i.e. their given ID and plot) as can be seen in the following output for the Value column (it produces 11 columns, none of which are identifying features):
Value
List of length 20
A data.frame with 52606 rows and 11 columns
....
I have ended up having to assume that they are just in the order I produced them in and manually label each model object with the following style of code:
modelled_values[[1]]$ID <- "ID1"
modelled_values[[2]]$ID <- "ID1"
modelled_values[[3]]$ID <- "ID2"
modelled_values[[4]]$ID <- "ID2"
...
Is there any way I can carry the identifying features of the original models over to these predicted data?
What about something like this:
modelobject <- mtcars %>%
group_by(vs, am) %>%
do(model = lm(mpg ~ hp, data = .))
preds <- modelobject %>%
group_by(vs, am) %>%
rowwise %>%
summarise(preds = list(predict(model, newdata=mtcars)))
preds
# # A tibble: 4 x 3
# # Groups: vs, am [4]
# vs am preds
# <dbl> <dbl> <list>
# 1 0 0 <dbl [32]>
# 2 0 1 <dbl [32]>
# 3 1 0 <dbl [32]>
# 4 1 1 <dbl [32]>
In the code above, preds is now a tibble with a column called preds where each element is as vector of predictions from the model for the relevant vs and am values in the row.

ANOVA F statistic value in dplyr group by

I can summarize the mean by groups using
t(mtcars %>%
group_by(gear) %>%
dplyr::summarize(Mean_Mpg = mean(mpg, na.rm=TRUE),
StdD_Mpg = sd(mpg, na.rm=TRUE)
))
gear 3 4 5
Mean_Mpg 16.106667 24.533333 21.380000
StdD_Mpg 3.371618 5.276764 6.658979
I know summary(aov(gear ~ mpg , mtcars)) will output the results from ANOVA test includign the F Statistic.
Df Sum Sq Mean Sq F value Pr(>F)
mpg 1 3.893 3.893 8.995 0.0054 **
Residuals 30 12.982 0.433
Also chisq.test(table(mtcars$gear,mtcars$carb)) will output the results from Chi.Square test.
Pearson's Chi-squared test
X-squared = 16.518, df = 10, p-value = 0.08573
What I am trying to do is produce an output like this below, where I am combining the mean, standard deviation and F Statistic value from ANOVA, X-Squared test statistic.
gear 3 4 5 Test-Statistic Test
Mpg (Mean) 16.106667 24.533333 21.380000 8.995 ANOVA
(StdD) 3.371618 5.276764 6.658979
Carb(N) 16.518 Chi.Square
3 4 0
4 4 2
3 0 0
5 4 1
0 0 1
0 0 1
I am not sure how to do put together a table like this this by combining the mean,standard deviation, F Statistic, Chiq.Square statistic values etc. I would welcome any help from the community on formatting the results like this.
One option is to think about all the results you want, and how to manipulate them in order to have a same structure. Then, use bind_rows() for instance, to gather all results in a same table.
The functions group_by() and summarise() able to calculate mean (and others) for severals variables (and the result is a data.frame), whereas the function apply() allow to apply a same function, or a combinaison of functions (like summary(aov(...))) to several variables. The result of the second is a vector.
library(tidyverse)
# mean (± sd) of x per group
mtcars %>%
group_by(gear) %>%
summarise_at(
vars(mpg, carb),
funs(paste0(round(mean(.), 2), '(±', round(sd(.) / sqrt(n()), 1), ')'))
) %>%
mutate(gear = as.character(gear)) %>%
# add ANOVA: gear ~ x
bind_rows(
c(gear = 'ANOVA',
apply(mtcars %>% select(mpg, carb), 2,
function(x) summary(aov(mtcars$gear ~ x))[[1]]$`F value`[1] %>% round(3) %>% as.character()
))
) %>%
# add Chi-Square: gear ~ x
bind_rows(
c(gear = 'CHI-SQUARE',
apply(mtcars %>% select(mpg, carb), 2,
function(x) chisq.test(table(mtcars$gear, x))$statistic %>% round(3) %>% as.character()
))
)
# # A tibble: 5 x 3
# gear mpg carb
# <chr> <chr> <chr>
# 1 3 16.11(±0.9) 2.67(±0.3)
# 2 4 24.53(±1.5) 2.33(±0.4)
# 3 5 21.38(±3) 4.4(±1.2)
# 4 ANOVA 8.995 2.436
# 5 CHI-SQUARE 54.667 16.518

Using dplyr to create summary proportion table with several categorical/factor variables

I am trying to create one table that summarizes several categorical variables (using frequencies and proportions) by another variable. I would like to do this using the dplyr package.
These previous Stack Overflow discussions have partially what I am looking for:
Relative frequencies / proportions with dplyr and Calculate relative frequency for a certain group.
Using the mtcars dataset, this is what the output would look like if I just wanted to look at the proportion of gear by am category:
mtcars %>%
group_by(am, gear) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
However, I actually want to look at not only the gears by am, but also carb by am and cyl by am, separately, in the same table. If I amend the code to:
mtcars %>%
group_by (am, gear, carb, cyl) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
I get the frequencies for each combination of am, gear, carb, and cyl. Which is not what I want. Is there any way to do this with dplyr?
EDIT
Also, it would be an added bonus if anyone knew of a way to produce the table I want, but with the categories of am as the columns (as in a classic 2x2 table format). Here is an example of what i'm referring to. It is from one of my previous publications. I want to produce this table in R, so that I can output it directly to a word document using RMarkdown:
One way to solve this, is to turn your data to a long(er) format. You can then use the same code to calculate the outcomes you want, with one extra group_by:
library(reshape2)
library(dplyr)
m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))
res <- m_mtcars %>%
group_by(am, variable, value) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
Building on this, the desired output can be obtained using more reshaping and some string formatting
#make an 'export' variable
res$export <- with(res, sprintf("%i (%.1f%%)", n, freq*100))
#reshape again
output <- dcast(variable+value~am, value.var="export", data=res, fill="missing") #use drop=F to prevent silent missings
#'silent missings'
output$variable <- as.character(output$variable)
#make 'empty lines'
empties <- data.frame(variable=unique(output$variable), stringsAsFactors=F)
empties[,colnames(output)[-1]] <- ""
#bind them together
output2 <- rbind(empties,output)
output2 <- output2[order(output2$variable,output2$value),]
#optional: 'remove' variable if value present
output2$variable[output2$value!=""] <- ""
This results in:
variable value 0 1
2 carb
7 1 3 (15.8%) 4 (30.8%)
8 2 6 (31.6%) 4 (30.8%)
9 3 3 (15.8%) missing
10 4 7 (36.8%) 3 (23.1%)
11 6 missing 1 (7.7%)
12 8 missing 1 (7.7%)
3 cyl
13 4 3 (15.8%) 8 (61.5%)
14 6 4 (21.1%) 3 (23.1%)
15 8 12 (63.2%) 2 (15.4%)
1 gear
4 3 15 (78.9%) missing
5 4 4 (21.1%) 8 (61.5%)
6 5 missing 5 (38.5%)
With tidyr/dplyr combination, here is how you would do it:
library(tidyr)
library(dplyr)
mtcars %>%
gather(variable, value, gear, carb, cyl) %>%
group_by(am, variable, value) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
An alternative to group by and then summarize is to use a count().
This just makes the code 1 line more concise
library(reshape2)
library(dplyr)
m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))
res <- m_mtcars %>%
count(am, variable, value) %>%
mutate(freq = n / sum(n))
The other benefit is that this will save the other values that are lost in a group_by summarize. The resulting table looks like this
enter image description here

Resources