ANOVA F statistic value in dplyr group_by

I can summarize the mean by groups using:
t(mtcars %>%
    group_by(gear) %>%
    dplyr::summarize(Mean_Mpg = mean(mpg, na.rm = TRUE),
                     StdD_Mpg = sd(mpg, na.rm = TRUE)))
gear       3          4          5
Mean_Mpg   16.106667  24.533333  21.380000
StdD_Mpg    3.371618   5.276764   6.658979
I know summary(aov(gear ~ mpg, mtcars)) will output the results of the ANOVA test, including the F statistic.
            Df Sum Sq Mean Sq F value Pr(>F)
mpg          1  3.893   3.893   8.995 0.0054 **
Residuals   30 12.982   0.433
Also, chisq.test(table(mtcars$gear, mtcars$carb)) will output the results of the chi-squared test.
Pearson's Chi-squared test
X-squared = 16.518, df = 10, p-value = 0.08573
What I am trying to do is produce an output like the one below, combining the mean, the standard deviation, the F statistic from the ANOVA, and the X-squared statistic from the chi-squared test.
gear         3           4           5          Test-Statistic  Test
Mpg (Mean)   16.106667   24.533333   21.380000  8.995           ANOVA
    (StdD)    3.371618    5.276764    6.658979
Carb (N)                                        16.518          Chi.Square
             3           4           0
             4           4           2
             3           0           0
             5           4           1
             0           0           1
             0           0           1
I am not sure how to put together a table like this by combining the mean, standard deviation, F statistic, chi-squared statistic, etc. I would welcome any help from the community on formatting the results like this.

One option is to think about all the results you want and how to manipulate each of them into the same structure; then use bind_rows(), for instance, to gather all the results into one table.
The functions group_by() and summarise() can compute the mean (and other statistics) for several variables at once, and the result is a data frame, whereas apply() lets you apply the same function, or a combination of functions (like summary(aov(...))), to several variables; the result of the latter is a vector.
library(tidyverse)

# mean (± standard error) of each variable per group
mtcars %>%
  group_by(gear) %>%
  summarise_at(
    vars(mpg, carb),
    funs(paste0(round(mean(.), 2), '(±', round(sd(.) / sqrt(n()), 1), ')'))
  ) %>%
  mutate(gear = as.character(gear)) %>%
  # add ANOVA: gear ~ x
  bind_rows(
    c(gear = 'ANOVA',
      apply(mtcars %>% select(mpg, carb), 2,
            function(x) summary(aov(mtcars$gear ~ x))[[1]]$`F value`[1] %>%
              round(3) %>% as.character()))
  ) %>%
  # add Chi-Square: gear ~ x
  bind_rows(
    c(gear = 'CHI-SQUARE',
      apply(mtcars %>% select(mpg, carb), 2,
            function(x) chisq.test(table(mtcars$gear, x))$statistic %>%
              round(3) %>% as.character()))
  )
# A tibble: 5 x 3
#   gear       mpg          carb
#   <chr>      <chr>        <chr>
# 1 3          16.11(±0.9)  2.67(±0.3)
# 2 4          24.53(±1.5)  2.33(±0.4)
# 3 5          21.38(±3)    4.4(±1.2)
# 4 ANOVA      8.995        2.436
# 5 CHI-SQUARE 54.667       16.518
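Note that funs() has since been deprecated in dplyr. As a minimal sketch of the same idea (assuming dplyr 1.0 or later), the table can be built with across(); the helper vectors f_stats and chi_stats are introduced here purely for illustration.
library(dplyr)

# F statistics (gear ~ x) and chi-squared statistics (gear vs x) per column
f_stats <- sapply(mtcars[c("mpg", "carb")],
                  function(x) summary(aov(mtcars$gear ~ x))[[1]][["F value"]][1])
chi_stats <- sapply(mtcars[c("mpg", "carb")],
                    function(x) unname(chisq.test(table(mtcars$gear, x))$statistic))

mtcars %>%
  group_by(gear) %>%
  summarise(across(c(mpg, carb),
                   ~ paste0(round(mean(.x), 2), '(±', round(sd(.x) / sqrt(n()), 1), ')'))) %>%
  mutate(gear = as.character(gear)) %>%
  bind_rows(
    tibble(gear = 'ANOVA',
           mpg  = as.character(round(f_stats[['mpg']], 3)),
           carb = as.character(round(f_stats[['carb']], 3))),
    tibble(gear = 'CHI-SQUARE',
           mpg  = as.character(round(chi_stats[['mpg']], 3)),
           carb = as.character(round(chi_stats[['carb']], 3)))
  )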

Related

Running Levene's test for each column of a df in R

I have a data frame containing scores of several sub-scales of the same test (columns: participant, session, group, total score, one column per sub-scale). I am trying to run assumption checks for a two-way mixed ANOVA for each sub-scale. For convenience, I would like to write one loop per assumption check, that gives me the output for all sub-scales. This worked well for checking outliers, running Box's M test and for generating the actual ANOVA output. However, I get an error when trying the same thing with Levene's test. See code and errors below:
subscales <- c("awareness", "clarity", "impulse", "goals", "nonacceptance",
               "strategies")  # these correspond to the column names in the df

for (scale in subscales) {
  ders %>%
    group_by(session) %>%
    levene_test(scale ~ group) %>%
    kable(caption = scale) %>%
    print()
}
Error in mutate(., data = map(.data$data, .f, ...)) :
Caused by error in model.frame.default():
! variable lengths differ (found for 'group')
How can I run Levene's test for all columns in my df without just repeating the same code over and over? I'm new to R, so maybe I'm approaching this in too Pythonic a way and should use something like lapply() instead?
Create the formula with reformulate(): scale will be a quoted string, so the formula needs to be constructed with either reformulate() or paste().
for (scale in subscales) {
  ders %>%
    group_by(session) %>%
    levene_test(reformulate('group', response = scale)) %>%
    kable(caption = scale) %>%
    print()
}
This may also be done with across():
library(dplyr)
library(stringr)
library(tidyr)
library(rstatix)

data(mtcars)

mtcars %>%
  mutate(carb = factor(carb)) %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp),
                   ~ levene_test(cur_data(),
                                 reformulate('carb', response = cur_column())) %>%
                     rename_with(~ str_c(cur_column(), .x), everything()))) %>%
  unpack(where(is.tibble))
Output:
# A tibble: 3 × 9
    cyl mpgdf1 mpgdf2 mpgstatistic  mpgp dispdf1 dispdf2 dispstatistic    dispp
  <dbl>  <int>  <int>        <dbl> <dbl>   <int>   <int>         <dbl>    <dbl>
1     4      1      9        0.975 0.349       1       9      1.32e- 1 7.24e- 1
2     6      2      4        2.52  0.196       2       4      7.44e+29 7.23e-60
3     8      3     10        1.60  0.251       3      10      1.18e+ 1 1.27e- 3
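And to the asker's side question: yes, the first approach also translates directly to lapply(). A minimal sketch, assuming the ders data frame and the subscales vector from the question:
library(dplyr)
library(rstatix)
library(knitr)

# loop over the sub-scale names, build each formula with reformulate(),
# then row-bind everything into one table
results <- lapply(subscales, function(scale) {
  ders %>%
    group_by(session) %>%
    levene_test(reformulate('group', response = scale)) %>%
    mutate(subscale = scale, .before = 1)
})
bind_rows(results)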

Extract restricted regression coefficients without using tidy

I'm using the restriktor package to perform restricted regressions, and at the same time I'm running the restricted regressions by group using dplyr. To extract the coefficients and format them into a nice panel, I would normally use tidy() from broom, but tidy() doesn't work on restriktor objects, so I'm not sure how to go about extracting the coefficients:
library(restriktor)
library(dplyr)

reg =
  mtcars %>%
  group_by(cyl) %>%
  do(model = restriktor(lm(mpg ~ wt + hp, data = .), constraints = ' wt < -4 '))
I would like b.restr, which contains the restricted model coefficients, to be extracted for each group and formatted together into a panel. Normally I would use the following:
reg =
  mtcars %>%
  group_by(cyl) %>%
  do({
    model = restriktor(lm(mpg ~ wt + hp, data = .), constraints = ' wt < -4 ')  # create your model
    data.frame(tidy(model),    # get coefficient info
               glance(model))
  })
But I get the following error:
Error: No tidy method for objects of class restriktor
All I want is to extract the following elements from the lists and put them together, with their group identifier, in one panel format:
reg[[2]][[1]][["b.restr"]]
Use group_modify() (which is now preferred over do()) together with coef()/as.list()/as_tibble().
library(dplyr)
library(restriktor)

# coefficients and R2's, or NAs if there are too few rows for restriktor
co <- function(fo, data) {
  fm <- lm(fo, data)
  coef_lm <- coef(fm)
  min_rows <- length(coef_lm)
  if (nrow(data) <= min_rows) {
    NA * c(coef_lm, R2.org = NA, R2.reduced = NA)
  } else {
    r <- restriktor(fm, constraints = ' wt < -4 ')
    c(coef(r), R2.org = r$R2.org, R2.reduced = r$R2.reduced)
  }
}

mtcars %>%
  group_by(cyl) %>%
  group_modify(~ {
    .x %>%
      co(mpg ~ wt + hp, .) %>%
      as.list %>%
      as_tibble
  }) %>%
  ungroup
giving:
# A tibble: 3 x 6
    cyl `(Intercept)`    wt      hp R2.org R2.reduced
  <dbl>         <dbl> <dbl>   <dbl>  <dbl>      <dbl>
1     4          45.8 -5.12 -0.0905  0.681      0.681
2     6          35.3 -4    -0.0256  0.589      0.667
3     8          33.9 -4    -0.0132  0.497      0.652
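If you would rather keep the original do() result and simply pull out b.restr, here is a minimal sketch, assuming the reg object built in the question and that restriktor succeeded for every group:
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# pull b.restr out of each fitted model in `reg`, keep the group identifier
reg %>%
  ungroup() %>%
  mutate(coefs = map(model, ~ as_tibble_row(.x$b.restr))) %>%
  select(cyl, coefs) %>%
  unnest(coefs)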

Is it possible to remove NAs for each pairwise comparison in correlation() function of correlation package in R?

I'd like to know how to remove NAs for pairwise comparisons in the correlation() function of the correlation package in R. Other alternatives are welcome. I'm aware of rcorr() from the Hmisc package, but I need the output in long (tidy) format.
This would be the equivalent of cor(x, use = 'pairwise.complete.obs').
As I need both the p value and the estimate, cor() is not suitable, and unfortunately cor.test() doesn't accept use = 'pairwise.complete.obs' as a parameter.
Specifically, given the size of the data, cor.test(x, na.action = 'na.omit') removes far too many entries from the Pearson correlation analysis, which is why I want the removal to happen per pairwise comparison rather than across the entire dataset.
OK, so just for the fun of it: the corrr package gives you some nice tidy-data options, i.e. you can get the correlations (a) in a tidy format and (b) in a long format. It can also give you the number of pairwise-complete observations (pair_n()).
From there it is relatively easy to (a) calculate the t value for each correlation being different from zero and (b) the corresponding p value. Note that in my comment above I assumed you wanted to compare two correlations; however, I think you just want the normal p value of each correlation.
1. Create a toy data set with missing values:
set.seed(1)

mtcars_NA <- mtcars %>%
  mutate(across(everything(), ~ if_else(row_number() %in% sample(1:32, 5), NA_real_, .)))
2. Calculate the correlations, append the pairwise sample sizes, and get the t/p values:
library(tidyverse)
library(corrr)

mtcars_NA %>%
  correlate() %>%
  shave() %>%
  stretch() %>%
  filter(!is.na(r)) %>%
  left_join(mtcars_NA %>%
              pair_n() %>%
              as.data.frame() %>%
              rownames_to_column("x") %>%
              pivot_longer(-x,
                           values_to = "n",
                           names_to = "y"),
            by = c("x", "y")) %>%
  mutate(t_value = r / sqrt((1 - r^2) / (n - 2)),
         p_value = 2 * pt(q = abs(t_value), df = n - 2, lower.tail = FALSE))
which gives:
# A tibble: 55 x 6
   x     y          r     n t_value      p_value
   <chr> <chr>  <dbl> <dbl>   <dbl>        <dbl>
 1 mpg   cyl   -0.851    22   -7.23 0.000000534
 2 mpg   disp  -0.864    23   -7.87 0.000000107
 3 mpg   hp    -0.785    23   -5.80 0.00000929
 4 mpg   drat   0.684    22    4.19 0.000449
 5 mpg   wt    -0.882    24   -8.78 0.0000000122
 6 mpg   qsec   0.434    23    2.21 0.0385
 7 mpg   vs     0.742    23    5.07 0.0000511
 8 mpg   am     0.549    22    2.94 0.00814
 9 mpg   gear   0.476    23    2.48 0.0218
10 mpg   carb  -0.640    23   -3.81 0.00101
# ... with 45 more rows
3. Let's compare the first correlation to the cor.test() function:
cor.test(mtcars_NA$cyl, mtcars_NA$mpg)
which gives:
Pearson's product-moment correlation
data: mtcars_NA$cyl and mtcars_NA$mpg
t = -7.2326, df = 20, p-value = 5.337e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9363704 -0.6687359
sample estimates:
cor
-0.8505393
So that's the same result.
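For completeness, since the question mentions Hmisc::rcorr(): it already uses pairwise-complete observations, so all that is missing is a reshape to long format. A minimal sketch under that assumption; to_long() is just a small helper defined here, and mtcars_NA is the toy data from above:
library(Hmisc)
library(dplyr)
library(tidyr)
library(tibble)

rc <- rcorr(as.matrix(mtcars_NA))  # list with $r, $n and $P matrices

# small helper: square matrix -> long data frame
to_long <- function(m, value_name) {
  as.data.frame(m) %>%
    rownames_to_column("x") %>%
    pivot_longer(-x, names_to = "y", values_to = value_name)
}

to_long(rc$r, "r") %>%
  left_join(to_long(rc$n, "n"), by = c("x", "y")) %>%
  left_join(to_long(rc$P, "p_value"), by = c("x", "y")) %>%
  filter(x < y)  # keep one triangle, drop the diagonal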

Executing a statistical test across multiple subsets using purrr map

I'm trying to use purrr's map functionality to create a number of sub-groups from a data frame so that I can run a statistical test on each sub-group. Using mtcars as a sample data set, I can determine the set of unique carb values from:
mtcars %>% {unique(.$carb)}
gives [1] 4 1 2 3 6 8
Similarly, the set of unique gear values:
mtcars %>% {unique(.$gear)}
gives [1] 4 3 5
I'd like to iterate through the unique combinations of carb and gear and use this as a way to subset values within mtcars, so that I can perform a statistical test on each subset (as indexed by gear and carb). So the test would be:
data_subset %>% kruskal.test(.$mpg, .$am, data = .)
I've tried to do this using map from purrr. Something along the lines of:
library(purrr)

mtcars %>%
  {unique(.$carb)} %>%
  map2(mtcars, ~ filter(.y, am == .x))
For most combinations of carb/gear in mtcars, there is only one value of am. From my limited understanding of the help pages and error messages, you need multiple groups (am in your example) to run the test.
library(tidyverse)

# Step 1 - limit to testable data
count(mtcars, carb, gear, am) %>%
  count(carb, gear) %>%   # count am possibilities within each carb/gear group
  filter(n > 1) %>%
  left_join(mtcars) -> mtcars_mult_am

# Step 2 - nest, map each group to the test, unnest
mtcars_mult_am %>%
  nest(data = -c(carb, gear)) %>%
  mutate(kruskal_raw = map(data, ~ kruskal.test(.x$mpg, .x$am)),
         kruskal = map(kruskal_raw, broom::tidy)) %>%
  select(-data) %>%
  unnest(kruskal)
# A tibble: 2 x 7
   carb  gear kruskal_raw statistic p.value parameter method
  <dbl> <dbl> <list>          <dbl>   <dbl>     <int> <chr>
1     2     4 <S3: htest>      0      1             1 Kruskal-Wallis rank sum test
2     4     4 <S3: htest>      2.67   0.102         1 Kruskal-Wallis rank sum test
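The nesting step can also be replaced by group_modify(), which some may find easier to read. A minimal sketch using the same mtcars_mult_am object from step 1:
library(dplyr)
library(broom)

# run the same test per carb/gear group, returning one tidied row per group
mtcars_mult_am %>%
  group_by(carb, gear) %>%
  group_modify(~ tidy(kruskal.test(.x$mpg, .x$am))) %>%
  ungroup()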

Pass a list of variable names to a function using {{foo}}

Problem
I would like to know how to pass a list of variable names to a purrr::map2 function for the purpose of iterating over a separate data frame.
The input_table$key variable below contains mpg and disp from the mtcars dataset. I think the names of the variables are being passed as character strings rather than as variable names. The question is how I can change that so that my function recognises them as variable names.
In this example I am trying to sum all of the values in the mtcars variables mpg and disp that fall below a set of numeric thresholds. Those variables from mtcars and the relevant thresholds are contained in input_table (below).
Ideal result
  percentile key   value sum_y
  <fct>      <chr> <dbl> <dbl>
1 0.5        mpg    19.2 266.5
2 0.9        mpg    30.1 515.8
3 0.99       mpg    33.4 609.0
4 1          mpg    33.9 642.9
5 ...        ...     ...   ...
Attempt
library(dplyr)
library(purrr)
library(tidyr)

# Arrange a generic example
# replicating my data structure
input_table <- mtcars %>%
  as_tibble() %>%
  select(mpg, disp) %>%
  map_df(quantile, probs = c(0.5, 0.90, 0.99, 1)) %>%
  mutate(
    percentile = factor(c(0.5, 0.90, 0.99, 1))
  ) %>%
  select(
    percentile, mpg, disp
  ) %>%
  gather(key, value, -percentile)

# Defining the function
test_func <- function(label_desc, threshold) {
  mtcars %>%
    select({{label_desc}}) %>%
    filter({{label_desc}} <= {{threshold}}) %>%
    summarise(
      sum_y = sum(as.numeric({{label_desc}}), na.rm = TRUE)
    )
}

# Demo'ing that it works for a single variable and threshold value
test_func(label_desc = mpg, threshold = 19.2)

# This is where I am having trouble:
# trying to iterate over multiple (mpg, disp) variables
map2(input_table$key, input_table$value, ~ test_func(label_desc = .x, threshold = .y))
The issue is that curly-curly ({{ }}) is meant for unquoted variables, as in your first attempt. In your second attempt you are passing quoted variable names, for which the curly-curly operator does not work. A simple fix is to use the _at variants of dplyr, which accept quoted arguments.
test_func <- function(label_desc, threshold) {
  mtcars %>%
    filter_at(label_desc, any_vars(. <= threshold)) %>%
    summarise_at(label_desc, sum)
}

purrr::map2(input_table$key, input_table$value, test_func)
#[[1]]
# mpg
#1 266.5
#[[2]]
# mpg
#1 515.8
#[[3]]
# mpg
#1 609
#[[4]]
# mpg
#1 642.9
#[[5]]
# disp
#1 1956.7
#.....
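Note that filter_at()/summarise_at() are superseded in current dplyr. Assuming dplyr 1.0 or later, a sketch of the same function using .data[[ ]] and all_of(), which also take the column name as a string:
library(dplyr)
library(purrr)

# same behaviour, written with string column names via .data[[ ]] and all_of()
test_func <- function(label_desc, threshold) {
  mtcars %>%
    filter(.data[[label_desc]] <= threshold) %>%
    summarise(across(all_of(label_desc), sum))
}

purrr::map2(input_table$key, input_table$value, test_func)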
