R: Tidying and summarising a paired comparison dataset in the tidyverse style - r

I have a dataset with features {a,b,c...} belonging to a pair of players taken form the set {a, b, c}. Each row represents the outcome of a matchup, columns name_1, name_2 represent player names, and all other columns a1, a2, b1, b2, c1, c2, etc.. represent numeric features corresponding to the player in the matchup.
Below is the example of a dataset:
set.seed(17)
df <- tibble(
name_1 = sample(letters[1:3], length(letters), replace = TRUE),
name_2 = sample(letters[1:3], length(letters), replace = TRUE),
a1 = rnorm(length(letters)),
a2 = rnorm(length(letters)),
b1 = rnorm(length(letters)),
b2 = rnorm(length(letters)),
c1 = rnorm(length(letters)),
c2 = rnorm(length(letters))) %>%
filter(!(name_1 == name_2))
What I need is to find a summary statistic for each feature grouped by player. The trouble is that the same player, for example, a, can be located sometimes under name_1, sometimes under name_2, hence his features can be located at feature1 or feature2.
Here is my feeble attempt to do this for one player (namely, a) and one feature (namely, a):
df %>%
mutate(feature_a_joined = case_when(df$name_1 == "a" ~ a1,
df$name_2 == "a" ~ a2)) %>%
summarise(mean = mean(feature_a_joined, na.rm = TRUE))
I am fairly new to R, but the examples that I`ve seen in multiple vignettes refer to more standard datasets. Is there an efficient way to make a summary for each player and each variable?
Update
My expected result would be something like this:
# A tibble: 3 x 4
player feature_a_mean feature_b_mean feature_c_mean
<chr> <dbl> <dbl> <dbl>
1 a -0.330 2.38 0.960
2 b -0.482 1.30 0.207
3 c -0.482 -0.477 -1.71

We can use map. Get the unique column names ('un1') from the data. Loop over those (map), apply the OP's code with case_when and get the mean
library(dplyr)
library(purrr)
library(stringr)
un1 <- unique(str_remove(names(df)[-(1:2)], "\\d+"))
map_dfc(un1, ~
df %>%
summarise(!! str_c('mean_', .x) :=
mean(case_when(name_1 == .x ~ !! rlang::sym(str_c(.x, '1')),
name_2 == .x ~ !! rlang::sym(str_c(.x, '2'))),
na.rm = TRUE)))
-output
# A tibble: 1 x 3
# mean_a mean_b mean_c
# <dbl> <dbl> <dbl>
#1 -0.00673 0.186 -0.0632
Update
Based on the OP's expected output (assuming the output values are placeholders), we reshape the multiple blocks of columns to 'long' format with pivot_longer, do a group by to get the summarise across columns 'a' to 'c'
library(tidyr)
df %>%
pivot_longer(everything(), names_to = c('.value', 'grp'),
names_sep= '(?<=[a-z])_?(?=[0-9])') %>%
group_by(player = name) %>%
summarise(across(a:c, mean, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 3 x 4
# player a b c
# <chr> <dbl> <dbl> <dbl>
#1 a -0.00673 0.197 0.126
#2 b -0.0455 0.186 -0.138
#3 c -0.118 -0.468 -0.0632

Related

How to combine function argument with group_by in R

I would like to use group_by( ) function with my customised function but the column names that goes within group_by would be defined in my function argument.
See a hypothetical example of what my data would look like:
data <- data.frame(ind = rep(c("A", "B", "C"), 4),
gender = rep(c("F", "M"), each = 6),
value = sample(1:100, 12))
And this is the result I would like to have:
result <- data %>%
group_by(ind, gender) %>%
mutate(value = mean(value)) %>%
distinct()
This is how I was trying to make my function to work:
myFunction <- function(data, set_group, variable){
result <- data %>%
group_by(get(set_group)) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
distinct()
}
result3 <- myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
result3
I want to allow that the user define as many set_group as needed and as many variable as needed. I tried using get( ) function, all_of( ) function and mget( ) function within group_by but none worked.
Does anyone know how can I code it?
Thank you!
We could use across within group_by
myFunction <- function(data, set_group, variable){
data %>%
group_by(across(all_of(set_group))) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
ungroup %>%
distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
ind gender value
<chr> <chr> <dbl>
1 A F 43.5
2 B F 87.5
3 C F 67.5
4 A M 13
5 B M 43.5
6 C M 37.5
Another option is to convert to symbols and evaluate (!!!)
myFunction <- function(data, set_group, variable){
data %>%
group_by(!!! rlang::syms(set_group)) %>%
mutate(across(all_of(variable), ~ mean(.x, na.rm = TRUE))) %>%
ungroup %>%
distinct()
}
-testing
> myFunction(data, set_group = c("ind", "gender"), variable = c("value"))
# A tibble: 6 × 3
ind gender value
<chr> <chr> <dbl>
1 A F 43.5
2 B F 87.5
3 C F 67.5
4 A M 13
5 B M 43.5
6 C M 37.5
NOTE: get is used when there is a single object, for multiple objects mget can be used. But, it is better to use tidyverse functions

Perform multiple two-sample t-test using dplyr in R

I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract p-values for every comparison. A shorter representative example of the data, using only 2 variables could be the following;
df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))
Ideally the end result will be a summarised data frame containing four columns; one for the variable being tested (var1, var2 etc.), two for the groups being tested every time and a final one for the p-value.
I've tried duplicating the group column in the long form, and doing a double group_by in order to do the comparisons but with no result
result <- df %>%
pivot_longer(var1:var2, "var", "value") %>%
rename(group_a = group) %>%
mutate(group_b = group_a) %>%
group_by(group_a, group_b) %>%
summarise(n = n())
We can reshape the data into 'long' format with pivot_longer, then grouped by 'group', apply the pairwise.t.test, extract the list elements and transform into tibble with tidy (from broom) and unnest the list column
library(dplyr)
library(tidyr)
library(broom)
df %>%
pivot_longer(cols = -group, names_to = 'grp') %>%
group_by(group) %>%
summarise(out = list(pairwise.t.test(value, grp
) %>%
tidy)) %>%
unnest(c(out))
-output
# A tibble: 3 x 4
group group1 group2 p.value
<int> <chr> <chr> <dbl>
1 1 var2 var1 0.0760
2 2 var2 var1 0.0233
3 3 var2 var1 0.000244
In case you end up wanting more information about the t-tests, here is an approach that will allow you to extract more information such as the degrees of freedom and value of the test statistic:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- tibble(
var1 = rnorm(90, 1, 1),
var2 = rnorm(90, 1.5, 1),
group = rep(1:3, each = 30)
)
df %>%
select(-group) %>%
names() %>%
map_dfr(~ {
y <- .
combn(3, 2) %>%
t() %>%
as.data.frame() %>%
pmap_dfr(function(V1, V2) {
df %>%
select(group, all_of(y)) %>%
filter(group %in% c(V1, V2)) %>%
t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>%
tidy() %>%
transmute(y = y,
group_1 = V1,
group_2 = V2,
df = parameter,
t_value = statistic,
p_value = p.value
)
})
})
#> # A tibble: 6 x 6
#> y group_1 group_2 df t_value p_value
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 var1 1 2 58 -0.337 0.737
#> 2 var1 1 3 58 -1.35 0.183
#> 3 var1 2 3 58 -1.06 0.295
#> 4 var2 1 2 58 -0.152 0.879
#> 5 var2 1 3 58 1.72 0.0908
#> 6 var2 2 3 58 1.67 0.100
And here is #akrun's answer tweaked to give the same p-values as the above approach. Note the p.adjust.method = "none" which gives independent t-tests which will inflate your Type I error rate.
df %>%
pivot_longer(
cols = -group,
names_to = "y"
) %>%
group_by(y) %>%
summarise(
out = list(
tidy(
pairwise.t.test(
value,
group,
p.adjust.method = "none",
pool.sd = FALSE
)
)
)
) %>%
unnest(c(out))
#> # A tibble: 6 x 4
#> y group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 var1 2 1 0.737
#> 2 var1 3 1 0.183
#> 3 var1 3 2 0.295
#> 4 var2 2 1 0.879
#> 5 var2 3 1 0.0909
#> 6 var2 3 2 0.100
Created on 2021-07-30 by the reprex package (v1.0.0)

Un-paired t-test using data within one column

I want to do an unpaired t-test to examine if values differ between sites in each type category.
So my question is, within types (AB or CD), do values (valueA or valueB) differ between sites (A or B)?
Here is an example of my data:
dat <- data.frame(
"site" = c("A","B","B","A","A","B","B","A"),
"type" = c("AB","CD"),
"valueA" = c(13,-10,-5,18,-14,12,-17,19),
"valueB" = c(-3,20,15,-16,12,15,-11,14)
)
dat
site type valueA valueB
A AB 13 -3
B CD -10 20
B AB -5 15
A CD 18 -16
A AB -14 12
B CD 12 15
B AB -17 -11
A CD 19 14
I am trying to do four unpaired t-tests to examine:
If valueA Type AB, differs between site A vs. site B
If valueB Type AB, differs between site A vs. site B
If valueA Type CD, differs between site A vs. site B
If valueB Type CD, differs between site A vs. site B
In order to run the unpaired t-test, I believe I need to re-arrange my data so that type AB and type CB and site A and site B are each a column (instead of being within the type or site column).
EDIT:
Using the suggested code in the comments:
library(dplyr)
d %>%
group_by(site, type) %>%
summarise(pval = t.test(valueA, valueB)$p.value)
The output is this:
site type pval
A AB 0.784
A CD 0.417
B AB 0.492
B CD 0.365
To my understanding, this p-value here is giving me the difference between valueA and valueB.
I am looking for, for example:
The difference between site A and site B of valueA in type CD.
So if I am thinking correctly, the output of the t-test should have a column for type, value A and value B. Then the p-values are for the differences between sites.
Similar to this:
type valueA valueB
AB 0.365 0.784
CD 0.492 0.417
Does this make sense?
We can do a group_by 'site', 'type' and apply the t.test
library(dplyr)
out <- dat %>%
group_by(site, type) %>%
summarise(pval = t.test(valueA, valueB)$p.value)
By default, paired = FALSE in t.test
The output above can be reshaped to 'wide' format with pivot_wider
library(stringr)
library(tidyr)
out %>%
ungroup %>%
mutate(site = str_c('value', site)) %>%
pivot_wider(names_from = site, values_from = pval)
# A tibble: 2 x 3
# type valueA valueB
# <fct> <dbl> <dbl>
#1 AB 0.784 0.492
#2 CD 0.417 0.365
If we want to compare the 'value' columns between 'AB' and 'CD'
dat %>%
group_by(site) %>%
summarise_at(vars(starts_with('value')),
~ t.test(.[type == 'AB'], .[type == 'CD'])$p.value)
# A tibble: 2 x 3
# site valueA valueB
# <fct> <dbl> <dbl>
#1 A 0.393 0.784
#2 B 0.464 0.439
I think I see what you're asking for. See if this works for you:
library(tidyverse)
dat %>%
pivot_longer(cols = c(valueA, valueB), names_to = "name", values_to = "val") %>%
split(.$site) %>%
map(., ~rename(.x, !!sym(paste0(.x$site[[1]], "val")) := val) %>%
select(-site)) %>%
reduce(full_join, by = c("type", "name")) %>%
group_by(type, name) %>%
summarise(p.val = t.test(Aval, Bval)$p.value) %>%
pivot_wider(id_cols = type, names_from = name, values_from = p.val)
#> # A tibble: 2 x 3
#> # Groups: type [2]
#> type valueA valueB
#> <fct> <dbl> <dbl>
#> 1 AB 0.284 0.785
#> 2 CD 0.0703 0.121
Here we go from wide to long, split the dataframe by site. Rename the values of interest to include the site, re-join the dataframe, and then run a grouped t.test by type and and site.

Creating a list from an existing dataframe based on dplyr functions

I currently have a datatframe similar to this one:
df <- tibble("Fam_Name" = c("Architecture", "Arts", "Business", "Managers", "Medicine", "Science"), "Code" = c(1,1,2, 2,3, 3), "Share_2002" = c(0.116, 3.442, 2.445, 1.932, 0.985, 0.321), "Share_2018" = c(0.161, 0.232, 1.234, 0.456, 0.089, 0.06))
I would like to create a list called family which contains three other lists: fam1, fam2, fam3
Each fam(i) list would contain two dataframes called fam_normal and fam_long which are constructed based on dplyr functions, for instance:
fam_normal <- df %>% # I am not sure how to write this so that it is incorporated into the fam(i) list
filter(Code == i) %>%
rename("2002" = Share_2002,
"2018" = Share_2018)
fam_long <- fam_normal %>%
gather(Year, Share, 3:4) %>%
arrange(Fam_Name)
The end goal is to plot a graph for each fam(i) in the fam list where there are Years on the x-axis and Shares on the y-axis.
My real dataset has 25 families and more years.
You could first rename the columns use group_split to split them based on Code and then use map to get list of dataframes.
library(tidyverse)
df %>%
rename("2002" = Share_2002,
"2018" = Share_2018) %>%
group_split(Code) %>%
map(~list(fam_normal = .x, fam_long = .x %>%
gather(Year, Share, 3:4) %>%
arrange(Fam_Name)))
#[[1]]
#[[1]]$fam_normal
# A tibble: 2 x 4
# Fam_Name Code `2002` `2018`
# <chr> <dbl> <dbl> <dbl>
#1 Architecture 1 0.116 0.161
#2 Arts 1 3.44 0.232
#[[1]]$fam_long
# A tibble: 4 x 4
# Fam_Name Code Year Share
# <chr> <dbl> <chr> <dbl>
#1 Architecture 1 2002 0.116
#2 Architecture 1 2018 0.161
#3 Arts 1 2002 3.44
#4 Arts 1 2018 0.232
#....
Here is a base R solution,
dd <- cbind.data.frame(df[1:2], stack(df[-c(1, 2)]))
Map(list, split(df, df$Code), split(dd, dd$Code))
which gives,
$`1`
$`1`[[1]]
# A tibble: 2 x 4
Fam_Name Code Share_2002 Share_2018
<chr> <dbl> <dbl> <dbl>
1 Architecture 1 0.116 0.161
2 Arts 1 3.44 0.232
$`1`[[2]]
Fam_Name Code values ind
1 Architecture 1 0.116 Share_2002
2 Arts 1 3.442 Share_2002
7 Architecture 1 0.161 Share_2018
8 Arts 1 0.232 Share_2018
....
NOTE: You can change column names as per usual
first you can work with the purrr package to work with nested tibbles:
this allows you define the sublists together:
library(tidyverse)
df2 <- df %>%
group_by(Code) %>%
nest(.key = fam_normal) %>%
mutate(fam_long = map(fam_normal, ~gather(.x, Year, Share, -Fam_Name) %>%
arrange(Fam_Name) %>%
mutate(Year = parse_number(Year)))) %>%
unnest(fam_long)
Then you can use ggplot2 to get the plots:
ggplot(df2, aes(x = Year, y = Share, color = Fam_Name)) +
geom_line(size = 2) +
facet_grid(Code~ .)
fam <- list()
fam$normal <- df %>%
filter(Code == i) %>%
rename("2002" = Share_2002,
"2018" = Share_2018)
fam$long <- fam$normal %>%
gather(Year, Share, 3:4) %>%
arrange(Fam_Name)
Now you have a named list fam containing your DFs. Your DFs are so custom that a dplyrsolution may not be as legible as this simple assignment. I am a big fan of tidyverse-style coding but not when it gets in the way of clarity and legibility.
If you want to use this in a pipe, just create a function:
make_families <- function(df) {
# insert code above
# Return `fam`
fam
}`
Then you're done: this will create the list of lists you describe.
df %>%
split(Fam_Name) %>%
purrr::map(make_families)

Multiple paired t-tests on multiple variables simultaneously using dplyr/tidyverse

Assume a data structure like this:
ID testA_wave1 testA_wave2 testA_wave3 testB_wave1 testB_wave2 testB_wave3
1 1 3 2 3 6 5 3
2 2 4 4 4 3 6 6
3 3 10 2 1 4 4 4
4 4 5 3 12 2 7 4
5 5 5 3 9 2 4 2
6 6 10 0 2 6 6 5
7 7 6 8 4 6 8 3
8 8 1 5 4 5 6 0
9 9 3 2 7 8 4 4
10 10 4 9 5 11 8 8
What I want to achieve is to calculate a paired t-test for every test separately (in this case meaning testA and testB, but in real-life I have much more tests). I want to do it that way that I compare the first wave of a given test with every other subsequent wave of the same test (meaning testA_wave1 vs testA_wave2 and testA_wave1 vs testA_wave3 in the case of testA).
This way, I was able to achieve it:
df %>%
gather(variable, value, -ID) %>%
mutate(wave_ID = paste0("wave", parse_number(variable)),
variable = ifelse(grepl("testA", variable), "testA",
ifelse(grepl("testB", variable), "testB", NA_character_))) %>%
group_by(wave_ID, variable) %>%
summarise(value = list(value)) %>%
spread(wave_ID, value) %>%
group_by(variable) %>%
mutate(p_value_w1w2 = t.test(unlist(wave1), unlist(wave2), paired = TRUE)$p.value,
p_value_w1w3 = t.test(unlist(wave1), unlist(wave3), paired = TRUE)$p.value) %>%
select(variable, matches("(p_value)"))
variable p_value_w1w2 p_value_w1w3
<chr> <dbl> <dbl>
1 testA 0.664 0.921
2 testB 0.146 0.418
However, I would like to see different/more elegant solutions that give similar results. I'm looking mostly for dplyr/tidyverse solutions, but if there is a completely different way to achieve it, I'm not against it.
Sample data:
set.seed(123)
df <- data.frame(ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0))
Since dplyr 0.8.0 we can use group_split to split a dataframe into list of dataframes.
We gather the dataframe and convert it into long format and then separate the names of the column (key) into different columns (test and wave). We then use group_split to split the dataframe into list based on test column. For every dataframe in the list we spread it into wide format and then calculate the t.test values and rbind them into one dataframe using map_dfr.
library(tidyverse)
df %>%
gather(key, value, -ID) %>%
separate(key, c("test", "wave")) %>%
group_split(test) %>% #Previously we had to do split(.$test) here
map_dfr(. %>%
spread(wave, value) %>%
summarise(test = first(test),
p_value_w1w2 = t.test(wave1, wave2, paired = TRUE)$p.value,
p_value_w1w3 = t.test(wave1, wave3, paired = TRUE)$p.value))
# A tibble: 2 x 3
# test p_value_w1w2 p_value_w1w3
# <chr> <dbl> <dbl>
#1 testA 0.664 0.921
#2 testB 0.146 0.418
We manually perform the t-test above as there were only 2 values which needed to be calculated. If there are more number of wave... columns then this could become cumbersome. In such cases we could do
df %>%
gather(key, value, -ID) %>%
separate(key, c("test", "wave")) %>%
group_split(test) %>%
map_dfr(function(data)
data %>%
spread(wave, value) %>%
summarise_at(vars(setdiff(unique(data$wave), "wave1")),
function(x) t.test(.$wave1, x, paired = TRUE)$p.value) %>%
mutate(test = first(data$test)))
# wave2 wave3 test
# <dbl> <dbl> <chr>
#1 0.664 0.921 testA
#2 0.146 0.418 testB
Here it will perform the t-test for every "wave.." column with "wave1" column.
Since you are also open to other solutions, here is an attempt with purely base R solution
sapply(split.default(df[-1], sub("_.*", "", names(df[-1]))), function(x)
c(p_value_w1w2 = t.test(x[[1]], x[[2]],paired = TRUE)$p.value,
p_value_w1w3 = t.test(x[[1]], x[[3]],paired = TRUE)$p.value))
# testA testB
#p_value_w1w2 0.6642769 0.1456059
#p_value_w1w3 0.9209554 0.4184603
We split the columns based on test* and create a list of dataframes and apply t.test on different combinations of columns for each dataframe.
Update 03/16/2022
The tidyverse has evolved and so should this solution.
First I make a simplifying assumption: If we designed the experiment, then we know what the groups are and how many waves we followed them through. If we don't know, then we can extract this information from the column names. See at below.
library("broom")
library("tidyverse")
tests <- c("A", "B")
waves <- 3
comparisons <-
list(
test = tests,
first = 1,
later = seq(2, waves)
) %>%
cross_df()
comparisons
#> # A tibble: 4 × 3
#> test first later
#> <chr> <dbl> <int>
#> 1 A 1 2
#> 2 B 1 2
#> 3 A 1 3
#> 4 B 1 3
Transform the data from wide format to long format.
data <- df %>%
pivot_longer(
-ID,
names_to = "test_wave"
) %>%
extract(
test_wave, c("test", "wave"),
regex = "test(.+)_wave(.+)",
convert = TRUE
)
Then pair the comparisons we want to make with the data we collected. I've added lots of rename statements to make for more readable code but it's not strictly necessary.
comparisons %>%
inner_join(
data,
by = c("test", "first" = "wave")
) %>%
rename(
value.first = value
) %>%
inner_join(
data,
by = c("test", "later" = "wave", "ID")
) %>%
rename(
value.later = value
) %>%
group_by(
test, first, later
) %>%
group_modify(
~ tidy(t.test(.x$value.first, .x$value.later, paired = TRUE))
) %>%
ungroup() %>%
pivot_wider(
id_cols = test,
names_from = later,
names_glue = "wave1_vs_wave{later}",
values_from = p.value
)
#> # A tibble: 2 × 3
#> test wave1_vs_wave2 wave1_vs_wave3
#> <chr> <dbl> <dbl>
#> 1 A 0.664 0.921
#> 2 B 0.146 0.418
Appendix: Extract test names and number of waves from column names.
design <- df %>%
select(starts_with("test")) %>%
colnames() %>%
str_match("test(.+)_wave(.+)")
tests <- unique(design[, 2])
waves <- max(as.integer(design[, 3]))
Created on 2022-03-16 by the reprex package (v2.0.1)
Old solution
Here is one way to do it, using purrr quite a bit.
library("tidyverse")
set.seed(123)
df <- tibble(
ID = 1:20,
testA_wave1 = round(rnorm(20, 5, 3), 0),
testA_wave2 = round(rnorm(20, 5, 3), 0),
testA_wave3 = round(rnorm(20, 5, 3), 0),
testB_wave1 = round(rnorm(20, 5, 3), 0),
testB_wave2 = round(rnorm(20, 5, 3), 0),
testB_wave3 = round(rnorm(20, 5, 3), 0)
)
pvalues <- df %>%
# From wide tibble to long tibble
gather(test, value, -ID) %>%
separate(test, c("test", "wave")) %>%
# Not stricly necessary; will order the waves alphabetically instead
mutate(wave = parse_number(wave)) %>%
inner_join(., ., by = c("ID", "test")) %>%
# If there are two waves w1 and w2,
# we end up with pairs (w1, w1), (w1, w2), (w2, w1) and (w2, w2),
# so filter out to keep the pairing (w1, w2) only
filter(wave.x == 1, wave.x < wave.y) %>%
nest(ID, value.x, value.y) %>%
mutate(pvalue = data %>%
# Perform the test
map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
map(broom::tidy) %>%
# Also not strictly necessary; you might want to keep all
# information about the test: estimate, statistic, etc.
map_dbl(pluck, "p.value"))
pvalues
#> # A tibble: 4 x 5
#> test wave.x wave.y data pvalue
#> <chr> <dbl> <dbl> <list> <dbl>
#> 1 testA 1 2 <tibble [20 x 3]> 0.664
#> 2 testA 1 3 <tibble [20 x 3]> 0.921
#> 3 testB 1 2 <tibble [20 x 3]> 0.146
#> 4 testB 1 3 <tibble [20 x 3]> 0.418
pvalues %>%
# Drop the data in order to pivot the table
select(- data) %>%
unite("waves", wave.x, wave.y, sep = ":") %>%
spread(waves, pvalue)
#> # A tibble: 2 x 3
#> test `1:2` `1:3`
#> <chr> <dbl> <dbl>
#> 1 testA 0.664 0.921
#> 2 testB 0.146 0.418
Created on 2019-03-08 by the reprex package (v0.2.1)
To throw in a data.table solution:
library(stringr)
library(data.table)
library(magrittr) ## for the pipe operator
dt_sol <- function(df) {
## create patterns for the melt operation:
## all columns from the same wave should go in one column
grps <- str_extract(names(df)[-1],
"[0-9]+$") %>%
unique() %>%
paste0("wave", ., "$")
grp_names <- sub("\\$", "", grps)
## melt the data table: all test*_wave_i data go into column wave_i
df.m <- melt(df,
measure = patterns(grps),
value.name = grp_names,
variable.name = "test")
## define the names for the new column, we want to extract estimate and p.value
new_cols <- c(outer(c("p.value", "estimate"),
grp_names[-1],
paste, sep = "_"))
## use lapply on .SD which equals to all wave_i columns but the first one
## return estimate and p.value
df.m[,
setNames(unlist(lapply(.SD,
function(col) {
t.test(wave1, col, paired = TRUE)[c("p.value", "estimate")]
}), recursive = FALSE), new_cols),
test, ## group by each test
.SDcols = grp_names[-1]]
}
dt <- copy(df)
setDT(dt)
dt_sol(dt)
# test p.value_wave2 estimate_wave2 p.value_wave3 estimate_wave3
# 1: 1 0.6642769 0.40 0.9209554 -0.1
# 2: 2 0.1456059 -1.45 0.4184603 0.7
Benchmark
Comparing the data.table solution to the tidyverse solution we get an 3-fold speed increase with teh data.tablesolution:
dp_sol <- function(df) {
df %>%
gather(test, value, -ID) %>%
separate(test, c("test", "wave")) %>%
inner_join(., ., by = c("ID", "test")) %>%
filter(wave.x == 1, wave.x < wave.y) %>%
nest(ID, value.x, value.y) %>%
mutate(pvalue = data %>%
map(~t.test(.$value.x, .$value.y, paired = TRUE)) %>%
map(broom::tidy) %>%
map_dbl(pluck, "p.value"))
}
library(microbenchmark)
microbenchmark(dplyr = dp_sol(df),
data.table = dt_sol(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dplyr 6.119273 6.897456 7.639569 7.348364 7.996607 14.938182 100 b
# data.table 1.902547 2.307395 2.790910 2.758789 3.133091 4.923153 100 a
With a slightly bigger input:
make_df <- function(nr_tests = 2,
nr_waves = 3,
n_per_wave = 20) {
mat <- cbind(seq(1, n_per_wave),
matrix(round(rnorm(nr_tests * nr_waves * n_per_wave), 0),
nrow = n_per_wave))
c_names <- c(outer(1:nr_waves, 1:nr_tests, function(w, t) glue::glue("test{t}_wave{w}")))
colnames(mat) <- c("ID", c_names)
as.data.frame(mat)
}
df2 <- make_df(100, 100, 10)
dt2 <- copy(df2)
setDT(dt2)
microbenchmark(dplyr = dp_sol(df2),
data.table = dt_sol(dt2)
# Unit: seconds
# expr min lq mean median uq max neval cld
# dplyr 3.469837 3.669819 3.877548 3.821475 3.984518 5.268596 100 b
# data.table 1.018939 1.126244 1.193548 1.173175 1.252855 1.743075 100 a
Using all combinations without replacement:
Just for testA group:
comb <- arrangements::combinations(names(df)[grep("testA",names(df))], k = 2,n = 3,replace = F )
tTest <- function(x, data = df){
ttest <- t.test(x =data[x[1]] , y = data[x[2]])
return(data.frame(var1 = x[1],
var2 = x[2],
t = ttest[["statistic"]][["t"]],
pvalue = ttest[["p.value"]]))
}
result <- apply(comb, 1, tTest, data = df)
Result:
dplyr::bind_rows(result)
var1 var2 t pvalue
1 testA_wave1 testA_wave2 0.5009236 0.6193176
2 testA_wave1 testA_wave3 -0.6426433 0.5243146
3 testA_wave2 testA_wave3 -1.1564854 0.2547069
For all groups:
comb <- arrangements::combinations(x = names(df)[-1], k = 2,n = 6, replace = F )
result <- apply(comb, 1, tTest, data = df)
Result:
dplyr::bind_rows(result)
var1 var2 t pvalue
1 testA_wave1 testA_wave2 0.5009236 0.6193176
2 testA_wave1 testA_wave3 -0.6426433 0.5243146
3 testA_wave1 testB_wave1 0.4199215 0.6769510
4 testA_wave1 testB_wave2 -0.3447992 0.7321465
5 testA_wave1 testB_wave3 0.0000000 1.0000000
6 testA_wave2 testA_wave3 -1.1564854 0.2547069
7 testA_wave2 testB_wave1 -0.1070172 0.9153442
8 testA_wave2 testB_wave2 -0.8516264 0.3997630
9 testA_wave2 testB_wave3 -0.5640491 0.5762010
10 testA_wave3 testB_wave1 1.1068781 0.2754186
11 testA_wave3 testB_wave2 0.2966237 0.7683692
12 testA_wave3 testB_wave3 0.7211103 0.4755291
13 testB_wave1 testB_wave2 -0.7874100 0.4360152
14 testB_wave1 testB_wave3 -0.4791735 0.6346043
15 testB_wave2 testB_wave3 0.3865414 0.7013933
To throw another, somewhat more concise, data.table solution into the mix, in which we melt the data into long format:
setDT(df)
x = melt(df[,-1])[, tname := sub('_.+','',variable)][, wave := sub('.+_','',variable)]
x[wave != 'wave1', .(p.value =
t.test(x[tname==test & wave == 'wave1', value], value, paired = TRUE)$p.value),
by = .(test=tname,wave)]
# test wave p.value
# 1: testA wave2 0.6642769
# 2: testA wave3 0.9209554
# 3: testB wave2 0.1456059
# 4: testB wave3 0.4184603

Resources