I want to do an unpaired t-test to examine if values differ between sites in each type category.
So my question is, within types (AB or CD), do values (valueA or valueB) differ between sites (A or B)?
Here is an example of my data:
dat <- data.frame(
"site" = c("A","B","B","A","A","B","B","A"),
"type" = c("AB","CD"),
"valueA" = c(13,-10,-5,18,-14,12,-17,19),
"valueB" = c(-3,20,15,-16,12,15,-11,14)
)
dat
site type valueA valueB
A AB 13 -3
B CD -10 20
B AB -5 15
A CD 18 -16
A AB -14 12
B CD 12 15
B AB -17 -11
A CD 19 14
I am trying to do four unpaired t-tests to examine:
If valueA Type AB, differs between site A vs. site B
If valueB Type AB, differs between site A vs. site B
If valueA Type CD, differs between site A vs. site B
If valueB Type CD, differs between site A vs. site B
In order to run the unpaired t-test, I believe I need to re-arrange my data so that type AB and type CD and site A and site B are each a column (instead of being within the type or site column).
EDIT:
Using the suggested code in the comments:
library(dplyr)
dat %>%
group_by(site, type) %>%
summarise(pval = t.test(valueA, valueB)$p.value)
The output is this:
site type pval
A AB 0.784
A CD 0.417
B AB 0.492
B CD 0.365
To my understanding, this p-value here is giving me the difference between valueA and valueB.
I am looking for, for example:
The difference between site A and site B of valueA in type CD.
So if I am thinking correctly, the output of the t-test should have a column for type, value A and value B. Then the p-values are for the differences between sites.
Similar to this:
type valueA valueB
AB 0.365 0.784
CD 0.492 0.417
Does this make sense?
We can do a group_by 'site', 'type' and apply the t.test
library(dplyr)
out <- dat %>%
group_by(site, type) %>%
summarise(pval = t.test(valueA, valueB)$p.value)
By default, paired = FALSE in t.test
The output above can be reshaped to 'wide' format with pivot_wider
library(stringr)
library(tidyr)
out %>%
ungroup %>%
mutate(site = str_c('value', site)) %>%
pivot_wider(names_from = site, values_from = pval)
# A tibble: 2 x 3
# type valueA valueB
# <fct> <dbl> <dbl>
#1 AB 0.784 0.492
#2 CD 0.417 0.365
If we want to compare the 'value' columns between 'AB' and 'CD'
dat %>%
group_by(site) %>%
summarise_at(vars(starts_with('value')),
~ t.test(.[type == 'AB'], .[type == 'CD'])$p.value)
# A tibble: 2 x 3
# site valueA valueB
# <fct> <dbl> <dbl>
#1 A 0.393 0.784
#2 B 0.464 0.439
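To double-check the paired = FALSE default mentioned above, here is a minimal sketch. The vectors `x` and `y` are made up for illustration and are not the question's data. The `method` element of the result reports which test was actually run:

```r
# Hypothetical example vectors (not taken from the question's data)
x <- c(13, -14, 18, 19)
y <- c(-5, -17, -10, 12)

# Default call: unpaired, and var.equal = FALSE, so this is a Welch test
unpaired <- t.test(x, y)

# Explicitly paired version, for comparison
paired <- t.test(x, y, paired = TRUE)

unpaired$method  # "Welch Two Sample t-test"
paired$method    # "Paired t-test"
```

So the grouped calls above run Welch's unequal-variance test unless var.equal = TRUE is passed.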
I think I see what you're asking for. See if this works for you:
library(tidyverse)
dat %>%
pivot_longer(cols = c(valueA, valueB), names_to = "name", values_to = "val") %>%
split(.$site) %>%
map(., ~rename(.x, !!sym(paste0(.x$site[[1]], "val")) := val) %>%
select(-site)) %>%
reduce(full_join, by = c("type", "name")) %>%
group_by(type, name) %>%
summarise(p.val = t.test(Aval, Bval)$p.value) %>%
pivot_wider(id_cols = type, names_from = name, values_from = p.val)
#> # A tibble: 2 x 3
#> # Groups: type [2]
#> type valueA valueB
#> <fct> <dbl> <dbl>
#> 1 AB 0.284 0.785
#> 2 CD 0.0703 0.121
Here we go from wide to long and split the dataframe by site. We then rename the values of interest to include the site, re-join the dataframe, and run a grouped t.test by type and name.
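If the goal is only the site A vs. site B comparison within each type, reshaping may not be needed at all. A minimal sketch, assuming the `dat` from the question and dplyr 1.0 or later for `across()`: group by type and subset each value column by site inside the t.test call. The name `res` is just for illustration.

```r
library(dplyr)

# Data from the question; type is written out in full instead of relying on recycling
dat <- data.frame(
  site   = c("A", "B", "B", "A", "A", "B", "B", "A"),
  type   = rep(c("AB", "CD"), times = 4),
  valueA = c(13, -10, -5, 18, -14, 12, -17, 19),
  valueB = c(-3, 20, 15, -16, 12, 15, -11, 14)
)

# One unpaired (Welch) t-test per type and value column, comparing site A with site B
res <- dat %>%
  group_by(type) %>%
  summarise(across(c(valueA, valueB),
                   ~ t.test(.x[site == "A"], .x[site == "B"])$p.value))

res
```

This returns one row per type with a p-value column for valueA and one for valueB, which is the layout asked for in the question. With only two observations per site and type in the example data, the tests have very little power.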
I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract p-values for every comparison. A shorter representative example of the data, using only 2 variables could be the following;
df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))
Ideally the end result will be a summarised data frame containing four columns; one for the variable being tested (var1, var2 etc.), two for the groups being tested every time and a final one for the p-value.
I've tried duplicating the group column in the long form, and doing a double group_by in order to do the comparisons but with no result
result <- df %>%
pivot_longer(var1:var2, names_to = "var", values_to = "value") %>%
rename(group_a = group) %>%
mutate(group_b = group_a) %>%
group_by(group_a, group_b) %>%
summarise(n = n())
We can reshape the data into 'long' format with pivot_longer, then grouped by 'group', apply the pairwise.t.test, extract the list elements and transform into tibble with tidy (from broom) and unnest the list column
library(dplyr)
library(tidyr)
library(broom)
df %>%
pivot_longer(cols = -group, names_to = 'grp') %>%
group_by(group) %>%
summarise(out = list(pairwise.t.test(value, grp
) %>%
tidy)) %>%
unnest(c(out))
-output
# A tibble: 3 x 4
group group1 group2 p.value
<int> <chr> <chr> <dbl>
1 1 var2 var1 0.0760
2 2 var2 var1 0.0233
3 3 var2 var1 0.000244
In case you end up wanting more information about the t-tests, here is an approach that will allow you to extract more information such as the degrees of freedom and value of the test statistic:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
df <- tibble(
var1 = rnorm(90, 1, 1),
var2 = rnorm(90, 1.5, 1),
group = rep(1:3, each = 30)
)
df %>%
select(-group) %>%
names() %>%
map_dfr(~ {
y <- .
combn(3, 2) %>%
t() %>%
as.data.frame() %>%
pmap_dfr(function(V1, V2) {
df %>%
select(group, all_of(y)) %>%
filter(group %in% c(V1, V2)) %>%
t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>%
tidy() %>%
transmute(y = y,
group_1 = V1,
group_2 = V2,
df = parameter,
t_value = statistic,
p_value = p.value
)
})
})
#> # A tibble: 6 x 6
#> y group_1 group_2 df t_value p_value
#> <chr> <int> <int> <dbl> <dbl> <dbl>
#> 1 var1 1 2 58 -0.337 0.737
#> 2 var1 1 3 58 -1.35 0.183
#> 3 var1 2 3 58 -1.06 0.295
#> 4 var2 1 2 58 -0.152 0.879
#> 5 var2 1 3 58 1.72 0.0908
#> 6 var2 2 3 58 1.67 0.100
And here is @akrun's answer tweaked to give the same p-values as the above approach. Note the p.adjust.method = "none", which gives independent t-tests that will inflate your Type I error rate.
df %>%
pivot_longer(
cols = -group,
names_to = "y"
) %>%
group_by(y) %>%
summarise(
out = list(
tidy(
pairwise.t.test(
value,
group,
p.adjust.method = "none",
pool.sd = FALSE
)
)
)
) %>%
unnest(c(out))
#> # A tibble: 6 x 4
#> y group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 var1 2 1 0.737
#> 2 var1 3 1 0.183
#> 3 var1 3 2 0.295
#> 4 var2 2 1 0.879
#> 5 var2 3 1 0.0909
#> 6 var2 3 2 0.100
Created on 2021-07-30 by the reprex package (v1.0.0)
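If the inflated Type I error rate is a concern, the unadjusted p-values can be corrected afterwards with `p.adjust()` from base R. A small sketch, using the three var1 p-values printed above as a hard-coded illustration (your own values will differ because the data are random):

```r
# Unadjusted p-values for var1, copied from the output above
p_raw <- c(0.737, 0.183, 0.295)

# Holm's step-down correction (the default method of pairwise.t.test)
p_holm <- p.adjust(p_raw, method = "holm")

# Bonferroni correction: multiply by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

p_holm  # 0.737 0.549 0.590
p_bonf  # 1.000 0.549 0.885
```

Alternatively, passing p.adjust.method = "holm" directly to pairwise.t.test applies the correction within each variable's set of comparisons.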
I have a dataset with features {a,b,c...} belonging to a pair of players taken from the set {a, b, c}. Each row represents the outcome of a matchup, columns name_1, name_2 represent player names, and all other columns a1, a2, b1, b2, c1, c2, etc. represent numeric features corresponding to the player in the matchup.
Below is the example of a dataset:
set.seed(17)
df <- tibble(
name_1 = sample(letters[1:3], length(letters), replace = TRUE),
name_2 = sample(letters[1:3], length(letters), replace = TRUE),
a1 = rnorm(length(letters)),
a2 = rnorm(length(letters)),
b1 = rnorm(length(letters)),
b2 = rnorm(length(letters)),
c1 = rnorm(length(letters)),
c2 = rnorm(length(letters))) %>%
filter(!(name_1 == name_2))
What I need is to find a summary statistic for each feature grouped by player. The trouble is that the same player, for example, a, can be located sometimes under name_1, sometimes under name_2, hence his features can be located at feature1 or feature2.
Here is my feeble attempt to do this for one player (namely, a) and one feature (namely, a):
df %>%
mutate(feature_a_joined = case_when(df$name_1 == "a" ~ a1,
df$name_2 == "a" ~ a2)) %>%
summarise(mean = mean(feature_a_joined, na.rm = TRUE))
I am fairly new to R, but the examples that I've seen in multiple vignettes refer to more standard datasets. Is there an efficient way to make a summary for each player and each variable?
Update
My expected result would be something like this:
# A tibble: 3 x 4
player feature_a_mean feature_b_mean feature_c_mean
<chr> <dbl> <dbl> <dbl>
1 a -0.330 2.38 0.960
2 b -0.482 1.30 0.207
3 c -0.482 -0.477 -1.71
We can use map. Get the unique column names ('un1') from the data. Loop over those (map), apply the OP's code with case_when and get the mean
library(dplyr)
library(purrr)
library(stringr)
un1 <- unique(str_remove(names(df)[-(1:2)], "\\d+"))
map_dfc(un1, ~
df %>%
summarise(!! str_c('mean_', .x) :=
mean(case_when(name_1 == .x ~ !! rlang::sym(str_c(.x, '1')),
name_2 == .x ~ !! rlang::sym(str_c(.x, '2'))),
na.rm = TRUE)))
-output
# A tibble: 1 x 3
# mean_a mean_b mean_c
# <dbl> <dbl> <dbl>
#1 -0.00673 0.186 -0.0632
Update
Based on the OP's expected output (assuming the output values are placeholders), we reshape the multiple blocks of columns to 'long' format with pivot_longer, do a group by to get the summarise across columns 'a' to 'c'
library(tidyr)
df %>%
pivot_longer(everything(), names_to = c('.value', 'grp'),
names_sep= '(?<=[a-z])_?(?=[0-9])') %>%
group_by(player = name) %>%
summarise(across(a:c, mean, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 3 x 4
# player a b c
# <chr> <dbl> <dbl> <dbl>
#1 a -0.00673 0.197 0.126
#2 b -0.0455 0.186 -0.138
#3 c -0.118 -0.468 -0.0632
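The same per-player summary can also be sketched without regex-based reshaping, by stacking the "player 1" and "player 2" halves of each row on top of each other. This assumes the column layout from the question (name_1/name_2 plus a1..c2); the name `long` is only illustrative.

```r
library(dplyr)

set.seed(17)
df <- tibble(
  name_1 = sample(letters[1:3], length(letters), replace = TRUE),
  name_2 = sample(letters[1:3], length(letters), replace = TRUE),
  a1 = rnorm(length(letters)), a2 = rnorm(length(letters)),
  b1 = rnorm(length(letters)), b2 = rnorm(length(letters)),
  c1 = rnorm(length(letters)), c2 = rnorm(length(letters))
) %>%
  filter(name_1 != name_2)

# Stack both sides of each matchup so every row describes exactly one player
long <- bind_rows(
  df %>% transmute(player = name_1, a = a1, b = b1, c = c1),
  df %>% transmute(player = name_2, a = a2, b = b2, c = c2)
)

# Mean of every feature per player
res <- long %>%
  group_by(player) %>%
  summarise(across(a:c, ~ mean(.x)))

res
```

This is more verbose than the pivot_longer version when there are many features, but it makes explicit which columns belong to which player.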
I currently have a dataframe similar to this one:
df <- tibble("Fam_Name" = c("Architecture", "Arts", "Business", "Managers", "Medicine", "Science"), "Code" = c(1,1,2, 2,3, 3), "Share_2002" = c(0.116, 3.442, 2.445, 1.932, 0.985, 0.321), "Share_2018" = c(0.161, 0.232, 1.234, 0.456, 0.089, 0.06))
I would like to create a list called family which contains three other lists: fam1, fam2, fam3
Each fam(i) list would contain two dataframes called fam_normal and fam_long which are constructed based on dplyr functions, for instance:
fam_normal <- df %>% # I am not sure how to write this so that it is incorporated into the fam(i) list
filter(Code == i) %>%
rename("2002" = Share_2002,
"2018" = Share_2018)
fam_long <- fam_normal %>%
gather(Year, Share, 3:4) %>%
arrange(Fam_Name)
The end goal is to plot a graph for each fam(i) in the fam list where there are Years on the x-axis and Shares on the y-axis.
My real dataset has 25 families and more years.
You could first rename the columns, use group_split to split the data based on Code, and then use map to build the list of dataframes.
library(tidyverse)
df %>%
rename("2002" = Share_2002,
"2018" = Share_2018) %>%
group_split(Code) %>%
map(~list(fam_normal = .x, fam_long = .x %>%
gather(Year, Share, 3:4) %>%
arrange(Fam_Name)))
#[[1]]
#[[1]]$fam_normal
# A tibble: 2 x 4
# Fam_Name Code `2002` `2018`
# <chr> <dbl> <dbl> <dbl>
#1 Architecture 1 0.116 0.161
#2 Arts 1 3.44 0.232
#[[1]]$fam_long
# A tibble: 4 x 4
# Fam_Name Code Year Share
# <chr> <dbl> <chr> <dbl>
#1 Architecture 1 2002 0.116
#2 Architecture 1 2018 0.161
#3 Arts 1 2002 3.44
#4 Arts 1 2018 0.232
#....
Here is a base R solution,
dd <- cbind.data.frame(df[1:2], stack(df[-c(1, 2)]))
Map(list, split(df, df$Code), split(dd, dd$Code))
which gives,
$`1`
$`1`[[1]]
# A tibble: 2 x 4
Fam_Name Code Share_2002 Share_2018
<chr> <dbl> <dbl> <dbl>
1 Architecture 1 0.116 0.161
2 Arts 1 3.44 0.232
$`1`[[2]]
Fam_Name Code values ind
1 Architecture 1 0.116 Share_2002
2 Arts 1 3.442 Share_2002
7 Architecture 1 0.161 Share_2018
8 Arts 1 0.232 Share_2018
....
NOTE: You can change column names as per usual
First, you can use the purrr package to work with nested tibbles; this allows you to define the sublists together:
library(tidyverse)
df2 <- df %>%
group_by(Code) %>%
nest(.key = fam_normal) %>%
mutate(fam_long = map(fam_normal, ~gather(.x, Year, Share, -Fam_Name) %>%
arrange(Fam_Name) %>%
mutate(Year = parse_number(Year)))) %>%
unnest(fam_long)
Then you can use ggplot2 to get the plots:
ggplot(df2, aes(x = Year, y = Share, color = Fam_Name)) +
geom_line(size = 2) +
facet_grid(Code~ .)
fam <- list()
fam$normal <- df %>%
filter(Code == i) %>%
rename("2002" = Share_2002,
"2018" = Share_2018)
fam$long <- fam$normal %>%
gather(Year, Share, 3:4) %>%
arrange(Fam_Name)
Now you have a named list fam containing your DFs. Your DFs are so custom that a dplyr solution may not be as legible as this simple assignment. I am a big fan of tidyverse-style coding but not when it gets in the way of clarity and legibility.
If you want to use this in a pipe, just create a function:
make_families <- function(df) {
# insert code above
# Return `fam`
fam
}
Then you're done: this will create the list of lists you describe.
df %>%
split(.$Code) %>%
purrr::map(make_families)
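To make the outline above concrete, here is one self-contained sketch of the whole thing. It assumes the `df` from the question; the helper name `make_family` and the `fam1`, `fam2`, ... naming are illustrative choices, and `pivot_longer` is swapped in for `gather`.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Fam_Name   = c("Architecture", "Arts", "Business", "Managers", "Medicine", "Science"),
  Code       = c(1, 1, 2, 2, 3, 3),
  Share_2002 = c(0.116, 3.442, 2.445, 1.932, 0.985, 0.321),
  Share_2018 = c(0.161, 0.232, 1.234, 0.456, 0.089, 0.06)
)

# Build the two data frames for a single family (one Code value)
make_family <- function(d) {
  fam_normal <- d %>%
    rename(`2002` = Share_2002, `2018` = Share_2018)
  fam_long <- fam_normal %>%
    pivot_longer(c(`2002`, `2018`), names_to = "Year", values_to = "Share") %>%
    arrange(Fam_Name)
  list(fam_normal = fam_normal, fam_long = fam_long)
}

# One list element per Code, named fam1, fam2, fam3
family <- lapply(split(df, df$Code), make_family)
names(family) <- paste0("fam", names(family))

family$fam1$fam_long
```

Because the split is on Code, the same code should scale to 25 families, and listing more year columns in `rename` and `pivot_longer` handles additional years.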
I have a data frame with about 50 variables, of which the ones in the example below are the most important. My aim is to create a table that includes various elements split by department and gender. The combination of dplyr, group_by and summarise gives me most of what I need, but I haven't been able to figure out how to get separate columns that show, for example, meanFemaleSalary/meanMaleSalary per department. I'm able to get the mean salary per gender per department in separate data frames, but I either get an error or just a single value when I try to divide them.
I have tried searching the site and found what I believed were similar questions, but couldn't get any of the answers to work. I'd be grateful if anyone could give me a hint on how to proceed…
Thanks!
Example:
library(dplyr)
x <- data.frame(Department = rep(c("Dep1", "Dep2", "Dep3"), times=2),
Gender = rep(c("F", "M"), times=3),
Salary = seq(10,15))
This is what I have that actually works so far:
Table <- x %>% group_by(Department, Gender) %>% summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = T),
MedianSalary = median(Salary, na.rm = T))
I'd like two additional columns for AvgSalaryWomen/Men and MedianSalaryWomen/Men.
Again thanks!
If you want the new columns to be part of Table you could do something like this. But it will result in the value being repeated per department.
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"],
`MedianSalaryWomen/Men` = MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"])
# Department Gender Count AverageSalary MedianSalary `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
# 1 Dep1 F 1 10. 10 0.769 0.769
# 2 Dep1 M 1 13. 13 0.769 0.769
# 3 Dep2 F 1 14. 14 1.27 1.27
# 4 Dep2 M 1 11. 11 1.27 1.27
# 5 Dep3 F 1 12. 12 0.800 0.800
# 6 Dep3 M 1 15. 15 0.800 0.800
If you want just one row per department simply change mutate to summarise and you'll get
# Department `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fct> <dbl> <dbl>
# 1 Dep1 0.769 0.769
# 2 Dep2 1.27 1.27
# 3 Dep3 0.800 0.800
Here is an option to get this by spreading it to wide format
library(tidyverse)
x %>%
spread(Gender, Salary) %>%
group_by(Department) %>%
summarise(`AvgSalaryWomen/Men` = mean(F)/mean(M),
`MedianSalaryWomen/Men` = median(F)/median(M))
# A tibble: 3 x 3
# Department `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fctr> <dbl> <dbl>
# 1 Dep1 0.769 0.769
# 2 Dep2 1.27 1.27
# 3 Dep3 0.800 0.800
If you want to end up with a table that has one row per department and includes all of the descriptive statistics you're computing along the way, you probably need to convert to long, unite some columns to use as a key, go back to wide, and then add your ratios. Something like...
Table <- x %>%
group_by(Department, Gender) %>%
summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = TRUE),
MedianSalary = median(Salary, na.rm = TRUE)) %>%
# convert to long form
gather(Quantity, Value, -Department, -Gender) %>%
# create a unified gender/measure column to use as the key in the next step
unite(Set, Gender, Quantity) %>%
# go back to wide, now with repeating columns by gender
spread(Set, Value) %>%
# compute the department-level quantities you want using those new cols
mutate(AverageSalaryWomenMen = F_AverageSalary/M_AverageSalary,
MedianSalaryWomenMen = F_MedianSalary/M_MedianSalary)
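With tidyr 1.0.0 or later, the gather/unite/spread round trip can be collapsed into a single `pivot_wider` call. A sketch under that assumption, using the example data from the question; the result name `wide` and the ratio column names are illustrative:

```r
library(dplyr)
library(tidyr)

x <- data.frame(Department = rep(c("Dep1", "Dep2", "Dep3"), times = 2),
                Gender = rep(c("F", "M"), times = 3),
                Salary = seq(10, 15))

wide <- x %>%
  group_by(Department, Gender) %>%
  summarise(AverageSalary = mean(Salary, na.rm = TRUE),
            MedianSalary = median(Salary, na.rm = TRUE),
            .groups = "drop") %>%
  # several values_from columns yield names like AverageSalary_F, AverageSalary_M
  pivot_wider(names_from = Gender,
              values_from = c(AverageSalary, MedianSalary)) %>%
  mutate(AvgSalaryWomenMen = AverageSalary_F / AverageSalary_M,
         MedianSalaryWomenMen = MedianSalary_F / MedianSalary_M)

wide
```

The result has one row per department, with the per-gender statistics and both ratios side by side.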
I have a dataframe with panel structure: 2 observations for each unit from two years:
library(tidyr)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
mydf
# id year value
#1 1 2012 0.09668064
#2 1 2013 0.62739399
#3 2 2012 0.45618433
#4 2 2013 0.60347152
#5 3 2012 0.84537624
#6 3 2013 0.33466030
I would like to reshape this data to wide format which can be done easily with tidyr::spread. However, as the values of the year variable are numbers, the names of my new variables become numbers as well which makes its further use harder.
spread(mydf, year, value)
# id 2012 2013
#1 1 0.09668064 0.6273940
#2 2 0.45618433 0.6034715
#3 3 0.84537624 0.3346603
I know I can easily rename the columns. However, if I would like to reshape within a chain with other operations, it becomes inconvenient. E.g. the following line obviously does not make sense.
library(dplyr)
mydf %>% spread(year, value) %>% filter(2012 > 0.5)
The following works but is not that concise:
tmp <- spread(mydf, year, value)
names(tmp) <- c("id", "y2012", "y2013")
filter(tmp, y2012 > 0.5)
Any idea how I can change the new variable names within spread?
I know some years have passed since this question was originally asked, but for posterity I want to also highlight the sep argument of spread. When not NULL, it will be used as separator between the key name and values:
mydf %>%
spread(key = year, value = value, sep = "")
# id year2012 year2013
#1 1 0.15608322 0.6886531
#2 2 0.04598124 0.0792947
#3 3 0.16835445 0.1744542
This is not exactly as wanted in the question, but sufficient for my purposes. See ?spread.
Update with tidyr 1.0.0: tidyr 1.0.0 has now introduced pivot_wider (and pivot_longer), which allow for more control in this respect with the arguments names_sep and names_prefix. So now the call would be:
mydf %>%
pivot_wider(names_from = year, values_from = value,
names_prefix = "year")
# # A tibble: 3 x 3
# id year2012 year2013
# <int> <dbl> <dbl>
# 1 1 0.347 0.388
# 2 2 0.565 0.924
# 3 3 0.406 0.296
To get exactly what was originally wanted (prefixing "y" only) you can of course now get that directly by simply having names_prefix = "y".
The names_sep argument is used when names_from takes multiple columns, as demonstrated below, where I have added quarters to the data:
# Add quarters to data
mydf2 <- data.frame(
id = rep(1:3, each = 8),
year = rep(rep(c(2012, 2013), each = 4), 3),
quarter = rep(c("Q1","Q2","Q3","Q4"), 3),
value = runif(24)
)
head(mydf2)
# id year quarter value
# 1 1 2012 Q1 0.8651470
# 2 1 2012 Q2 0.3944423
# 3 1 2012 Q3 0.4580580
# 4 1 2012 Q4 0.2902604
# 5 1 2013 Q1 0.4751588
# 6 1 2013 Q2 0.6851755
mydf2 %>%
pivot_wider(names_from = c(year, quarter), values_from = value,
names_sep = "_", names_prefix = "y")
# # A tibble: 3 x 9
# id y2012_Q1 y2012_Q2 y2012_Q3 y2012_Q4 y2013_Q1 y2013_Q2 y2013_Q3 y2013_Q4
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.865 0.394 0.458 0.290 0.475 0.685 0.213 0.920
# 2 2 0.566 0.614 0.509 0.0515 0.974 0.916 0.681 0.509
# 3 3 0.968 0.615 0.670 0.748 0.723 0.996 0.247 0.449
You can use backticks for column names starting with numbers and filter should work as expected
mydf %>%
spread(year, value) %>%
filter(`2012` > 0.5)
# id 2012 2013
#1 3 0.8453762 0.3346603
Another option would be using unite to join two columns into a single column after creating a second column 'year1' with the string 'y'.
mydf %>%
mutate(year1='y') %>%
unite(yearN, year1, year) %>%
spread(yearN, value) %>%
filter(y_2012 > 0.5)
# id y_2012 y_2013
#1 3 0.8453762 0.3346603
We can even change the 'year' column within mutate by using paste
mydf %>%
mutate(year=paste('y', year, sep="_")) %>%
spread(year, value) %>%
filter(y_2012 > 0.5)
Another option is to use the setNames() function as the next thing in the pipe:
mydf %>%
spread(year, value) %>%
setNames( c("id", "y2012", "y2013") ) %>%
filter(y2012 > 0.5)
The only problem using setNames is that you have to know exactly what your columns will be when you spread() them. Most of the time, that's not a problem, particularly if you're working semi-interactively.
But if you're missing a key/value pair in your original data, there's a chance it won't show up as a column, and you can end up naming your columns incorrectly without even knowing it. Granted, setNames() will throw an error if the number of names doesn't match the number of columns, so you've got a bit of error checking built in.
Still, the convenience of using setNames() has outweighed the risk more often than not for me.
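If hard-coding the names feels too fragile, one way around it is to derive the new names from whatever columns actually come out of the reshape. A sketch, assuming dplyr 1.0 or later for `rename_with()` and using `pivot_wider()` in place of `spread()`; `res` is just an illustrative name:

```r
library(dplyr)
library(tidyr)

set.seed(1)
mydf <- data.frame(
  id = rep(1:3, each = 2),
  year = rep(c(2012, 2013), 3),
  value = runif(6)
)

res <- mydf %>%
  pivot_wider(names_from = year, values_from = value) %>%
  # prefix every column except id with "y", whatever years are present
  rename_with(~ paste0("y", .x), .cols = -id)

names(res)  # "id" "y2012" "y2013"
```

Since the prefix is applied to whichever year columns exist, a missing year simply yields one column fewer rather than mislabelled columns.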
Using spread()'s successor pivot_wider() we can give a prefix to the created columns :
library(tidyr)
set.seed(1)
mydf <- data.frame(
id = rep(1:3, rep(2,3)),
year = rep(c(2012, 2013), 3),
value = runif(6)
)
pivot_wider(mydf, names_from = "year", values_from = "value", names_prefix = "y")
#> # A tibble: 3 x 3
#> id y2012 y2013
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
Created on 2019-09-14 by the reprex package (v0.3.0)
rename() in dplyr should do the trick
library(tidyr); library(dplyr)
mydf %>%
spread(year,value)%>%
rename(y2012 = '2012',y2013 = '2013')%>%
filter(y2012>0.5)