Grouped matrix correlation - r

I'm trying to get correlation matrices of an arbitrary number of factors by group, ideally using dplyr. I have no problem getting the correlation matrix by filtering by group and summarizing, but using a "group_by", I'm not sure how to pass the factor data to cor.
library(dplyr)
numRows <- 20
myData <- tibble(A = rnorm(numRows),
B = rnorm(numRows),
C = rnorm(numRows),
Group = c(rep("Group1", numRows/2), rep("Group2", numRows/2)))
# Essentially what I'm doing is trying to get these matrices, but for all groups
myData %>%
filter(Group == "Group1") %>%
select(-Group) %>%
summarize(CorMat = cor(.))
# However, I don't know what to pass into "cor". The code below fails
myData %>%
group_by(Group) %>%
summarize(CorMat = cor(.))
# Error looks like this
Error: Problem with `summarise()` column `CorMat`.
i `CorMat = cor(.)`.
x 'x' must be numeric
i The error occurred in group 1: Group = "Group1".
I've seen solutions for the grouped correlation between specific factors (Correlation matrix by group) or correlations between all factors to a specific factor (Correlation matrix of grouped variables in dplyr), but nothing for a grouped correlation matrix of all factors to all factors.

You can try using nest_by which will put you data (without Group) into a list column called data. Then you can refer to this column using cor:
myData %>%
nest_by(Group) %>%
summarise(CorMat = cor(data))
Output
Group CorMat[,1] [,2] [,3]
<chr> <dbl> <dbl> <dbl>
1 Group1 1 -0.132 0.638
2 Group1 -0.132 1 -0.284
3 Group1 0.638 -0.284 1
4 Group2 1 0.429 -0.228
5 Group2 0.429 1 -0.235
6 Group2 -0.228 -0.235 1
If you want a named list of matrices, you can also try the following. You can add split (or try group_split without names) and then map to remove the Group column.
library(tidyverse)
myData %>%
nest_by(Group) %>%
summarise(CorMat = cor(data)) %>%
ungroup %>%
split(f = .$Group) %>%
map(~ .x %>% select(-Group))
Output
$Group1
# A tibble: 3 x 1
CorMat[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 1 -0.132 0.638
2 -0.132 1 -0.284
3 0.638 -0.284 1
$Group2
# A tibble: 3 x 1
CorMat[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 1 0.429 -0.228
2 0.429 1 -0.235
3 -0.228 -0.235 1

Related

conditionally mutating column values using `dplyr`

I am using WRS2 to carry out robust pairwise comparisons. But one problem is that it removes the group level names from the output dataframes and saves it in a different object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the group column numeric values with actual names included in fnames (so map fnames[1] -> 1, fnames[2] -> 2, and so on).
So the final dataframe should look something like the following-
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
Using a named vector to match with tidyverse. This matches by value and not by the sequence of index i.e. if the value in 'Group' columns are not in a sequence or character, this would still work
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
Does this fullfil your needs :
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach formating data to long after extracting the objects as tibbles. You can use left_join() to get your groups as you want. Here the code to get something close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high

How to write a function that conducts paired t-tests on all group/variable combinations in a data frame

I have a data frame similar to data created below:
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878),each=5))
Var <- data.frame(Variable=c("Copper","Iron","Lead","Zinc","CaCO"))
n <- 10
Variable <- do.call("rbind",replicate(n,Var,simplify=F))
Location <- rep(c("Alpha","Beta","Gamma"), times=c(20,20,10))
Location <- data.frame(Location)
set.seed(1)
FirstPt<- data.frame(FirstPt=sample(1:100,50,replace=T))
LastPt <- data.frame(LastPt=sample(1:100,50,replace=T))
First3<- data.frame(First3=sample(1:100,50,replace=T))
First5<- data.frame(First5=sample(1:100,50,replace=T))
First7<- data.frame(First7=sample(1:100,50,replace=T))
First10<- data.frame(First10=sample(1:100,50,replace=T))
Last3<- data.frame(Last3=sample(1:100,50,replace=T))
Last5<- data.frame(Last5=sample(1:100,50,replace=T))
Last7<- data.frame(Last7=sample(1:100,50,replace=T))
Last10<- data.frame(Last10=sample(1:100,50,replace=T))
data <- cbind(ID,Location,Variable,FirstPt,LastPt,First3,First5,First7,
First10,Last3,Last5,Last7,Last10)
This may be a two part question, but I want to write a function that groups all Variables that are the same (for instance, all the observations that are Copper) and conducts a paired t test between all possible combinations of the numeric columns (FirstPt:Last10). I want it to return the p values in a data frame like this:
Test P-Value
FirstPt.vs.LastPt …
FirstPt.vs.First3 …
ect... …
This will likely be a second function, but I also want to do this after the observations are grouped by Location so that the output data frame will look like this:
Test P-Value
FirstPt.vs.LastPt.InAlpha
FirstPt.vs.LastPt.InBeta
ect...
You can do both of these with one function:
library(tidyverse)
t.test.by.group.combos <- function(.data, groups){
by <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
.data %>%
group_by(!!!groups) %>%
select_if(is.integer) %>%
group_split() %>%
map(.,
~pivot_longer(., cols = (FirstPt:Last10), names_to = "name", values_to = "val") %>%
nest(data = val) %>%
full_join(.,.,by = by) %>%
filter(name.x != name.y) %>%
mutate(test = paste(name.x, "vs",name.y, !!!groups, sep = "."),
p.value = map2_dbl(data.x,data.y, ~t.test(unlist(.x), unlist(.y))$p.value)) %>%
select(test,p.value)%>%
filter(!duplicated(p.value))
) %>%
bind_rows()
}
t.test.by.group.combos(data, vars(Variable))
#> # A tibble: 225 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO 0.511
#> 2 FirstPt.vs.First3.CaCO 0.184
#> 3 FirstPt.vs.First5.CaCO 0.494
#> 4 FirstPt.vs.First7.CaCO 0.354
#> 5 FirstPt.vs.First10.CaCO 0.893
#> 6 FirstPt.vs.Last3.CaCO 0.496
#> 7 FirstPt.vs.Last5.CaCO 0.909
#> 8 FirstPt.vs.Last7.CaCO 0.439
#> 9 FirstPt.vs.Last10.CaCO 0.146
#> 10 LastPt.vs.First3.CaCO 0.578
#> # … with 215 more rows
t.test.by.group.combos(data, vars(Variable, Location))
#> # A tibble: 674 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO.Alpha 0.850
#> 2 FirstPt.vs.First3.CaCO.Alpha 0.822
#> 3 FirstPt.vs.First5.CaCO.Alpha 0.895
#> 4 FirstPt.vs.First7.CaCO.Alpha 0.810
#> 5 FirstPt.vs.First10.CaCO.Alpha 0.645
#> 6 FirstPt.vs.Last3.CaCO.Alpha 0.870
#> 7 FirstPt.vs.Last5.CaCO.Alpha 0.465
#> 8 FirstPt.vs.Last7.CaCO.Alpha 0.115
#> 9 FirstPt.vs.Last10.CaCO.Alpha 0.474
#> 10 LastPt.vs.First3.CaCO.Alpha 0.991
#> # … with 664 more rows
This is kind of a lengthy function, but in general we group by the groups argument, then we select the groups and any integer columns, then we split the dataframe by the groups. After, we map all the combinations of variables and perform t.tests for each combo. Lastly, we rejoin all the groups into one dataframe.
I think this is what you want. The key was to use group_by and do from tidyverse.
df <- NULL
for(i in (4:(ncol(data)-1))){
for(j in ((i+1):ncol(data))){
df <- rbind(df,data %>%
group_by(Location) %>%
do(data.frame(pval = t.test(.[[i]],.[[j]], data = .)$p.value)) %>%
ungroup() %>%
mutate(Test = paste0(colnames(data)[i],'.vs.',colnames(data)[j]))
)
}
}
df$Test <- paste0(df$Test,'.In',df$Location)
Probably, you can acheive what you want using the below code :
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = FirstPt:Last10) %>%
group_by(Variable) %>%
summarise(p_value = list(combn(name, 2, function(x)
t.test(value[name == x[1]], value[name == x[2]])$p.value)),
test = list(combn(name, 2, paste, collapse = "_"))) %>%
unnest(cols = c(test, p_value))
# Variable p_value test
# <fct> <dbl> <chr>
# 1 CaCO 0.915 FirstPt_LastPt
# 2 CaCO 0.529 FirstPt_First3
# 3 CaCO 0.337 FirstPt_First5
# 4 CaCO 0.350 FirstPt_First7
# 5 CaCO 0.395 FirstPt_First10
# 6 CaCO 0.765 FirstPt_Last3
# 7 CaCO 0.204 FirstPt_Last5
# 8 CaCO 0.873 FirstPt_Last7
# 9 CaCO 0.479 FirstPt_Last10
#10 CaCO 1 FirstPt_FirstPt
# … with 24,740 more rows
To do it grouped by Location you can add that into group_by command and keep rest of the code as it is.

how to calculate cumsum with depreciation in a grouped dataframe?

I tried to calculate the cumsum with a depreciation rate.
I have a grouped dataframe with a column number.
I want to add the number one by one with depreciation.
If the rate is 1, then the cumsum function in base r is good enough.
But if not, let's say the rate of 0.5 (means each number will multiply by 0.5 to add the next number), cumsum is not enough.
I tried to write my own function to work with dplyr, but it fails.
library(tidyverse)
# dataframe
id=sample(1:5,25,replace=TRUE)
num=rnorm(25)
df=data.frame(id,num)
# my custom function
depre=function(data){
rate=0.5
r=nrow(data)
sl=data$num
nl=data$num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
# work with one group
df %>% filter(id==1) %>% depre(.)
# failed to work with dplyr
df %>% group_by(id) %>% mutate(sl=depre(.))
I expect the first element of column s, should be the same as in column num.
But the following ones, should be depreciate by times 0.5 and add next num.
It works in one group, but failed in multi-grouped dataframe.
The error message is: "Error: Column sl must be length 6 (the group size) or one, not 25".
I have no idea. Could anyone have a clue?
Thanks
Your function would work if you pass vector to your function instead of dataframe
depre <- function(num){
rate = 0.5
r= length(num)
sl = num
nl = num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
and then apply it by group.
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
We can split by 'id' and use the OP's function without any changes
library(dplyr)
library(purrr)
df %>%
group_split(id, keep = FALSE) %>%
map_df(~ tibble(id = .$id, sl = depre(.)))
# id sl
# <int> <dbl>
# 1 1 1.07
# 2 1 -0.776
# 3 1 -0.518
# 4 1 0.628
# 5 1 0.601
# 6 1 1.10
# 7 2 -0.734
# 8 2 -0.583
# 9 2 -0.437
#10 2 -3.45
# … with 15 more rows
or an option would be accumulate from purrr which would be more compact
out <- df %>%
group_by(id) %>%
mutate(sl = accumulate(num, ~ .y + .x * 0.5))
out
# A tibble: 25 x 3
# Groups: id [5]
# id num sl
# <int> <dbl> <dbl>
# 1 3 -0.784 -0.784
# 2 2 -0.734 -0.734
# 3 2 -0.216 -0.583
# 4 3 -0.335 -0.727
# 5 5 -1.09 -1.09
# 6 4 -0.0854 -0.0854
# 7 1 1.07 1.07
# 8 2 -0.145 -0.437
# 9 3 -1.17 -1.53
#10 5 -0.819 -1.36
# … with 15 more rows
out %>%
filter(id == 1)
# A tibble: 6 x 3
# Groups: id [1]
# id num sl
# <int> <dbl> <dbl>
#1 1 1.07 1.07
#2 1 -1.31 -0.776
#3 1 -0.129 -0.518
#4 1 0.887 0.628
#5 1 0.287 0.601
#6 1 0.800 1.10
Issue in the OP's function is that the input is the whole dataset and during the process of getting the number of rows, it uses nrow(data), which would be the total number of rows. With group_by, the dplyr convention is n() - giving the number of rows. By doing the group_split, the input data.frame is split into subset of data.frames and the nrow of those will work for the created function

Perform several t.tests simultaneously on tidy data in R

I have a dataset that looks like the following:
id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5
I would like to perform several t.tests to compare the means for each factor in the S (samediff) condition to the means for that same factor in the D (samediff) condition.
I know I could do this in the following way:
dfgive<-filter(df, factor == "give")
t.test(value~samediff, dfgive)
dfimpact<-filter(df, factor == "impact")
t.test(value~samediff, dfimpact)
Is there a way to perform several t.tests in fewer lines? In the actual dataset, there are several more factors than are included here. I would like to be able to conduct all the t.tests necessary without creating separate dataframes in the same way I've shown above.
To augment existing answers, you can use broom::tidy to tidy the output from the t.test, e.g.
library(tidyverse)
library(broom)
df %>%
group_by(factor) %>%
summarise(ttest = list(t.test(value ~ samediff))) %>%
mutate(ttest = map(ttest, tidy)) %>%
unnest() %>%
select(factor, estimate, estimate1, estimate2, p.value)
# # A tibble: 2 x 5
# factor estimate estimate1 estimate2 p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 give -0.5 2 2.5 0.712
# 2 impact 0 4.5 4.5 1
Here's a base-R approach:
results <- lapply(split(df, df$factor), function(X) {
out <- t.test(value ~ samediff, X)
data.frame(diff = out$statistic,
mean1 = out$estimate[1],
mean2 = out$estimate[2],
pval = out$p.value)
})
do.call(rbind, results)
# diff mean1 mean2 pval
# give -0.4472136 2.0 2.5 0.7117228
# impact 0.0000000 4.5 4.5 1.0000000
We can split the data by factor and apply t.test one by one. The final output is a list. We can access the result by lst$give or lst$impact.
library(tidyverse)
lst <- df %>%
split(.$factor) %>%
map(~t.test(value ~ samediff, .x))
DATA
df <- read.table(text = "id samediff factor value
1 S give 3
1 S impact 4
2 S give 2
2 S impact 5
3 D give 1
3 D impact 4
4 D give 3
4 D impact 5 ",
header = TRUE, stringsAsFactors = FALSE)
We can group by 'factor' and summarise the output of t.test in a list
library(dplyr)
out <- df %>%
group_by(factor) %>%
summarise(ttest = list(t.test(value ~ samediff)))
out
# A tibble: 2 x 2
# factor ttest
# <chr> <list>
#1 give <S3: htest>
#2 impact <S3: htest>
The output is stored in a list column which can be extracted with $ or [[
identical(out$ttest[[1]], t.test(value ~ samediff, dfgive))
#[1] TRUE

Creating column that is a proportion of two conditions

I have a data frame with about 50 variables but where the ones in the example under are the most important. My aim is to create a table that includes various elements split by department and gender. The combination of dplyr, group_by and summarise gives me most of what I need but I haven't been able to figure out how to get separate columns that shows for example meanFemaleSalary/meanMaleSalary per department. I'm able to get the mean salary per gender per department in separate data frames, but either get an error or just a single value when I try to divide them.
I have tried searching the site and found what I believed was similar questions but couldn't get any of the answers to work. I'd be grateful if anyone could give me a hint on how to proceed…
Thanks!
Example:
library(dplyr)
x <- data.frame(Department = rep(c("Dep1", "Dep2", "Dep3"), times=2),
Gender = rep(c("F", "M"), times=3),
Salary = seq(10,15))
This is what I have that actually works so far:
Table <- x %>% group_by(Department, Gender) %>% summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = T),
MedianSalary = median(Salary, na.rm = T))
I'd like two additional columns for AvgSalaryWomen/Men and MedianSalaryWomen/Men.
Again thanks!
If you want the new columns to be part of Table you could do something like this. But it will result in the value being repeated per department.
Table %>% group_by(Department) %>%
mutate(`AvgSalaryWomen/Men` = AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"],
`MedianSalaryWomen/Men` = MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"])
# Department Gender Count AverageSalary MedianSalary `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
# 1 Dep1 F 1 10. 10 0.769 0.769
# 2 Dep1 M 1 13. 13 0.769 0.769
# 3 Dep2 F 1 14. 14 1.27 1.27
# 4 Dep2 M 1 11. 11 1.27 1.27
# 5 Dep3 F 1 12. 12 0.800 0.800
# 6 Dep3 M 1 15. 15 0.800 0.800
If you want just one row per department simply change mutate to summarise and you'll get
# Department `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fct> <dbl> <dbl>
# 1 Dep1 0.769 0.769
# 2 Dep2 1.27 1.27
# 3 Dep3 0.800 0.800
Here is an option to get this by spreading it to wide format
library(tidyverse)
x %>%
spread(Gender, Salary) %>%
group_by(Department) %>%
summarise(`AvgSalaryWomen/Men` = mean(F)/mean(M),
`MedianSalaryWomen/Men` = median(F)/median(M))
# A tibble: 3 x 3
# Department `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
# <fctr> <dbl> <dbl>
# 1 Dep1 0.769 0.769
# 2 Dep2 1.27 1.27
# 3 Dep3 0.800 0.800 `
If you want to end up with a table that has one row per department and includes all of the descriptive statistics you're computing along the way, you probably need to convert to long, unite some columns to use as a key, go back to wide, and then add your ratios. Something like...
Table <- x %>%
group_by(Department, Gender) %>%
summarise(Count = n(),
AverageSalary = mean(Salary, na.rm = TRUE),
MedianSalary = median(Salary, na.rm = TRUE)) %>%
# convert to long form
gather(Quantity, Value, -Department, -Gender) %>%
# create a unified gender/measure column to use as the key in the next step
unite(Set, Gender, Quantity) %>%
# go back to wide, now with repeating columns by gender
spread(Set, Value) %>%
# compute the department-level quantities you want using those new cols
mutate(AverageSalaryWomenMen = F_AverageSalary/M_AverageSalary,
MedianSalaryWomenMen = F_MedianSalary/M_MedianSalary)

Resources