How to combine dplyr group_by, summarise, across and multiple function outputs? - r

I have the following tibble:
tTest = tibble(Cells = rep(c("C1", "C2", "C3"), times = 3),
Gene = rep(c("G1", "G2", "G3"), each = 3),
Experiment_score = 1:9,
Pattern1 = 1:9,
Pattern2 = -(1:9),
Pattern3 = 9:1) %>%
group_by(Gene)
and I would like to correlate the Experiment_score with each of the Pattern columns for all Gene.
Looking at the tidyverse across page and examples, I thought this would work:
# `corList` is a simple wrapper for `cor` to have exactly two outputs:
corList = function(x, y) {
result = cor.test(x, y)
return(list(stat = result$estimate, pval = result$p.value))
}
tTest %>% summarise(across(starts_with("Pattern"), ~ corList(Experiment_score, .x), .names = "{.col}_corr_{.fn}"))
but I got this:
I have found a solution by melting the Pattern columns and I will post it down below for completeness but the challenge is that I have dozens of Pattern columns and millions of rows. If I melt the Pattern columns, I end up with half a billion rows, seriously hampering my ability to work with the data.
EDIT:
My own imperfect solution:
# `corVect` is a simple wrapper for `cor` to have exactly two outputs:
corVect = function(x, y) {
result = cor.test(x, y)
return(c(stat = result$estimate, pval = result$p.value))
}
tTest %>% pivot_longer(starts_with("Pattern"), names_to = "Pattern", values_to = "Strength") %>%
group_by(Gene, Pattern) %>%
summarise(CorrVal = corVect(Experiment_score, Strength)) %>%
mutate(CorrType = c("corr", "corr_pval")) %>%
# Reformat
pivot_wider(id_cols = c(Gene, Pattern), names_from = CorrType, values_from = CorrVal)

To get the desired result in one step, wrap the function return as a tibble rather than a list, and call .unpack = TRUE in across. Here using a conveniently-named corTibble function:
library(tidyverse)
tTest = tibble(
Cells = rep(c("C1", "C2", "C3"), times = 3),
Gene = rep(c("G1", "G2", "G3"), each = 3),
Experiment_score = 1:9,
Pattern1 = 1:9 + rnorm(9), # added some noise
Pattern2 = -(1:9 + rnorm(9)),
Pattern3 = 9:1 + rnorm(9)
) %>%
group_by(Gene)
corTibble = function(x, y) {
result = cor.test(x, y)
return(tibble(stat = result$estimate, pval = result$p.value))
}
tTest %>% summarise(across(
starts_with("Pattern"),
~ corTibble(Experiment_score, .x),
.names = "{.col}_corr",
.unpack = TRUE
))
#> # A tibble: 3 × 7
#> Gene Pattern1_corr_stat Pattern1_corr_pval Pattern2…¹ Patte…² Patte…³ Patte…⁴
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 G1 0.947 0.208 -0.991 0.0866 -1.00 0.0187
#> 2 G2 0.964 0.172 -0.872 0.325 -0.981 0.126
#> 3 G3 0.995 0.0668 -0.680 0.524 -0.409 0.732
#> # … with abbreviated variable names ¹​Pattern2_corr_stat, ²​Pattern2_corr_pval,
#> # ³​Pattern3_corr_stat, ⁴​Pattern3_corr_pval

Related

i want to write a custom function with tidyverse verbs/syntax that accepts the grouping parameters of my function as string

I want to write a function that has as parameters a data set, a variable to be grouped, and another parameter to be filtered. I want to write the function in such a way that I can afterwards apply map() to it and pass the variables to be grouped in to map() as a vector. Nevertheless, I don't know how my custom function rating() accepts the variables to be grouped as a string. This is what i have tried.
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
fn(data = data, by = g2, no = 5) #this works
And this is the way i want to use my function
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but i get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
As we are passing character elements, it would be better to convert to symbol and evaluate (!!)
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
-testing
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3

Use pivot_longer to cast data to long with repeated column names

I have a df that is will be of nonfinite length. Example below only has 2 traits: "density" and "lipids", but other dfs may have 50 or more traits. Each trait has 3 columns associated with it: value.trait, unit.trait, method.trait. Seems very similiar to this example in vignette But when I run the code below I keep getting an error: Input must be a vector, not NULL
3 rows of sample data
x <- structure(list(geno_name = c("MB mixed", "MB mixed", "MB mixed"
), study_location = c("lab", "lab", "lab"), author = c("test",
"test", "test"), value.lipids = c(NA, 2.361463603, 1.461189384
), unit.lipids = c(NA, "g cm^-2", "g cm^-2"), method.lipids = c(NA,
"airbrush", "airbrush"), value.density = c(1.125257337, 0.816034359,
0.49559013), unit.density = c("g cm^-3", "g cm^-3", "g cm^-3"
), method.density = c("3D scanning", "3D scanning", "3D scanning"
)), row.names = c(NA, 3L), class = "data.frame")
Current pivot code:
x %>%
select(!c(study_location, author)) %>%
pivot_longer(cols = !geno_name,
names_to = c(".value", "trait"),
names_sep = ".",
values_drop_na = TRUE)
Error code:
Error: Input must be a vector, not NULL. Run rlang::last_error() to
see where the error occurred. In addition: Warning messages: 1: In
gsub(paste0("^", names_prefix), "", names(cols)) : argument
'pattern' has length > 1 and only the first element will be used 2:
Expected 2 pieces. Additional pieces discarded in 6 rows [1, 2, 3, 4,
5, 6].
We can also do
tidyr::pivot_longer(x,
cols = c(lipids, density),
names_to = c('.value', 'trait'),
names_sep = '[.]',
values_drop_na = TRUE)
Here's an approach that first makes the data longer, then splits out traits from unit/method, then spreads those.
x %>%
janitor::clean_names() %>% # This makes the column names distinct with #s
pivot_longer(cols = -(1:2),
names_to = "var",
values_to = "val",
values_transform = list(val = as.character)) %>%
mutate(trait = if_else(str_detect(var, "unit|method", negate = TRUE),
var, NA_character_),
# the regex below is meant to remove everything starting with _
stat = if_else(is.na(trait), var %>% str_remove("\\_[^.]*$"), "value")) %>%
fill(trait) %>%
select(-var) %>%
pivot_wider(names_from = stat, values_from = val)
# A tibble: 2 x 6
geno_name observation_id trait value unit method
<chr> <dbl> <chr> <chr> <chr> <chr>
1 MB mixed 10 lipids NA NA NA
2 MB mixed 10 density 1.125 g cm^-3 3D scanning
You can use pivot_longer as :
tidyr::pivot_longer(x,
cols = matches('lipids|density'),
names_to = c('.value', 'trait'),
names_sep = '\\.',
values_drop_na = TRUE)
# geno_name study_location author trait value unit method
# <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
#1 MB mixed lab test density 1.13 g cm^-3 3D scanning
#2 MB mixed lab test lipids 2.36 g cm^-2 airbrush
#3 MB mixed lab test density 0.816 g cm^-3 3D scanning
#4 MB mixed lab test lipids 1.46 g cm^-2 airbrush
#5 MB mixed lab test density 0.496 g cm^-3 3D scanning

R: How to perform multiple t.Tests when variable pairs contain NAs throughout?

R doesn't perform a t.test when there are too few observations. However, I need to compare two surveys, where one survey has information on all items, whereas in the other it is lacking in some variables. This leads to a t.test comparison of e.g. q1 from NA (group 1) with values (group 2).
Basically, I need to find out how the t.test is performed anyway but reports an error if the requirements are lacking. I need to perform multiple t.tests at the same time (q1-q4) with grouping variable group and report the p.values to an output file.
Thanks for your help!
#create data
surveydata <- as.data.frame(replicate(1,sample(1:5,1000,rep=TRUE)))
colnames(surveydata)[1] <- "q1"
surveydata$q2 <- sample(6, size = nrow(surveydata), replace = TRUE)
surveydata$q3 <- sample(6, size = nrow(surveydata), replace = TRUE)
surveydata$q4 <- sample(6, size = nrow(surveydata), replace = TRUE)
surveydata$group <- c(1,2)
#replace all value "6" wir NA
surveydata[surveydata == 6] <- NA
#add NAs to group 1 in q1
surveydata$q1[which(surveydata$q1==1 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==2 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==3 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==4 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==5 & surveydata$group==1)] = NA
#perform t.test
svy_sel <- c("q1", "q2", "q3", "q4", "group") #vector for selection
temp <- surveydata %>%
dplyr::select(svy_sel) %>%
tidyr::gather(key = variable, value = value, -group) %>%
dplyr::mutate(value = as.numeric(value)) %>%
dplyr::group_by(group, variable) %>%
dplyr::summarise(value = list(value)) %>%
tidyr::spread(group, value) %>% #convert from “long” to “wide” format
dplyr::group_by(variable) %>% #t-test will be applied to each member of this group (ie., each variable).
dplyr::mutate(p_value = t.test(unlist(1), unlist(2))$p.value, na.action = na.exclude)
Here's a base R way to get a tidy data frame of your results:
do.call(rbind, lapply(names(surveydata)[1:4], function(i) {
tryCatch({
test <- t.test(as.formula(paste(i, "~ group")), data = surveydata)
data.frame(question = i,
group1 = test$estimate[1],
group2 = test$estimate[2],
difference = diff(test$estimate),
p.value = test$p.value, row.names = 1)
}, error = function(e) {
data.frame(question = i,
group1 = NA,
group2 = NA,
difference = NA,
p.value = NA, row.names = 1)
})
}))
#> question group1 group2 difference p.value
#> 1 q1 NA NA NA NA
#> 11 q2 2.893720 3.128878 0.23515847 0.01573623
#> 12 q3 3.020930 3.038278 0.01734728 0.85905665
#> 13 q4 3.024213 3.066998 0.04278444 0.65910949
I'm not going to get into the debate about whether t tests are appropriate for Likert type data. I think the consensus is that with decent sized samples this should be OK.
You could also still do this with dplyr if you wrote a little function that would calculate the test if there was enough data. Here's the function that takes the entries from the dataset and calculates the p-value.
ttfun <- function(v1, v2, ...){
tmp <- data.frame(x = unlist(v1),
y = unlist(v2))
tmp <- na.omit(tmp)
if(nrow(tmp) < 2){
pv <- NA
}
else{
pv <- t.test(tmp$x,tmp$y, ...)$p.value
}
pv
}
Then, you could just call that in your last call to mutate():
svy_sel <- c("q1", "q2", "q3", "q4", "group") #vector for selection
temp <- surveydata %>%
dplyr::select(svy_sel) %>%
tidyr::gather(key = variable, value = value, -group) %>%
dplyr::mutate(value = as.numeric(value)) %>%
dplyr::group_by(group, variable) %>%
dplyr::summarise(value = list(value)) %>%
tidyr::spread(group, value) %>% #convert from “long” to “wide” format
dplyr::group_by(variable) %>% #t-test will be applied to each member of this group (ie., each variable).
dplyr::rename('v1'= '1', 'v2' = '2') %>%
dplyr::mutate(p_value = ttfun(v1, v2))
> temp
# # A tibble: 4 x 4
# # Groups: variable [4]
# variable v1 v2 p_value
# <chr> <list> <list> <dbl>
# 1 q1 <dbl [500]> <dbl [500]> NA
# 2 q2 <dbl [500]> <dbl [500]> 0.724
# 3 q3 <dbl [500]> <dbl [500]> 0.549
# 4 q4 <dbl [500]> <dbl [500]> 0.355

R package "infer" - Iterative bootstrapping / looping over column names

I'm bootstrapping with the infer package.
The statistic of interest is the mean, example data is given by a tibble with 3 columns and 5 rows. My real tibble has 86 rows and 40 columns. For every column I want to do a bootstrap simulation, like shown below for the column "x" in tibble "test_tibble".
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15)
# A tibble: 5 x 3
x y z
<int> <int> <int>
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
specify(test_tibble, response = x) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
# A tibble: 1 x 2
lower_CI upper_CI
<dbl> <dbl>
1 2.10 4
I am now looking for a way of doing the same thing for the other columns in my tibble. I have tried a for-loop like this:
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, response = var_name) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Unfortunately, this returns the follwing error
Error: The response variable `var_name` cannot be found in this dataframe.
Is there any way of iterating over the columns x, y and z without entering them manually as arguments for "response"? That'd be quite tedious for 40 columns.
This is a tricky question with a tricky answer.
Take a look at the response argument of the specify function in documentation:
The variable name in x that will serve as the response. This is alternative to using the formula argument.
With this in mind I modified the code to automate the process, adding one more column to the original dataframe and using the formula argument to obtain the same result, using a column of ones as explanatory variable.
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15, w = seq(1, 1, length.out = 5))
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, formula = eval(parse(text = paste0(var_name, "~", "w"))))[, 1] %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Hope it helps

Create t.test table with dplyr?

Suppose I have data that looks like this:
set.seed(031915)
myDF <- data.frame(
Name= rep(c("A", "B"), times = c(10,10)),
Group = rep(c("treatment", "control", "treatment", "control"), times = c(5,5,5,5)),
X = c(rnorm(n=5,mean = .05, sd = .001), rnorm(n=5,mean = .02, sd = .001),
rnorm(n=5,mean = .08, sd = .02), rnorm(n=5,mean = .03, sd = .02))
)
I want to create a t.test table with a row for "A" and one for "B"
I can write my own function that does that:
ttestbyName <- function(Name) {
b <- t.test(myDF$X[myDF$Group == "treatment" & myDF$Name==Name],
myDF$X[myDF$Group == "control" & myDF$Name==Name],
conf.level = 0.90)
dataNameX <- data.frame(Name = Name,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(myDF[myDF$Group == "treatment" & myDF$Name==Name,]),
ncontrol = nrow(myDF[myDF$Group == "control" & myDF$Name==Name,]))
}
library(parallel)
Test_by_Name <- mclapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- do.call("rbind", Test_by_Name)
and the output looks like this:
Name treatment control CI pvalue ntreatment ncontrol
1 A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2 B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
I'm wondering if there is a cleaner way of doing this with dplyr. I thought about using groupby, but I'm a little lost.
Thanks!
Not much cleaner, but here's an improvement:
library(dplyr)
ttestbyName <- function(myName) {
bt <- filter(myDF, Group=="treatment", Name==myName)
bc <- filter(myDF, Group=="control", Name==myName)
b <- t.test(bt$X, bc$X, conf.level=0.90)
dataNameX <- data.frame(Name = myName,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(bt), # changes only in
ncontrol = nrow(bc)) # these 2 nrow() args
}
You should really replace the do.call function with rbindlist from data.table:
library(data.table)
Test_by_Name <- lapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- rbindlist(Test_by_Name)
or, even better, use the %>% pipes:
Test_by_Name <- myDF$Name %>%
unique %>%
lapply(., ttestbyName) %>%
rbindlist
> Test_by_Name
Name treatment control CI pvalue ntreatment ncontrol
1: A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2: B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
An old question, but the broom package has since been made available for this exact purpose (as well as other statistical tests):
library(broom)
library(dplyr)
myDF %>% group_by(Name) %>%
do(tidy(t.test(X~Group, data = .)))
Source: local data frame [2 x 9]
Groups: Name [2]
Name estimate estimate1 estimate2 statistic p.value
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A -0.03050475 0.01950384 0.05000860 -63.838440 1.195226e-09
2 B -0.04423181 0.02117864 0.06541046 -3.104927 1.613625e-02
Variables not shown: parameter (dbl), conf.low (dbl), conf.high (dbl)
library(tidyr)
library(dplyr)
myDF %>% group_by(Group) %>% mutate(rowname=1:n())%>%
spread(Group, X) %>%
group_by(Name) %>%
do(b = t.test(.$control, .$treatment)) %>%
mutate(
treatment = round(b[['estimate']][[2]], digits = 4),
control = round(b[['estimate']][[1]], digits = 4),
CI = paste0("(", paste(b[['conf.int']], collapse=", "), ")"),
pvalue = b[['p.value']]
)
# Name treatment control CI pvalue
#1 A 0.0500 0.0195 (-0.031677109707283, -0.0293323994902097) 1.195226e-09
#2 B 0.0654 0.0212 (-0.0775829100729602, -0.010880719830447) 1.613625e-02
You can add ncontrol, ntreatment manually.
You can do it with a custom t.test function and do:
my.t.test <- function(data, formula, ...)
{
tt <- t.test(formula=formula, data=data, ...)
ests <- tt$estimate
names(ests) <- sub("mean in group ()", "\\1",names(ests))
counts <- xtabs(formula[c(1,3)],data)
names(counts) <- paste0("n",names(counts))
cbind(
as.list(ests),
data.frame(
CI = paste0("(", paste(tt$conf.int, collapse=", "), ")"),
pvalue = tt$p.value,
stringsAsFactors=FALSE
),
as.list(counts)
)
}
myDF %>% group_by(Name) %>% do(my.t.test(.,X~Group))
Source: local data frame [2 x 7]
Groups: Name
Name control treatment CI pvalue ncontrol ntreatment
1 A 0.01950384 0.05000860 (-0.031677109707283, -0.0293323994902097) 1.195226e-09 5 5
2 B 0.02117864 0.06541046 (-0.0775829100729602, -0.010880719830447) 1.613625e-02 5 5

Resources