I am using rstatix to perform multiple t-tests on a dataset which works very well, but I also need Cohen's d. rstatix also includes a function to calculate Cohen's d, but it requires the original dataset, not the table generated by the t_test function.
library(tidyverse)
library(rstatix)
iris %>%
  t_test(Petal.Width ~ Species, paired = FALSE, var.equal = FALSE)
Which gives me:
# A tibble: 3 × 10
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 Petal.Width setosa versicolor 50 50 -34.1 74.8 2.72e-47 5.44e-47 ****
2 Petal.Width setosa virginica 50 50 -42.8 63.1 2.44e-48 7.32e-48 ****
3 Petal.Width versicolor virginica 50 50 -14.6 89.0 2.11e-25 2.11e-25 ****
This output would be perfect if it had an additional column with Cohen's d for each t-test. How do I get Cohen's d?
We may do a join (using paired = FALSE in cohens_d() so the effect sizes match the unpaired t-tests above):
library(dplyr)
library(rstatix)
iris %>%
  cohens_d(Petal.Width ~ Species, paired = FALSE) %>%
  inner_join(iris %>%
               t_test(Petal.Width ~ Species, paired = FALSE, var.equal = FALSE))
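If you prefer to append just the effect-size column rather than join on every shared column, a small variant (only a sketch on the same data; effsize is the column name cohens_d() returns):
ttest_tbl <- iris %>%
  t_test(Petal.Width ~ Species, paired = FALSE, var.equal = FALSE)
d_tbl <- iris %>%
  cohens_d(Petal.Width ~ Species, paired = FALSE)
# keep only the group columns and the effect size, then join onto the t-test table
ttest_tbl %>%
  left_join(select(d_tbl, group1, group2, effsize), by = c("group1", "group2"))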
I am conducting Kruskal-Wallis rank tests with the packages rstatix and agricolae. I am unsure how to reconcile the results of the different packages.
Here is my sample code:
data("iris")
#### agricolae
library(agricolae)
print( kruskal(iris[, 1], factor(iris[, 5]), group=TRUE, p.adj="bonferroni") )
$groups
           iris[, 1] groups
virginica     114.21      a
versicolor     82.65      b
setosa         29.64      c
library(rstatix)
iris %>% dunn_test(Sepal.Length ~ Species)
> # A tibble: 3 x 9
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
1 Sepal.~ setosa versi~ 50 50 6.11 1.02e- 9 2.04e- 9 ****
2 Sepal.~ setosa virgi~ 50 50 9.74 2.00e-22 6.00e-22 ****
3 Sepal.~ versic~ virgi~ 50 50 3.64 2.77e- 4 2.77e- 4 ***
I want to transform the significance stars ('****') into letters ('a', 'b', 'c'). Are there better packages or ways to do this kind of nonparametric multiple-comparison test? Please show the code, thanks!
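One possible way to get the letter display (only a sketch, and it assumes the multcompView package, which is not used above): multcompLetters() takes a named vector of p-values, with names of the form "group1-group2", and gives groups that are not significantly different a shared letter.
library(dplyr)
library(rstatix)
library(multcompView)  # assumed extra package for the compact letter display
res <- iris %>% dunn_test(Sepal.Length ~ Species)
# named vector of adjusted p-values, names formatted "group1-group2"
p_vec <- setNames(res$p.adj, paste(res$group1, res$group2, sep = "-"))
# all three comparisons are significant here, so each species should get its own letter
multcompLetters(p_vec)$Letters
Note that the letter-to-group ordering may differ from agricolae's $groups table, which sorts groups by mean rank.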
I have a dataset which shows the variables, the calculation I want to perform (sum, number of distinct values), and the new variable names after the calculation.
library(dplyr)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
Manual approach: summarise grouped by the Species variable.
iris %>%
  group_by_at("Species") %>%
  summarise(Sepal.Length2 = sum(Sepal.Length, na.rm = T),
            Petal.LengthNew = n_distinct(Petal.Length, na.rm = T))
Automate via eval(parse()):
x <- RefDf %>%
  mutate(Check = paste0(NewVariable, " = ", Calculation, "(", Variables, ", na.rm = T", ")")) %>%
  pull(Check)
iris %>% group_by_at("Species") %>% summarise(eval(parse(text = x)))
As of now it is returning:
Species `eval(parse(text = x))`
<fct> <int>
1 setosa 9
2 versicolor 19
3 virginica 20
It should return:
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
You can use parse_exprs:
library(tidyverse)
library(rlang)
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length n_distinct Petal.LengthNew
", header = T)
expr_txt <- set_names(str_c(RefDf$Calculation, "(", RefDf$Variables, ")"),
RefDf$NewVariable)
iris %>%
group_by_at("Species") %>%
summarise(!!!parse_exprs(expr_txt), .groups = "drop")
# A tibble: 3 x 3
#Species Sepal.Length2 Petal.LengthNew
#<fct> <dbl> <int>
#1 setosa 250. 9
#2 versicolor 297. 19
#3 virginica 329. 20
Updated
I found a way to spare those extra lines.
This is just another way of getting your desired result. I'd rather create a function call for every row of your data set and then iterate over it alongside the new column names to get to the desired output:
library(dplyr)
library(rlang)
library(purrr)
# First we create a new variable which is actually of type call in your data set
RefDf %>%
  rowwise() %>%
  mutate(Call = list(call2(Calculation, parse_expr(Variables)))) -> Rf
Rf
# A tibble: 2 x 4
# Rowwise:
Variables Calculation NewVariable Call
<chr> <chr> <chr> <list>
1 Sepal.Length sum Sepal.Length2 <language>
2 Petal.Length n_distinct Petal.LengthNew <language>
# Then we iterate over `NewVariable` and `Call` at the same time to set the new variable
# name and also evaluate the `call` at the same time
map2(Rf$NewVariable, Rf$Call,
     ~ iris %>%
       group_by(Species) %>%
       summarise(!!.x := eval_tidy(.y))) %>%
  reduce(~ left_join(.x, .y, by = "Species"))
# A tibble: 3 x 3
Species Sepal.Length2 Petal.LengthNew
<fct> <dbl> <int>
1 setosa 250. 9
2 versicolor 297. 19
3 virginica 329. 20
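The two ideas can also be combined without the rowwise()/join step: build the named calls once and splice them into a single summarise(). This is only a sketch under the same setup:
library(dplyr)
library(purrr)
library(rlang)
calls <- set_names(map2(RefDf$Calculation, RefDf$Variables,
                        ~ call2(.x, sym(.y))),
                   RefDf$NewVariable)
iris %>%
  group_by(Species) %>%
  summarise(!!!calls, .groups = "drop")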
For the example dataset iris, I would like to compute a table that gives me the p-values for t-tests comparing the species setosa and versicolor to virginica (i.e., virginica would be the reference group/control).
Currently, I've computed the average values for the columns (sepal length, sepal width, petal length, petal width) and am trying to do a t-test grouped by species against the control.
As an example, the output would have these columns: Sepal Width p value, Sepal Length p value, Petal Length p value, Petal Width p value.
Thanks in advance for all your help!
Edit 1:
Here is what I wrote, applied to iris (which doesn't exactly fit). I basically cleaned up my data to only include certain independent variables, which is why I have so many %>% calls.
iris %>%
group_by(species) %>%
addcol = function(iris)%>%
Sepal.length.p.value = mutate(iris, function(t.test(vars(3), ~./[species == 'Sentosa'])))
and basically I did that for each of the independent variables.
You can try the following:
library(dplyr)
library(tidyr)
library(broom)
pivot_longer(iris,-Species) %>% group_by(name)
# A tibble: 600 x 3
# Groups: name [4]
Species name value
<fct> <chr> <dbl>
1 setosa Sepal.Length 5.1
2 setosa Sepal.Width 3.5
3 setosa Petal.Length 1.4
4 setosa Petal.Width 0.2
5 setosa Sepal.Length 4.9
6 setosa Sepal.Width 3
At this step, we have converted the data into long format and grouped it by variable. It is then a matter of applying a pairwise t.test within each group and filtering out the comparisons you don't need. We can use broom for this:
res <- pivot_longer(iris, -Species) %>%
  group_by(name) %>%
  do(tidy(pairwise.t.test(.$value, .$Species, pool.sd = FALSE))) %>%
  filter(group1 == "virginica" | group2 == "virginica")
# A tibble: 8 x 4
# Groups: name [4]
name group1 group2 p.value
<chr> <chr> <chr> <dbl>
1 Petal.Length virginica setosa 2.78e-49
2 Petal.Length virginica versicolor 4.90e-22
3 Petal.Width virginica setosa 7.31e-48
4 Petal.Width virginica versicolor 2.11e-25
5 Sepal.Length virginica setosa 1.19e-24
6 Sepal.Length virginica versicolor 1.87e- 7
7 Sepal.Width virginica setosa 9.14e- 9
8 Sepal.Width virginica versicolor 1.82e- 3
Note that I set pool.sd = FALSE in pairwise.t.test so that it behaves like separate t.test calls; ideally, though, if you have many groups and their variances are similar, it pays to use a pooled SD.
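For reference, the pooled-SD version (the pairwise.t.test default) is the same pipeline with that argument dropped; only a sketch:
pivot_longer(iris, -Species) %>%
  group_by(name) %>%
  do(tidy(pairwise.t.test(.$value, .$Species))) %>%
  filter(group1 == "virginica" | group2 == "virginica")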
You can put res in wide format again:
pivot_wider(res,values_from=p.value,names_from=name)
# A tibble: 2 x 6
group1 group2 Petal.Length Petal.Width Sepal.Length Sepal.Width
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 virginica setosa 2.78e-49 7.31e-48 1.19e-24 0.00000000914
2 virginica versicolor 4.90e-22 2.11e-25 1.87e- 7 0.00182
This is a possible solution: Cycle through the variable names in iris using purrr::map_dfc() and within that map_dfc() you cycle through the treatment groups (versicolor and setosa) with purrr::map_dfr(). That way the results of the inner cycle are combined rowwise and the results of the outer cycle are combined columnwise.
var_names <- names(iris)
var_names <- var_names[-length(var_names)] # Last variable is the group/Species variable, we don't want to include that.
treat_group <- c(versicolor = "versicolor", setosa = "setosa") # Using a named vector here will help map_dfr() to give useful names to the rows, otherwise it would just be 1 and 2.
library(purrr)
library(dplyr)
map_dfc(var_names, function(x) {
  map_dfr(treat_group, function(y) {
    res <- tibble(t.test(iris[x][iris$Species == "virginica", ],
                         iris[x][iris$Species == y, ])$p.value)
    names(res) <- x
    res
  }, .id = "species")
}) %>%
  select(-matches("[1-3]")) # drop columns with numeric characters in them, to get rid of repeated species columns
#> # A tibble: 2 x 5
#> species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 1.87e- 7 0.00182 4.90e-22 2.11e-25
#> 2 setosa 3.97e-25 0.00000000457 9.27e-50 2.44e-48
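A variant of the same loop that avoids the duplicated species columns (and hence the select(-matches(...)) clean-up) is to collect the p-values in long format and reshape; this is only a sketch under the same assumptions:
p_tbl <- map_dfr(var_names, function(x) {
  map_dfr(treat_group, function(y) {
    tibble(variable = x,
           species  = y,
           p.value  = t.test(iris[[x]][iris$Species == "virginica"],
                             iris[[x]][iris$Species == y])$p.value)
  })
})
tidyr::pivot_wider(p_tbl, names_from = variable, values_from = p.value)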
You could just split your data into control and treatment groups and use dplyr::summarise within your groups to create a column that gives you the p-value of a t-test.
library(dplyr)
control <- iris %>%
  filter(Species == "virginica")
dat <- iris %>%
  group_by(Species) %>%
  filter(Species != "virginica") %>%
  summarise("Sepal Width p value" = t.test(Sepal.Width, control$Sepal.Width)$p.value,
            "Sepal length p value" = t.test(Sepal.Length, control$Sepal.Length)$p.value,
            "Petal length p value" = t.test(Petal.Length, control$Petal.Length)$p.value,
            "Petal width p value" = t.test(Petal.Width, control$Petal.Width)$p.value)
With the output being:
# A tibble: 2 x 5
Species `Sepal Width p value` `Sepal length p value` `Petal length p value` `Petal width p value`
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 0.00000000457 3.97e-25 9.27e-50 2.44e-48
2 versicolor 0.00182 1.87e- 7 4.90e-22 2.11e-25
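The same comparison can be written a bit more compactly with across() (a sketch, assuming dplyr >= 1.0; cur_column() looks up the matching column in control so the t.test() call isn't repeated for every measurement):
iris %>%
  filter(Species != "virginica") %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric),
                   ~ t.test(.x, control[[cur_column()]])$p.value,
                   .names = "{.col} p value"))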
I want to run a linear regression on a data frame using the same dependent variable each time. A similar question was solved here. The problem is that the aov() function for ANOVA doesn't accept x and y as arguments (as far as I know). Is there a way to implement the analysis in a tidy way? So far I've tried something like:
library(tidyverse)
iris %>%
as_tibble() %>%
select(Sepal.Length, Species) %>%
mutate(foo_a = as_factor(sample(c("a", "b", "c"), nrow(.), replace = T)),
foo_b = as_factor(sample(c("d", "e", "f"), nrow(.), replace = T))) %>%
map(~aov(Sepal.Length ~ .x, data = .))
Created on 2019-02-12 by the reprex package (v0.2.1)
The desired output is three analyses: Sepal.Length vs Species, Sepal.Length vs foo_a, and Sepal.Length vs foo_b. Is this possible, or am I totally off track?
One approach is to make this into a long-shaped data frame, group by the independent variable of interest, and use the "many models" approach. I usually prefer something like this over trying to do tidyeval across multiple columns—it just gives me a clearer sense of what's going on.
To save space, I'm working with iris_foo, which is your data as you created it, up through the two mutate() lines. Putting it into long format gives you a key column of the names of those three columns, which will be used as the independent variable in each of the aov() calls.
library(tidyverse)
iris_foo %>%
gather(key, value, -Sepal.Length)
#> # A tibble: 450 x 3
#> Sepal.Length key value
#> <dbl> <chr> <chr>
#> 1 5.1 Species setosa
#> 2 4.9 Species setosa
#> 3 4.7 Species setosa
#> 4 4.6 Species setosa
#> 5 5 Species setosa
#> 6 5.4 Species setosa
#> 7 4.6 Species setosa
#> 8 5 Species setosa
#> 9 4.4 Species setosa
#> 10 4.9 Species setosa
#> # … with 440 more rows
From there, nest by key and create a new list-column of ANOVA models. This will be a list of aov objects. For simplicity with getting your models back out, you can drop the data column.
aov_models <- iris_foo %>%
  gather(key, value, -Sepal.Length) %>%
  group_by(key) %>%
  nest() %>%
  mutate(model = map(data, ~aov(Sepal.Length ~ value, data = .))) %>%
  select(-data)
aov_models
#> # A tibble: 3 x 2
#> key model
#> <chr> <list>
#> 1 Species <S3: aov>
#> 2 foo_a <S3: aov>
#> 3 foo_b <S3: aov>
From there, you can work with the models however you like. They're accessible in the list aov_models$model. Printed, they look how you'd expect. For example, the first model:
aov_models$model[[1]]
#> Call:
#> aov(formula = Sepal.Length ~ value, data = .)
#>
#> Terms:
#> value Residuals
#> Sum of Squares 63.21213 38.95620
#> Deg. of Freedom 2 147
#>
#> Residual standard error: 0.5147894
#> Estimated effects may be unbalanced
To see all the models, call aov_models$model %>% map(print). You might also want to use broom functions, such as broom::tidy or broom::glance, depending on how you need to present the models.
aov_models$model %>%
map(broom::tidy)
#> [[1]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 63.2 31.6 119. 1.67e-31
#> 2 Residuals 147 39.0 0.265 NA NA
#>
#> [[2]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.281 0.141 0.203 0.817
#> 2 Residuals 147 102. 0.693 NA NA
#>
#> [[3]]
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 value 2 0.756 0.378 0.548 0.579
#> 2 Residuals 147 101. 0.690 NA NA
Or, to tidy all the models into a single data frame while keeping the key column, you could do:
aov_models %>%
  mutate(model_tidy = map(model, broom::tidy)) %>%
  unnest(model_tidy)
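If you want one-row summaries per model instead of per-term statistics, broom::glance() can be swapped into the same pattern (a sketch, assuming glance() handles aov objects; they inherit from lm):
aov_models %>%
  mutate(model_glance = map(model, broom::glance)) %>%
  unnest(model_glance)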
Could I please get some help with the following? I have a data frame with multiple groups on which I would like to run a linear model. As a test, I subset just one of the groups, ran lm(), and got the following output:
test <- filter(dat, locus == "ChrX_1")
test.result <- lm(methylation ~ Pheno, dat)
term estimate std.error statistic p.value
1 (Intercept) 56.955 0.9729203 58.540254 9.080525e-250
2 Pheno1 9.015 1.1915791 7.565591 1.464884e-13
I then used group_by() from the dplyr package to run lm() on the different groups. But the p.value for the locus "ChrX_1" is now different and weaker.
test.result4 <- group_by(dat, locus) %>%
do(model.test2 = lm(methylation ~ Pheno, data = .))
tidy(test.result4, model.test2)
locus term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ChrX_1 (Intercept) 59.40 4.476666 13.268804 1.342225e-13
2 ChrX_1 Pheno1 9.05 5.482773 1.650624 1.099895e-01
3 ChrX_10 (Intercept) 59.00 4.069398 14.498459 1.522725e-14
4 ChrX_10 Pheno1 11.40 4.983974 2.287331 2.993721e-02
5 ChrX_11 (Intercept) 58.90 4.665565 12.624408 4.460131e-13
6 ChrX_11 Pheno1 9.10 5.714127 1.592544 1.224905e-01
7 ChrX_12 (Intercept) 52.80 3.717022 14.204921 2.526739e-14
8 ChrX_12 Pheno1 10.65 4.552403 2.339424 2.667444e-02
9 ChrX_13 (Intercept) 53.10 3.556734 14.929427 7.343091e-15
10 ChrX_13 Pheno1 7.10 4.356092 1.629901 1.143224e-01
# ... with 30 more rows
As such, I was wondering what is causing the weakening of the p.values. I thought the p.value should be the same as when I subset the locus and ran lm() on it.
Thanks
As I mentioned in the comment, the issue is that you are not using the filtered data; instead you are fitting on the entire dataset. Hence the mismatch.
Below is code, with sample data, that shows no mismatch between the two approaches when group_by() and lm() are used correctly.
library(dplyr)
library(tidyr)
library(broom)
set.seed(123)
dat <- data.frame(methylation = runif(1000, min = 10, max = 200),
                  Pheno = runif(1000, min = 10, max = 200),
                  locus = sample(paste0("ChrX_", 1:10), 1000, replace = TRUE))
dat$locus <- as.character(dat$locus)
test <- filter(dat, locus == "ChrX_1")
test.result <- lm(methylation ~ Pheno, test)
summary(test.result)
test.result4 <- group_by(dat, locus) %>%
do(model.test2 = lm(methylation ~ Pheno, data = .))
tidy(test.result4, model.test2)
I tried it with iris and the results from both approaches are the same. There is something wrong with your group_by() line. Try it my way.
Look:
library(dplyr)
library(broom)
test <- filter(iris, Species == "setosa")
test.lm <- lm(Sepal.Length ~ Sepal.Width, data = test)
tidy(test.lm)
         term  estimate  std.error statistic      p.value
1 (Intercept) 2.6390012 0.31001431  8.512514 3.742438e-11
2 Sepal.Width 0.6904897 0.08989888  7.680738 6.709843e-10
Then with group_by()
iris %>% group_by(Species) %>% do(tidy(lm(Sepal.Length~Sepal.Width, data=.)))
Species term estimate std.error statistic p.value
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa (Intercept) 2.6390012 0.31001431 8.512514 3.742438e-11
2 setosa Sepal.Width 0.6904897 0.08989888 7.680738 6.709843e-10
3 versicolor (Intercept) 3.5397347 0.56287357 6.288685 9.069049e-08
4 versicolor Sepal.Width 0.8650777 0.20193757 4.283887 8.771860e-05
5 virginica (Intercept) 3.9068365 0.75706053 5.160534 4.656345e-06
6 virginica Sepal.Width 0.9015345 0.25310551 3.561892 8.434625e-04