compare_means is a straightforward function which I consider very useful:
library(ggpubr)
data("ToothGrowth")
df <- ToothGrowth
res <- compare_means(len ~ supp,
group.by = "dose",
data = df,
method = "wilcox.test", paired = FALSE)
However, to the best of my knowledge, it is not possible to obtain means and standard deviations (or standard errors) in the resulting table.
> res
# A tibble: 3 × 9
dose .y. group1 group2 p p.adj p.format p.signif method
<dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 0.5 len OJ VC 0.0232 0.046 0.023 * Wilcoxon
2 1 len OJ VC 0.00403 0.012 0.004 ** Wilcoxon
3 2 len OJ VC 1 1 1.000 ns Wilcoxon
>
What is the best way to obtain the group 1 and group 2 means and SD/SE in a few lines of code? I would like to have means (SD) instead of the group labels OJ/VC.
As far as I can tell from the documentation, no argument of compare_means does this.
UPDATE with my quick and dirty way:
library(ggpubr)
data("ToothGrowth")
df <- ToothGrowth
p <- ggbarplot(df, x = "supp", y = "len",
add = c("mean_sd"),
facet.by = "dose",
position = position_dodge(0.8))+
stat_compare_means(method = "wilcox.test", paired = FALSE)
# Extracting all ggplot infos
my_data <- ggplot_build(p)
# Extracting means and Standard Deviations from the plot
my_means_sd <- (my_data[["data"]][[2]])[,1:5]
my_means_sd$labs <- paste0(my_means_sd$y,
" (",
round(my_means_sd$ymin, 1),
"-",
round(my_means_sd$ymax, 1),
")")
my_means_sd <- my_means_sd[,c("x", "labs")]
# Manipulating dataframe
library(dplyr)
my_means_sd <- as.data.frame(my_means_sd %>%
group_by(x) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = x, values_from = labs) %>%
select(-row) )
# Extracting P values from plot
my_pvalues <- (my_data[["data"]][[3]])[,9:13]
res <- cbind(my_means_sd, my_pvalues)
The result I generated:
> res
1 2 p p.adj p.format p.signif method
1 13.23 (8.8-17.7) 7.98 (5.2-10.7) 0.023186427 0.023 0.023 * Wilcoxon
2 22.7 (18.8-26.6) 16.77 (14.3-19.3) 0.004030367 0.004 0.004 ** Wilcoxon
3 26.06 (23.4-28.7) 26.14 (21.3-30.9) 1.000000000 1.000 1 ns Wilcoxon
>
Something like this?
library(dplyr)
library(tidyr)
df %>%
group_by(supp, dose) %>%
summarise(mean = mean(len), sd = sd(len)) %>%
pivot_wider(
names_from = supp,
values_from = c(mean, sd)
) %>%
right_join(res, by="dose") %>%
select(-c(group1, group2, .y.))
dose mean_OJ mean_VC sd_OJ sd_VC p p.adj p.format p.signif method
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 0.5 13.2 7.98 4.46 2.75 0.0232 0.046 0.023 * Wilcoxon
2 1 22.7 16.8 3.91 2.52 0.00403 0.012 0.004 ** Wilcoxon
3 2 26.1 26.1 2.66 4.80 1 1 1.000 ns Wilcoxon
>
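Since the question asks for "mean (SD)" labels in place of OJ/VC, here is a minimal sketch of how the summary step could build them directly. The column name label and the two-decimal format are my own choices, not something from the question; the resulting table can be joined to res by dose in the same way as above.

```r
library(dplyr)
library(tidyr)

data("ToothGrowth")

# One "mean (SD)" string per dose and supplement, then one column per supplement
mean_sd <- ToothGrowth %>%
  group_by(dose, supp) %>%
  summarise(label = sprintf("%.2f (%.2f)", mean(len), sd(len)),
            .groups = "drop") %>%
  pivot_wider(names_from = supp, values_from = label)

mean_sd
```

The columns OJ and VC then hold character labels, so they are only suitable for display, not for further calculation.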
You should use another function, desc_statby.
We have a nice example:
library(ggpubr)
data("ToothGrowth")
res <- desc_statby(ToothGrowth, measure.var = "len",
grps = c("dose", "supp"))
head(res[, 1:10])
dose supp length min max median mean iqr mad sd
1 0.5 OJ 10 8.2 21.5 12.25 13.23 6.475 4.29954 4.459709
2 0.5 VC 10 4.2 11.5 7.15 7.98 4.950 3.55824 2.746634
3 1.0 OJ 10 14.5 27.3 23.45 22.70 5.350 3.92889 3.910953
4 1.0 VC 10 13.6 22.5 16.50 16.77 2.025 1.70499 2.515309
5 2.0 OJ 10 22.4 30.9 25.95 26.06 2.500 2.07564 2.655058
6 2.0 VC 10 18.5 33.9 25.95 26.14 5.425 4.59606 4.797731
You can take the necessary information from both tables and combine it into a single result table.
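A rough sketch of that combination step, assuming the desc_statby output has the dose, supp, mean and sd columns shown above. The object names stats, pvals and combined are mine.

```r
library(ggpubr)
library(dplyr)
library(tidyr)

data("ToothGrowth")

# Descriptive statistics per dose and supplement
stats <- desc_statby(ToothGrowth, measure.var = "len",
                     grps = c("dose", "supp"))

# Wilcoxon tests per dose, as in the question
pvals <- compare_means(len ~ supp, group.by = "dose",
                       data = ToothGrowth,
                       method = "wilcox.test", paired = FALSE)

# Keep mean and sd, spread them by supplement, then attach the test results
combined <- stats %>%
  select(dose, supp, mean, sd) %>%
  pivot_wider(names_from = supp, values_from = c(mean, sd)) %>%
  left_join(pvals, by = "dose")

combined
```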
I am trying to get the Long Term Average for the tertiles of all the variables in my dataframe.
Basically, I want the mean value for each of the three tertiles of each variable. Below is an example for the second tertile.
Data structure:
DOY city P BP prune Tmax
100 Bechem 1.283868 0.001742587 1.00 29.07214
123 Bechem 0.000000 0.002593004 0.02 30.42596
345 Bechem 0.000000 0.421595923 0.02 30.03821
100 Bechem 1.283868 0.001742587 1.00 29.07214
123 Bechem 0.000000 0.002593004 0.02 30.42596
345 Bechem 0.000000 0.393785818 0.02 29.03066
100 Bechem 1.283868 0.001742587 1.00 29.07214
123 Bechem 0.000000 0.002593004 0.02 30.42596
345 Bechem 0.000000 0.174428952 0.02 32.00171
100 Bechem 1.283868 0.001742587 1.00 29.07214
What I do now is:
Fun_tertile_mean <- function(x,i){
quantile<-quantile(x, c(0:3/3))
datum <- mean(x[x<= quantile[i] & x>quantile[i-1]],na.rm = TRUE)
return(datum)
}
Data_General_2tertile <- Data_General[Data_General$yr_prjctd %in% c(2010,2011,2012,2013,2014,2015),] %>%
group_by(city) %>%
group_by(DOY) %>%
select_if(is.numeric) %>%
summarise_all(function(x) Fun_tertile_mean(x,3))
I get a data frame that is mostly NaN with only a few values, and I cannot tell whether the problem is in the function I defined or in the interaction between group_by() and summarise_all().
Any tip or help is much appreciated!
Hello and welcome to SO!
I believe the problem is that you are getting tertile limits that are equal. If that is the case, no value can be both greater than the lower limit and less than or equal to the upper limit, so the subset is empty and mean() returns NaN. Here is an example with iris:
library(tidyr)
library(dplyr)
data("iris")
Fun_tertile_mean <- function(x, i) {
  quantile <- quantile(x, c(0:3 / 3))
  datum <- mean(x[x <= quantile[i] & x > quantile[i - 1]], na.rm = TRUE)
  return(datum)
}
iris %>% group_by(Species) %>%
  select_if(is.numeric) %>%
  summarise_all(function(x) Fun_tertile_mean(x, 3))
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.05 3.42 1.5 NaN
2 versicolor 5.97 2.85 4.38 1.4
3 virginica 6.54 2.99 5.54 2.05
quantile(iris[iris$Species == "setosa", "Petal.Width"], c(0:3 / 3))
0% 33.33333% 66.66667% 100%
0.1 0.2 0.2 0.6
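To make the mechanism concrete, here is a small sketch with made-up values (the vector x and the names lower, upper and in_tertile are only for illustration). When the two limits coincide, the filter selects nothing, and the mean of an empty numeric vector is NaN.

```r
# Toy data with many ties, similar in spirit to setosa's Petal.Width
x <- c(0.1, 0.2, 0.2, 0.2, 0.6)

# Equal tertile limits, as in the quantile output above
lower <- 0.2
upper <- 0.2

# No value can be > 0.2 and <= 0.2 at the same time
in_tertile <- x[x <= upper & x > lower]

length(in_tertile)                # 0
mean(in_tertile, na.rm = TRUE)    # NaN
```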
One solution is to use ifelse and return the tertile limit itself, instead of the mean, when the two limits are equal:
Fun_tertile_mean <- function(x, i) {
  quantile <- quantile(x, c(0:3 / 3), digits = 10)
  datum <- ifelse(quantile[i] == quantile[i - 1], quantile[i], mean(x[x <= quantile[i] & x > quantile[i - 1]], na.rm = TRUE))
  return(datum)
}
iris %>% group_by(Species) %>%
  select_if(is.numeric) %>%
  summarise_all(function(x) Fun_tertile_mean(x, 3))
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.05 3.42 1.5 0.2
2 versicolor 5.97 2.85 4.38 1.4
3 virginica 6.54 2.99 5.54 2.05
Hope it helps
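A separate detail in the question's pipeline may also be worth checking: calling group_by() twice in a row replaces the first grouping rather than adding to it, so group_by(city) %>% group_by(DOY) groups by DOY only. A short sketch using mtcars as stand-in data, since the original Data_General is not available:

```r
library(dplyr)

# Two consecutive group_by() calls: the second one overrides the first
g_override <- mtcars %>% group_by(cyl) %>% group_by(gear)
group_vars(g_override)   # only "gear"

# Both variables in a single call: grouped by the combination
g_both <- mtcars %>% group_by(cyl, gear)
group_vars(g_both)       # "cyl" and "gear"
```

If grouping by both city and DOY was intended, group_by(city, DOY) would be the usual way to write it.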
I am using WRS2 to carry out robust pairwise comparisons. One problem is that it removes the group level names from the output data frames and saves them in a different object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the group column numeric values with actual names included in fnames (so map fnames[1] -> 1, fnames[2] -> 2, and so on).
So the final data frame should look something like this:
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
You can match with a named vector in the tidyverse. This matches by value rather than by position, so it still works if the values in the 'Group' columns are not sequential or are characters.
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
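To illustrate why matching by name is more general than indexing by position, here is a small base R sketch. The codes 10, 20, 30 are hypothetical and chosen only because they are not a 1, 2, 3 sequence.

```r
# Hypothetical group codes that do not start at 1 and are not consecutive
codes  <- c(10, 20, 30)
labels <- c("placebo", "low", "high")

# Named lookup vector: the names are "10", "20", "30"
lookup <- setNames(labels, codes)

group <- c(20, 10, 30, 10)

# Matching by name finds the right label for each code
by_name <- unname(lookup[as.character(group)])

# Indexing by position looks for elements 20, 10, 30, 10, which do not exist
by_position <- labels[group]

by_name
by_position
```

With the WRS2 output the codes happen to be 1, 2, 3, so both approaches agree there.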
Does this fulfil your needs?
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach, which reshapes the data to long format after extracting the objects as tibbles. You can then use left_join() to attach the group names. Here is code that gets close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high
I'm using R and have the following problem: in each of 100 trials, I need to fit a linear model to each of k equal-sized subgroups of my dataset. I then want to obtain the parameter estimates as the mean of the subgroup estimates over the 100 trials.
I've developed the following code. I'm not sure how to store, inside the two loops, the parameter estimates from each iteration that I will need to compute the mean. I used a list ("res"), but since I have to store a vector at each repetition, maybe it is not a good choice:
# Define var-cov matrix
rho <- 0.5
row1 <- rho^(c(0:18))
row2 <- rho^(c(1,0:17))
row3 <- rho^(c(2:1,0:16))
row4 <- rho^(c(3:1,0:15))
row5 <- rho^(c(4:1,0:14))
row6 <- rho^(c(5:1,0:13))
row7 <- rho^(c(6:1,0:12))
row8 <- rho^(c(7:1,0:11))
row9 <- rho^(c(8:1,0:10))
row10 <- rho^(c(9:1,0:9))
row11 <- rho^(c(10:1,0:8))
row12 <- rho^(c(11:1,0:7))
row13 <- rho^(c(12:1,0:6))
row14 <- rho^(c(13:1,0:5))
row15 <- rho^(c(14:1,0:4))
row16 <- rho^(c(15:1,0:3))
row17 <- rho^(c(16:1,0:2))
row18 <- rho^(c(17:1,0:1))
row19 <- rho^(c(18:1,0))
S = round(rbind(row1,row2,row3,row4,row5,row6,row7,row8,row9,row10,row11,row12,row13,row14,row15,row16,row17,row18,row19),4)
library(tidyr)
library(MASS) # provides mvrnorm(), used in the loop below
colnames(S) = c("X2","X3","X4","X5","X6","X7","X8","X9","X10","X11","X12","X13","X14","X15","X16","X17","X18","X19","X20")
rownames(S) = colnames(S)
# Make mean vector
mus = rep(1,19); names(mus) = colnames(S)
res <- list()
result <- list()
for(ii in 1:100){
df = mvrnorm(n = 1000, mu = mus, Sigma = S)
beta <- c(1, runif(19, min = -2.5, max = 2.5))
eps <- rnorm(1000, 0, 1)
sigma <- 0.2*(norm(df*beta, type = '2')/norm(eps, type = '2'))
y <- rowSums(df*beta + sigma*eps)
df <- data.frame(cbind(y, df))
ind = sample(rep(1:10,each = nrow(df)/10)) # split the dataset in k=10 subgroups
k <-lapply(split(1:nrow(df),ind), function(i) df[i,])
for(i in 1:10){
fit <-lm(formula = y ~ X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15+X16+X17+X18+X19+X20,
data= k[[i]])
res[[i]] <- fit$coefficients
}
result[[ii]] <- mean(res[[i]])
}
Could someone help me? Thank you in advance.
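As a side note on the setup code above: the 19 rows are written out by hand, but the exponents follow the pattern |i - j|, so the same matrix can be generated in one line. A minimal sketch, where the names p and S_alt are introduced only for illustration:

```r
rho <- 0.5
p <- 19   # number of predictors X2, ..., X20

# Entry (i, j) is rho^|i - j|, rounded to 4 decimals as in the question
S <- round(rho^abs(outer(seq_len(p), seq_len(p), "-")), 4)

# Equivalent construction: a Toeplitz matrix built from the first row
S_alt <- round(toeplitz(rho^(0:(p - 1))), 4)

# Same row and column names as in the question
dimnames(S) <- list(paste0("X", 2:20), paste0("X", 2:20))

S[1:3, 1:3]
```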
It may help to think about what structure you need. From what I can see, the result list can be calculated after you have consolidated your coefficients. If you prefer to have everything in a data.frame and keep track of the simulation number and split number, try the following:
library(purrr)
library(MASS)
library(dplyr)
library(broom)
regform =as.formula('y ~ X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15+X16+X17+X18+X19+X20')
func = function(ii,mus,S,matrix=FALSE){
df = mvrnorm(n = 1000, mu = mus, Sigma = S)
beta <- c(1, runif(19, min = -2.5, max = 2.5))
eps <- rnorm(1000, 0, 1)
sigma <- 0.2*(norm(df*beta, type = '2')/norm(eps, type = '2'))
y <- rowSums(df*beta + sigma*eps)
df <- data.frame(cbind(y, df))
df$ind = sample(rep(1:10,each = nrow(df)/10))
df <- df %>% group_by(ind) %>% do(tidy(lm(regform,data=.))) %>% mutate(sim=ii)
if(matrix){
return(split(df$estimate,df$ind))
}else{
return(df)
}
}
result = 1:100 %>% map_dfr(~func(.x,mus=mus,S=S,matrix=FALSE))
> head(result)
# A tibble: 6 x 7
# Groups: ind [1]
ind term estimate std.error statistic p.value sim
<int> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 1 (Intercept) 13.7 13.3 1.02 0.309 1
2 1 X2 -11.1 5.51 -2.02 0.0467 1
3 1 X3 5.61 5.86 0.957 0.341 1
4 1 X4 -1.48 6.22 -0.239 0.812 1
5 1 X5 -3.82 5.89 -0.649 0.518 1
6 1 X6 2.88 5.95 0.485 0.629 1
> tail(result)
# A tibble: 6 x 7
# Groups: ind [1]
ind term estimate std.error statistic p.value sim
<int> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 10 X15 11.9 6.41 1.85 0.0679 100
2 10 X16 -8.86 5.77 -1.54 0.128 100
3 10 X17 6.68 5.70 1.17 0.245 100
4 10 X18 3.73 5.81 0.641 0.523 100
5 10 X19 -5.28 5.55 -0.952 0.344 100
6 10 X20 1.14 5.40 0.211 0.833 100
As mentioned, to get the mean of the coefficients you need, simply group by sim and ind and summarise:
result %>% group_by(sim,ind) %>% summarize(estimate=mean(estimate))
# A tibble: 1,000 x 3
# Groups: sim [100]
sim ind estimate
<int> <int> <dbl>
1 1 1 0.800
2 1 2 0.771
3 1 3 0.807
4 1 4 0.277
5 1 5 0.632
6 1 6 0.788
7 1 7 0.878
8 1 8 0.987
9 1 9 0.764
10 1 10 0.611
# … with 990 more rows
I think the above is cleaner and makes it easier to keep track of things. The downside is that it uses a data.frame, which might be costly if you are running a very large number of regressions.
The other possibility is to return only the estimates, split by subgroup (the matrix = TRUE option):
result = map(1:100,~func(.x,mus=mus,S=S,matrix=TRUE))
And to get the means:
map(result,~map(.x,mean))
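Note that grouping by sim and ind averages over all terms within each fit. If what is wanted is instead one averaged estimate per coefficient (across subgroups and simulations), grouping by term may be closer to the goal. This is my reading of the question, not something the answer above states. A sketch on a small made-up table with the same column names as result:

```r
library(dplyr)

# Toy stand-in for `result`: 2 simulations x 2 subgroups x 2 terms
toy <- tibble(
  sim      = rep(1:2, each = 4),
  ind      = rep(rep(1:2, each = 2), times = 2),
  term     = rep(c("(Intercept)", "X2"), times = 4),
  estimate = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# One mean per coefficient, pooling all subgroups and simulations
per_term <- toy %>%
  group_by(term) %>%
  summarise(estimate = mean(estimate), .groups = "drop")

per_term
```

Applied to the real result, the same two lines after the pipe would give one row per model term.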
I am trying to create a new variable with mutate() based on the mean of each group. I tried running this code on the built-in data frame "ToothGrowth" available in R.
The results are incorrect: it seems to recycle the vector of group means across the rows instead of assigning each group its own mean.
A diagram showing what I am trying to achieve:
library(dplyr)
data("ToothGrowth")
head(ToothGrowth)
tg.tb01<-ToothGrowth %>%
group_by(supp, dose) %>% # mydata has 3 more variables.
summarise(mean = mean(len)) %>%
print()
ToothGrowth %>%
group_by(supp) %>%
mutate(submean2 = len - tg.tb01$mean/tg.tb01$dose)
You don't need another dataset to store the average. Look below:
library(dplyr)
library(datasets)
ToothGrowth %>%
group_by(supp, dose) %>%
mutate(lenmean = mean(len),
submean2 = len - lenmean/dose)
#> # A tibble: 60 x 5
#> # Groups: supp, dose [6]
#> len supp dose lenmean submean2
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 4.2 VC 0.5 7.98 -11.8
#> 2 11.5 VC 0.5 7.98 -4.46
#> 3 7.3 VC 0.5 7.98 -8.66
#> 4 5.8 VC 0.5 7.98 -10.2
#> 5 6.4 VC 0.5 7.98 -9.56
#> 6 10 VC 0.5 7.98 -5.96
#> 7 11.2 VC 0.5 7.98 -4.76
#> 8 11.2 VC 0.5 7.98 -4.76
#> 9 5.2 VC 0.5 7.98 -10.8
#> 10 7 VC 0.5 7.98 -8.96
#> # ... with 50 more rows
If I understand correctly, you should use mutate() instead of summarise(). Please specify your desired output with numbers.
tg.tb01<-ToothGrowth %>%
group_by(supp, dose) %>%
mutate(mean = mean(len)) %>%
ungroup() %>%
group_by(supp) %>%
mutate(submean2 = len - mean/dose)
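The difference between the two verbs is easy to see from the row counts. A minimal sketch (the object names summarised and mutated are mine): summarise() collapses each group to one row, while mutate() keeps every row and repeats the group mean alongside it, which is why the mean lines up with len.

```r
library(dplyr)

data("ToothGrowth")

# summarise(): one row per supp/dose combination
summarised <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(mean = mean(len), .groups = "drop")

# mutate(): all original rows, each carrying its own group's mean
mutated <- ToothGrowth %>%
  group_by(supp, dose) %>%
  mutate(mean = mean(len)) %>%
  ungroup()

nrow(summarised)   # 6
nrow(mutated)      # 60
```

Using tg.tb01$mean inside mutate() on the full data, as in the question, mixes a 6-element vector into a calculation over many more rows, so the values no longer correspond to the right groups.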
I would like to calculate the mean difference, with a confidence interval, for two variables across the levels of another categorical variable.
I am interested in the confidence intervals for p1, p2 and pdiff.
Thanks a lot.
library(tidyverse)
iris %>%
mutate(out1 = Sepal.Length < 6,
out2 = Sepal.Length < 5) %>%
group_by(Species) %>%
summarise(p1 = mean(out1),
p2 = mean(out2),
pdiff = p1 - p2)
# A tibble: 3 x 4
Species p1 p2 pdiff
<fct> <dbl> <dbl> <dbl>
1 setosa 1 0.4 0.6
2 versicolor 0.52 0.02 0.5
3 virginica 0.14 0.02 0.12
One way to get confidence intervals is via prop.test. You can run this test for each of your metrics (p1, p2, pdiff) and then extract the information you want using map_dbl.
library(tidyverse)
iris %>%
mutate(out1 = Sepal.Length < 6,
out2 = Sepal.Length < 5) %>%
group_by(Species) %>%
summarise(p1 = mean(out1),
p2 = mean(out2),
pdiff = p1 - p2,
p1_test = list(prop.test(sum(out1), length(out1))), # create tests for p1, p2 and diff and save the outputs as list
p2_test = list(prop.test(sum(out2), length(out2))),
pdiff_test = list(prop.test(c(sum(out1),sum(out2)), c(length(out1),length(out2)))),
p1_low = map_dbl(p1_test, ~.$conf.int[1]), # extract low and high confidence intervals based on the corresponding test
p1_high = map_dbl(p1_test, ~.$conf.int[2]),
p2_low = map_dbl(p2_test, ~.$conf.int[1]),
p2_high = map_dbl(p2_test, ~.$conf.int[2]),
pdiff_low = map_dbl(pdiff_test, ~.$conf.int[1]),
pdiff_high = map_dbl(pdiff_test, ~.$conf.int[2])) %>%
select(-matches("test")) # remove test columns
# # A tibble: 3 x 10
# Species p1 p2 pdiff p1_low p1_high p2_low p2_high pdiff_low pdiff_high
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 1 0.4 0.6 0.911 1 0.267 0.548 0.444 0.756
# 2 versicolor 0.52 0.02 0.5 0.376 0.661 0.00104 0.120 0.336 0.664
# 3 virginica 0.14 0.02 0.12 0.0628 0.274 0.00104 0.120 -0.00371 0.244
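To show what the map_dbl calls are extracting, here is a single test in isolation. It uses 20 successes out of 50, which corresponds to setosa's p2 = 0.4 in the table above. The object names test, ci, ci_low and ci_high are only for illustration.

```r
# One-sample proportion test: 20 of 50 setosa flowers have Sepal.Length < 5
test <- prop.test(x = 20, n = 50)

# conf.int is a length-2 numeric vector with a conf.level attribute
ci <- test$conf.int
ci_low  <- ci[1]
ci_high <- ci[2]

round(c(ci_low, ci_high), 3)   # matches p2_low and p2_high for setosa
attr(ci, "conf.level")         # 0.95 by default
```

A different confidence level can be requested through the conf.level argument of prop.test.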