How to mutate by groups in R?

I am trying to mutate a new variable based on the mean of each group. I tried running the code below using the built-in data frame "ToothGrowth" available in R.
The output is incorrect: it seems to recycle the vector of group means across all rows instead of assigning each group its own mean.
This is what I ran:
data("ToothGrowth")
head(ToothGrowth)
tg.tb01<-ToothGrowth %>%
group_by(supp, dose) %>% # mydata has 3 more variables.
summarise(mean = mean(len)) %>%
print()
ToothGrowth %>%
group_by(supp) %>%
mutate(submean2 = len - tg.tb01$mean/tg.tb01$dose)

You don't need another dataset to store the average. Look below:
library(dplyr)
library(datasets)

ToothGrowth %>%
  group_by(supp, dose) %>%
  mutate(lenmean = mean(len),
         submean2 = len - lenmean / dose)
#> # A tibble: 60 x 5
#> # Groups: supp, dose [6]
#> len supp dose lenmean submean2
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 4.2 VC 0.5 7.98 -11.8
#> 2 11.5 VC 0.5 7.98 -4.46
#> 3 7.3 VC 0.5 7.98 -8.66
#> 4 5.8 VC 0.5 7.98 -10.2
#> 5 6.4 VC 0.5 7.98 -9.56
#> 6 10 VC 0.5 7.98 -5.96
#> 7 11.2 VC 0.5 7.98 -4.76
#> 8 11.2 VC 0.5 7.98 -4.76
#> 9 5.2 VC 0.5 7.98 -10.8
#> 10 7 VC 0.5 7.98 -8.96
#> # ... with 50 more rows

If I understand correctly, you should use the following instead. Please specify your desired output with actual numbers.
tg.tb01 <- ToothGrowth %>%
  group_by(supp, dose) %>%
  mutate(mean = mean(len)) %>%
  ungroup() %>%
  group_by(supp) %>%
  mutate(submean2 = len - mean / dose)
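For completeness, the same per-group mean can also be computed once with summarise() and joined back onto the data. This is just a sketch of an equivalent approach, not part of the original answers (it assumes dplyr is loaded):
group_means <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(lenmean = mean(len), .groups = "drop")   # one row per supp/dose combination

ToothGrowth %>%
  left_join(group_means, by = c("supp", "dose")) %>% # attach the group mean to every row
  mutate(submean2 = len - lenmean / dose)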

Related

"fuzzy" inner_join in dplyr to keep both rows that do AND not exactly match

I am working with two datasets that I would like to join based not on exact matches between them, but rather on approximate matches. My question is similar to this OP.
Here are examples of what my two dataframes look like.
df1 is this one:
x
4.8
12
4
3.5
12.5
18
df2 is this one:
x y
4.8 6.6
12 1
4.5 1
3.5 0.5
13 1.8
15 2
I am currently using inner_join(df1, df2, by = c("x")) to join the two together.
This gives me:
x y
4.8 6.6
12 1
3.5 0.5
However, what I really want to do is join the two dfs based on these conditions:
any exact matches are joined first (exactly like how inner_join() currently works)
BUT, if there are no exact matches, then join to any match ± 0.5
The kind of output I am trying to get would look like this:
x y
4.8 6.6
12 1
4 1      # the y value is from x = 4.5 in df2
4 0.5    # the y value is from x = 3.5 in df2
3.5 0.5
12.5 1   # the y value is from x = 12 in df2
12.5 1.8 # the y value is from x = 13 in df2
I typically work in dplyr, so a dplyr solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr will be flexible enough to do a "fuzzy" join.
(I am aware of the fuzzyjoin package, but it doesn't seem to get at exactly what I am trying to do here)
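(For reference, the two example data frames can be reconstructed from the values shown above; a minimal setup, assuming tibbles are acceptable:)
library(tibble)

df1 <- tibble(x = c(4.8, 12, 4, 3.5, 12.5, 18))
df2 <- tibble(x = c(4.8, 12, 4.5, 3.5, 13, 15),
              y = c(6.6, 1, 1, 0.5, 1.8, 2))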
A possible solution, with no join:
library(tidyverse)

df1 %>%
  rename(x1 = x) %>%
  crossing(df2) %>%                     # all df1 x df2 combinations
  mutate(diff = abs(x1 - x)) %>%
  filter(diff <= 0.5) %>%               # keep pairs within +/- 0.5
  group_by(x1) %>%
  mutate(aux = any(diff == 0)) %>%      # does this x1 have an exact match?
  filter(aux * (diff == 0) | !aux) %>%  # if so, keep only the exact match
  select(-diff, -aux) %>%
  ungroup()
#> # A tibble: 7 × 3
#> x1 x y
#> <dbl> <dbl> <dbl>
#> 1 3.5 3.5 0.5
#> 2 4 3.5 0.5
#> 3 4 4.5 1
#> 4 4.8 4.8 6.6
#> 5 12 12 1
#> 6 12.5 12 1
#> 7 12.5 13 1.8
You could use {powerjoin}
library(powerjoin)

power_left_join(
  df1, df2,
  by = ~ .x$x == .y$x | (!.x$x %in% .y$x & .x$x <= .y$x + .5 & .x$x >= .y$x - .5),
  keep = "left")
#> x y
#> 1 4.8 6.6
#> 2 12.0 1.0
#> 3 4.0 1.0
#> 4 4.0 0.5
#> 5 3.5 0.5
#> 6 12.5 1.0
#> 7 12.5 1.8
Created on 2022-04-14 by the reprex package (v2.0.1)

How to compute means and sd using compare_means in ggpubr

compare_means is a straightforward function which I consider very useful:
library(ggpubr)
data("ToothGrowth")
df <- ToothGrowth

res <- compare_means(len ~ supp,
                     group.by = "dose",
                     data = df,
                     method = "wilcox.test", paired = FALSE)
However, to the best of my knowledge, it is not possible to obtain means and standard deviations (or standard errors) in the resulting table.
> res
# A tibble: 3 × 9
dose .y. group1 group2 p p.adj p.format p.signif method
<dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 0.5 len OJ VC 0.0232 0.046 0.023 * Wilcoxon
2 1 len OJ VC 0.00403 0.012 0.004 ** Wilcoxon
3 2 len OJ VC 1 1 1.000 ns Wilcoxon
>
What is the best way to obtain group 1 and group 2 means and SD/SE with a few lines of code? I would like to have mean (SD) values instead of the group labels OJ/VC.
Based on the documentation, there is no argument that does this directly.
UPDATE with my quick-and-dirty workaround:
library(ggpubr)
data("ToothGrowth")
df <- ToothGrowth

p <- ggbarplot(df, x = "supp", y = "len",
               add = c("mean_sd"),
               facet.by = "dose",
               position = position_dodge(0.8)) +
  stat_compare_means(method = "wilcox.test", paired = FALSE)

# Extracting all ggplot info
my_data <- ggplot_build(p)

# Extracting means and standard deviations from the plot
my_means_sd <- (my_data[["data"]][[2]])[, 1:5]
my_means_sd$labs <- paste0(my_means_sd$y,
                           " (",
                           round(my_means_sd$ymin, 1),
                           "-",
                           round(my_means_sd$ymax, 1),
                           ")")
my_means_sd <- my_means_sd[, c("x", "labs")]

# Reshaping to one row per dose
library(dplyr)
my_means_sd <- as.data.frame(my_means_sd %>%
  group_by(x) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = x, values_from = labs) %>%
  select(-row))

# Extracting p-values from the plot
my_pvalues <- (my_data[["data"]][[3]])[, 9:13]

res <- cbind(my_means_sd, my_pvalues)
The result I generated:
> res
1 2 p p.adj p.format p.signif method
1 13.23 (8.8-17.7) 7.98 (5.2-10.7) 0.023186427 0.023 0.023 * Wilcoxon
2 22.7 (18.8-26.6) 16.77 (14.3-19.3) 0.004030367 0.004 0.004 ** Wilcoxon
3 26.06 (23.4-28.7) 26.14 (21.3-30.9) 1.000000000 1.000 1 ns Wilcoxon
>
Something like this?
library(dplyr)
library(tidyr)

df %>%
  group_by(supp, dose) %>%
  summarise(mean = mean(len), sd = sd(len)) %>%
  pivot_wider(
    names_from = supp,
    values_from = c(mean, sd)
  ) %>%
  right_join(res, by = "dose") %>%
  select(-c(group1, group2, .y.))
dose mean_OJ mean_VC sd_OJ sd_VC p p.adj p.format p.signif method
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 0.5 13.2 7.98 4.46 2.75 0.0232 0.046 0.023 * Wilcoxon
2 1 22.7 16.8 3.91 2.52 0.00403 0.012 0.004 ** Wilcoxon
3 2 26.1 26.1 2.66 4.80 1 1 1.000 ns Wilcoxon
>
You could use another function, desc_statby.
Here is a nice example:
library(ggpubr)
data("ToothGrowth")

res <- desc_statby(ToothGrowth, measure.var = "len",
                   grps = c("dose", "supp"))
head(res[, 1:10])
dose supp length min max median mean iqr mad sd
1 0.5 OJ 10 8.2 21.5 12.25 13.23 6.475 4.29954 4.459709
2 0.5 VC 10 4.2 11.5 7.15 7.98 4.950 3.55824 2.746634
3 1.0 OJ 10 14.5 27.3 23.45 22.70 5.350 3.92889 3.910953
4 1.0 VC 10 13.6 22.5 16.50 16.77 2.025 1.70499 2.515309
5 2.0 OJ 10 22.4 30.9 25.95 26.06 2.500 2.07564 2.655058
6 2.0 VC 10 18.5 33.9 25.95 26.14 5.425 4.59606 4.797731
You can take the necessary info from both tables and combine it into one result table...
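For instance, here is a rough sketch of that combination (not from the original answer): it builds a "mean (sd)" label per supp/dose group from desc_statby() and joins it onto the res object returned by compare_means() above.
library(dplyr)
library(tidyr)

stats <- desc_statby(ToothGrowth, measure.var = "len", grps = c("dose", "supp"))

stats %>%
  mutate(label = paste0(round(mean, 2), " (", round(sd, 2), ")")) %>%  # "mean (sd)" per supp/dose
  select(dose, supp, label) %>%
  pivot_wider(names_from = supp, values_from = label) %>%              # one row per dose, OJ/VC columns
  right_join(res, by = "dose") %>%                                     # attach the test results
  select(-.y., -group1, -group2)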

R - how to split into terciles during group_by

Let's say we want to calculate the means of sepal length based on tercile groups of sepal width.
We can use the split_quantile function from the fabricatr package and do the following:
library(dplyr)
library(fabricatr)

iris %>%
  group_by(split_quantile(Sepal.Width, 3)) %>%
  summarise(Sepal.Length = mean(Sepal.Length))
So far so good. Now, let's say we want to group_by(Species, split_quantile(Sepal.Width, 3)) instead of just group_by(split_quantile(Sepal.Width, 3)). What if we want the terciles to be calculated within each species type rather than over the whole data set?
Basically, what I'm looking for could be achieved by splitting iris into several data frames based on Species, using split_quantile on each of those data frames to calculate terciles, and then binding them back together. However, I'm looking for a way to do this without splitting the data frame.
You have basically written the answer in your text: create a new variable for the tercile after grouping by Species, then regroup by both Species and Tercile.
library(tidyverse)
library(fabricatr)

iris %>%
  group_by(Species) %>%
  mutate(Tercile = split_quantile(Sepal.Width, 3)) %>%
  group_by(Species, Tercile) %>%
  summarise(Sepal.Length = mean(Sepal.Length))
#> # A tibble: 9 x 3
#> # Groups: Species [3]
#> Species Tercile Sepal.Length
#> <fct> <fct> <dbl>
#> 1 setosa 1 4.69
#> 2 setosa 2 5.08
#> 3 setosa 3 5.27
#> 4 versicolor 1 5.61
#> 5 versicolor 2 6.12
#> 6 versicolor 3 6.22
#> 7 virginica 1 6.29
#> 8 virginica 2 6.73
#> 9 virginica 3 6.81
Created on 2020-05-27 by the reprex package (v0.3.0)
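If you would rather not add a package dependency, dplyr's ntile() gives a similar rank-based split into three groups. This is a sketch, not from the original answer; note that ntile() breaks ties by row order, so its cut points can differ slightly from split_quantile()'s.
library(dplyr)

iris %>%
  group_by(Species) %>%
  mutate(Tercile = ntile(Sepal.Width, 3)) %>%  # rank-based terciles within each species
  group_by(Species, Tercile) %>%
  summarise(Sepal.Length = mean(Sepal.Length), .groups = "drop")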

Predicting values with dplyr and augment

I'd like to fit models to a grouped data frame and then predict one new value per model (i.e. group).
library(dplyr)
library(broom)

data(iris)
dat <- rbind(iris, iris)
dat$Group <- rep(c("A", "B"), each = 150)

new.dat <- data.frame(Group = rep(c("A", "B"), each = 3),
                      Species = rep(c("setosa", "versicolor", "virginica"), times = 2),
                      Sepal.Width = 1:6)
> new.dat
Group Species Sepal.Width
1 A setosa 1
2 A versicolor 2
3 A virginica 3
4 B setosa 4
5 B versicolor 5
6 B virginica 6
However, augment returns 36 rows, as if each new value is fit with each model. How can I preserve the grouping here and get one fitted value per group?
dat %>%
  group_by(Species, Group) %>%
  do(augment(lm(Sepal.Length ~ Sepal.Width, data = .), newdata = new.dat))
# A tibble: 36 x 5
# Groups: Species, Group [6]
Group Species Sepal.Width .fitted .se.fit
<fct> <fct> <int> <dbl> <dbl>
1 A setosa 1 3.33 0.221
2 A versicolor 2 4.02 0.133
3 A virginica 3 4.71 0.0512
4 B setosa 4 5.40 0.0615
5 B versicolor 5 6.09 0.145
6 B virginica 6 6.78 0.234
7 A setosa 1 3.33 0.221
8 A versicolor 2 4.02 0.133
9 A virginica 3 4.71 0.0512
10 B setosa 4 5.40 0.0615
# ... with 26 more rows
(Note that, due to the example data, the rows are actually duplicates; this is not the case with my original data.)
You need to make the Species and Group of new.dat match those of the group currently being processed in do. You can do this like so:
group.cols <- c("Species", "Group")

dat %>%
  group_by(!!!rlang::syms(group.cols)) %>%  # turn the column names into symbols before splicing
  do(augment(lm(Sepal.Length ~ Sepal.Width, data = .),
             newdata = semi_join(new.dat, ., by = group.cols)))
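As a side note, do() is superseded in recent dplyr versions. A rough alternative (a sketch assuming dplyr >= 1.0, not from the original answer) is to nest the data, fit one model per group, and join the fitted models onto new.dat:
library(dplyr)
library(purrr)

# One model per Species/Group combination
fits <- dat %>%
  nest_by(Species, Group) %>%
  mutate(fit = list(lm(Sepal.Length ~ Sepal.Width, data = data))) %>%
  ungroup()

# Predict each new observation with its own group's model
# (Species is a factor in dat and character in new.dat; the join coerces with a note)
new.dat %>%
  left_join(select(fits, Species, Group, fit), by = c("Species", "Group")) %>%
  mutate(.fitted = map2_dbl(fit, Sepal.Width,
                            ~ predict(.x, newdata = data.frame(Sepal.Width = .y)))) %>%
  select(-fit)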

R dplyr: Mutate with Reduce with init, after group_by

Is it possible to specify an initial value for Reduce without it being added to the data frame?
For example, with the function:
f <- function(x, y) if (y < 0) -x * y else x + y
acting on this data frame:
library(dplyr)
library(tibble)

set.seed(0)
df <- c(-0.9, sample(c(-0.9, 1:3), 9, replace = TRUE)) %>% tibble()
names(df) <- "x"
df <- df %>% mutate(id = 'a')
df$id[6:10] <- 'b'
df <- df %>% group_by(id) %>% mutate(sumprod = Reduce(f, x, acc = TRUE)) %>% ungroup()
df$target <- c(0, 3, 4, 5, 7, 3, 2.7, 5.7, 8.7, 10.7)
df
# A tibble: 10 x 4
x id sumprod target
<dbl> <chr> <dbl> <dbl>
1 -0.9 a -0.9 0.0
2 3.0 a 2.1 3.0
3 1.0 a 3.1 4.0
4 1.0 a 4.1 5.0
5 2.0 a 6.1 7.0
6 3.0 b 3.0 3.0
7 -0.9 b 2.7 2.7
8 3.0 b 5.7 5.7
9 3.0 b 8.7 8.7
10 2.0 b 10.7 10.7
The goal is column target. I've tried using init with Reduce, however that adds an extra element.
Reduce(f, df$x[1:5], acc=TRUE, init=0)
[1] 0 0 3 4 5 7
Using this within mutate produces an error:
> df <- df %>% group_by(id) %>% mutate(sumprod = Reduce(f, x, acc=TRUE, init=0)) %>% ungroup()
Error in mutate_impl(.data, dots) :
Column `sumprod` must be length 5 (the group size) or one, not 6
If init is given, Reduce logically adds it to the start of x (when proceeding left to right) or to the end (when right = TRUE). If you don't need that extra element, you can use tail(..., -1) to drop the first one:
df %>%
  group_by(id) %>%
  mutate(sumprod = tail(Reduce(f, x, acc = TRUE, init = 0), -1)) %>%
  ungroup()
# A tibble: 10 x 4
# x id sumprod target
# <dbl> <chr> <dbl> <dbl>
# 1 -0.9 a 0.0 0.0
# 2 3.0 a 3.0 3.0
# 3 1.0 a 4.0 4.0
# 4 1.0 a 5.0 5.0
# 5 2.0 a 7.0 7.0
# 6 3.0 b 3.0 3.0
# 7 -0.9 b 2.7 2.7
# 8 3.0 b 5.7 5.7
# 9 3.0 b 8.7 8.7
#10 2.0 b 10.7 10.7
With the tidyverse, there is accumulate() from purrr:
library(tidyverse)

df %>%
  group_by(id) %>%
  mutate(sumprod = accumulate(.x = x, .f = f, .init = 0)[-1]) %>%
  ungroup()
# A tibble: 10 x 3
# x id sumprod
# <dbl> <chr> <dbl>
# 1 -0.9 a 0.0
# 2 3.0 a 3.0
# 3 1.0 a 4.0
# 4 1.0 a 5.0
# 5 2.0 a 7.0
# 6 3.0 b 3.0
# 7 -0.9 b 2.7
# 8 3.0 b 5.7
# 9 3.0 b 8.7
#10 2.0 b 10.7
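For reference, running accumulate() outside of mutate() shows the same leading element that [-1] (or tail(..., -1)) removes, mirroring the Reduce() call from the question:
library(purrr)

f <- function(x, y) if (y < 0) -x * y else x + y
accumulate(c(-0.9, 3, 1, 1, 2), f, .init = 0)
#> [1] 0 0 3 4 5 7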

Resources