Dplyr: Count number of observations in group and summarise? - r

I'm wondering if there is a more elegant way to do this.
Right now, I am grouping all observations by Species. Then I summarize the median values.
median <- iris %>%
group_by(Species) %>%
summarise(medianSL = median(Sepal.Length),
medianSW = median(Sepal.Width),
medianPL = median(Petal.Length),
medianPW = median(Petal.Width))
I also wanted a column (n) that shows the number of flowers in each group:
median_n <- iris %>%
group_by(Species) %>%
tally()
Can I combine these two code chunks, so that a single chunk generates a table with the median values AND the total n for each group?

We can use across() inside summarise() to loop over the numeric columns and get the median of each, and create the frequency count with n() outside the across() call:
library(dplyr)
library(stringr)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric),
~ median(.x, na.rm = TRUE),
.names = "median{str_remove_all(.col, '[a-z.]+')}"),
n = n(), .groups = "drop")
Output:
# A tibble: 3 × 6
Species medianSL medianSW medianPL medianPW n
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 setosa 5 3.4 1.5 0.2 50
2 versicolor 5.9 2.8 4.35 1.3 50
3 virginica 6.5 3 5.55 2 50
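If the across() call feels like too much for only four columns, a more literal option is to keep the summarise() from the question and add n() as one more column. This is only a sketch of that variant; the object name median_n is reused from the question.

```r
library(dplyr)

# Same summary as in the question, with the group size added as one more column.
median_n <- iris %>%
  group_by(Species) %>%
  summarise(
    medianSL = median(Sepal.Length),
    medianSW = median(Sepal.Width),
    medianPL = median(Petal.Length),
    medianPW = median(Petal.Width),
    n = n(),            # number of rows in the current group
    .groups = "drop"
  )

median_n
```

n() takes no arguments and is evaluated per group, so it can sit next to any other summary expression.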

Related

Group by or sum more column [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
I would like to know whether there is an "automatic" way to calculate several columns at the same time.
library(dplyr)
abc <- iris %>%
group_by(Species) %>%
summarise(abc = sum(Petal.Width)) %>%
ungroup()
abc2 <- iris %>%
group_by(Species) %>%
summarise(abc = sum(Sepal.Width)) %>%
ungroup()
I can use this code for each column, but what if I need to do this for more columns (in this dataset, the first four)? And is it possible to get the results in the same data frame?
Try this:
iris %>%
group_by(Species) %>%
summarise(across(Sepal.Length:Petal.Width, sum))
# A tibble: 3 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 250. 171. 73.1 12.3
2 versicolor 297. 138. 213 66.3
3 virginica 329. 149. 278. 101.
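The linked duplicate asks about several summary functions at once. As a sketch of how the same across() call could be extended for that, a named list of functions can be passed; the result name res is made up for this example. With a named list, dplyr names the output columns "{.col}_{.fn}" by default.

```r
library(dplyr)

# Sum and mean of the first four columns per species, in one call.
res <- iris %>%
  group_by(Species) %>%
  summarise(
    across(Sepal.Length:Petal.Width, list(sum = sum, mean = mean)),
    .groups = "drop"
  )

names(res)
```

This gives one row per species and two columns per input column, e.g. Sepal.Length_sum and Sepal.Length_mean.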

tidyverse and a $ subsetting peculiarity? intent for such behavior?

cdata is a tibble (I used haven to import a .sav file into the cdata object).
Why does using cdata$WEIGHT instead of WEIGHT produce such a radical difference in the output below?
this code uses cdata$WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(cdata$WEIGHT))
produces an unwanted table:
this code uses WEIGHT :
cdata %>% group_by(as.factor(state)) %>%
summarise(n = n(), weighted_n = sum(WEIGHT))
produces the correct table:
I realize that tibble has a different mental model than base R. However, the above difference doesn't make intuitive sense to me. What's the intent behind this difference in output when using a common column identification technique (cdata$WEIGHT)?
When there is a grouping variable, cdata$WEIGHT extracts the whole column from the original data frame, so the sum is taken over all rows regardless of group. The bare name WEIGHT is evaluated inside each group, so it returns only that group's values.
If we really want to use $, we can use the .data pronoun:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.data$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
which is identical to
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Or use cur_data() (deprecated since dplyr 1.1.0 in favour of pick()):
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(cur_data()$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 250.
2 versicolor 297.
3 virginica 329.
Whereas if we use .$ or iris$, the whole column is extracted and the grouping is ignored, so every group gets the same overall sum:
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = sum(.$Sepal.Length), .groups = 'drop')
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 876.
2 versicolor 876.
3 virginica 876.
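A quick way to convince yourself of this is to compute both versions side by side. This is just an illustrative check; the names total, by_group, group_sum and whole_sum are made up for the example.

```r
library(dplyr)

# Sum over the full column, computed outside of any pipeline.
total <- sum(iris$Sepal.Length)

by_group <- iris %>%
  group_by(Species) %>%
  summarise(
    group_sum = sum(Sepal.Length),       # evaluated within each group
    whole_sum = sum(iris$Sepal.Length),  # ignores the grouping entirely
    .groups = "drop"
  )

by_group
```

The group_sum values add up to total, while whole_sum simply repeats total in every row.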

Perform a different simple custom function based on group

I have data with three groups and would like to perform a different custom function on each of the three groups. Rather than write three separate functions, and calling them all separately, I'm wondering whether I can easily wrap all three into one function with a 'group' parameter.
For example, say I want the mean for group A:
library(tidyverse)
data(iris)
iris$Group <- c(rep("A", 50), rep("B", 50), rep("C", 50))
f_a <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length))
return(out)
}
The median for group B
f_b <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(median = median(Sepal.Length))
return(out)
}
And the standard deviation for group C
f_c <- function(df){
out <- df %>%
group_by(Species) %>%
summarise(sd= sd(Sepal.Length))
return(out)
}
Is there any way I can combine the above functions and run them according to a group parameter?? Like:
fx(df, group = "A")
Which would produce the results of the above f_a function??
Keeping in mind that in my actual use context, I can't simply group_by(group) in the original function, since the actual functions are more complex. Thanks!!
We use switch() inside the function to select the name of the function to apply, based on the value passed as group. The selected function is then called inside summarise() after grouping by 'Species':
fx <- function(df, group) {
fn_selector <- switch(group,
A = "mean",
B = "median",
C = "sd")
df %>%
group_by(Species) %>%
summarise(!! fn_selector :=
match.fun(fn_selector)(Sepal.Length), .groups = 'drop')
}
Testing:
fx(iris, "A")
# A tibble: 3 x 2
# Species mean
# <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
fx(iris, "B")
# A tibble: 3 x 2
# Species median
# <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
fx(iris, "C")
# A tibble: 3 x 2
# Species sd
# <fct> <dbl>
#1 setosa 0.352
#2 versicolor 0.516
#3 virginica 0.636
I don't understand the point of having group column in the dataset. When we pass group = "A" in the function this has got nothing to do with group column that was created.
Instead of passing group = "A" in the function and then mapping A to some function you can directly pass the function that you want to apply.
library(dplyr)
f_a <- function(df, fn){
out <- df %>%
group_by(Species) %>%
summarise(out = fn(Sepal.Length))
return(out)
}
f_a(iris, mean)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5.01
#2 versicolor 5.94
#3 virginica 6.59
f_a(iris, median)
# A tibble: 3 x 2
# Species out
#* <fct> <dbl>
#1 setosa 5
#2 versicolor 5.9
#3 virginica 6.5
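A middle ground between the two answers, sketched here under the assumption that the group label should stay the interface, is to look the function up in a named list. The names fx_lookup, fns and value are hypothetical and not from the original posts.

```r
library(dplyr)

fx_lookup <- function(df, group) {
  # Map each group label to the function that should be applied.
  fns <- list(A = mean, B = median, C = sd)
  if (!group %in% names(fns)) {
    stop("unknown group: ", group)
  }
  fn <- fns[[group]]

  df %>%
    group_by(Species) %>%
    summarise(value = fn(Sepal.Length), .groups = "drop")
}

res_b <- fx_lookup(iris, "B")
res_b
```

Unlike switch(), which silently returns NULL for an unmatched label, the explicit check fails early with a readable message. More complex per-group logic could be stored in the list as anonymous functions.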

Group t test result into columns within tidyverse

I'd like to collect multiple t-test results into one table. Originally my code looked like this:
tt_data <- iris %>%
group_by(Species) %>%
summarise(p = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$p.value,
estimate = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)$estimate
)
tt_data
# Species p estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
However, based on the idea that I should only perform the statistical test once, is there a way to run the t-test once per group and still get the intended table? I think there is some combination of broom and purrr that does this, but I am unfamiliar with the syntax.
# code idea (I know this won't work!)
tt_data <- iris %>%
group_by(Species) %>%
summarise(tt = t.test(Sepal.Length,Petal.Length,alternative="two.sided",paired=T)) %>%
select(Species, tt.p, tt.estimate)
tt_data
# Species tt.p tt.estimate
# setosa 2.542887e-51 3.544
# versicolor 9.667914e-36 1.676
# virginica 7.985259e-28 1.036
You can use broom::tidy() to transform the result of t.test() into a tidy tibble:
library(dplyr)
library(broom)
iris %>%
group_by(Species) %>%
group_modify(~{
t.test(.$Sepal.Length,.$Petal.Length,alternative="two.sided",paired=T) %>%
tidy()
}) %>%
select(estimate, p.value)
#> Adding missing grouping variables: `Species`
#> # A tibble: 3 x 3
#> # Groups: Species [3]
#> Species estimate p.value
#> <fct> <dbl> <dbl>
#> 1 setosa 3.54 2.54e-51
#> 2 versicolor 1.68 9.67e-36
#> 3 virginica 1.04 7.99e-28
Created on 2020-09-02 by the reprex package (v0.3.0)
You can tidy the list returned by t.test() into a data frame with broom::tidy, store it in a list-column, and then use map to select the desired values, i.e.
library(dplyr)
library(tidyr) # provides unnest()
iris %>%
group_by(Species) %>%
summarise(p = list(broom::tidy(t.test(Sepal.Length, Petal.Length, alternative = "two.sided", paired = TRUE)))) %>%
mutate(res = purrr::map(p, ~select(.x, c('p.value', 'estimate')))) %>%
select(-p) %>%
unnest(res) # name the list-column explicitly; a bare unnest() is deprecated
# A tibble: 3 x 3
# Species p.value estimate
# <fct> <dbl> <dbl>
#1 setosa 2.54e-51 3.54
#2 versicolor 9.67e-36 1.68
#3 virginica 7.99e-28 1.04
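If you would rather not depend on broom or purrr, a rough equivalent is to store the htest object in a list-column and pull out the pieces with base R's vapply(). This is only a sketch; the column names tt, p and estimate follow the "code idea" in the question.

```r
library(dplyr)

tt_data <- iris %>%
  group_by(Species) %>%
  # Run the paired t-test once per group and keep the whole result object.
  summarise(
    tt = list(t.test(Sepal.Length, Petal.Length,
                     alternative = "two.sided", paired = TRUE)),
    .groups = "drop"
  ) %>%
  # Extract the values of interest from each stored result.
  mutate(
    p = vapply(tt, function(x) x$p.value, numeric(1)),
    estimate = vapply(tt, function(x) unname(x$estimate), numeric(1))
  ) %>%
  select(Species, p, estimate)

tt_data
```

The test runs only once per species, and any other element of the result (for example conf.int or statistic) can be extracted in the same way.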

How to use "summarise" from dplyr with dynamic column names?

I am summarizing group means from a table using the summarize function from the dplyr package in R. I would like to do this dynamically, using a column name string stored in another variable.
The following is the "normal" way and it works, of course:
myTibble <- group_by( iris, Species)
summarise( myTibble, avg = mean( Sepal.Length))
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
However, I would like to do something like this instead:
myTibble <- group_by( iris, Species)
colOfInterest <- "Sepal.Length"
summarise( myTibble, avg = mean( colOfInterest))
I've read the Programming with dplyr page, and I've tried a bunch of combinations of quo, enquo, !!, .dots=(...), etc., but I haven't figured out the right way to do it yet.
I'm also aware of this answer, but 1) when I use the standard-evaluation function summarise_, R tells me that it's deprecated, and 2) that answer doesn't seem elegant at all. So, is there a good, easy way to do this?
Thank you!
1) Use !!sym(...) like this:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(!!sym(colOfInterest))) %>%
ungroup
giving:
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
2) A second approach is:
colOfInterest <- "Sepal.Length"
iris %>%
group_by(Species) %>%
summarize(avg = mean(.data[[colOfInterest]])) %>%
ungroup
Of course, this is straightforward in base R:
aggregate(list(avg = iris[[colOfInterest]]), iris["Species"], mean)
Another solution:
iris %>%
group_by(Species) %>%
summarise_at(vars("Sepal.Length"), mean) %>%
ungroup()
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
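Since summarise_at() is superseded in current dplyr, the same idea can be sketched with across() and all_of(), which takes a character vector of column names. The .names argument is used here only to get the avg column name from the question; avg_tbl is a made-up name.

```r
library(dplyr)

colOfInterest <- "Sepal.Length"

avg_tbl <- iris %>%
  group_by(Species) %>%
  summarise(
    # all_of() selects columns by the names stored in a character vector.
    across(all_of(colOfInterest), mean, .names = "avg"),
    .groups = "drop"
  )

avg_tbl
```

If colOfInterest held several names, .names would need a template such as "avg_{.col}" so that the output columns stay unique.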

Resources