dplyr-summarise, keep the original group name - r

iris %>% mutate(subgroup=rep(c('A','B'),75)) %>% group_by(Species) %>% summarise(SLmin=min(Sepal.Length))
Species SLmin
<fct> <dbl>
1 setosa 4.3
2 versicolor 4.9
3 virginica 4.9
I want to keep the original subgroup name.
but
iris %>% mutate(subgroup=rep(c('A','B'),75)) %>% group_by(Species,subgroup) %>% summarise(SLmin=min(Sepal.Length))
Species subgroup SLmin
<fct> <chr> <dbl>
1 setosa A 4.4
2 setosa B 4.3
3 versicolor A 5
4 versicolor B 4.9
5 virginica A 4.9
6 virginica B 5.6
this code cannot get minimum at each species.
do you know any idea?
PS:
It was hard to explain, so I'll fix it.
I need subgroups.
After summarizing the results.
setosa B 4.3
versicolor B 4.9
virginica A 4.9

You can use which.min to get index of minimum value of Sepal.Length, this index can be used to subset corresponding subgroup value.
library(dplyr)
iris %>%
mutate(subgroup=rep(c('A','B'),75)) %>%
group_by(Species) %>%
summarise(SLmin=min(Sepal.Length),
subgroup = subgroup[which.min(Sepal.Length)])
# Species SLmin subgroup
# <fct> <dbl> <chr>
#1 setosa 4.3 B
#2 versicolor 4.9 B
#3 virginica 4.9 A
Also an alternative is to select the minimum row for each Species and then select only those columns that we need in the final output.
iris %>%
mutate(subgroup=rep(c('A','B'),75)) %>%
group_by(Species) %>%
slice(which.min(Sepal.Length))

Related

How to add new row with concatenated strings for each group? [duplicate]

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do like JasonWang described in his comment, as other functions like mutate or summarise expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, in general do can be slow and should be a last resort if you cannot achieve your result in another way. Your task is quite simple because it only involves adding extra rows in your data frame, which can be done by simple indexing, e.g. look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(iris, Species)) %>%
map(~c(NA, .x)) %>%
unlist
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
A more recent version would be using group_modify() instead of do().
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This also can be used for grouped data frame, however, it's a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))

Extract min/max by group in R

(Using Iris for reproducibility)
I want to calculate min/max row by Petal.Width & grouped by Species in R. I have done that using two approaches, I want to understand is there a better approach (preferably tidyverse) , also note because of ties answer might vary in both. Please correct if there is anything wrong in both these approaches.
Approach 1
library(tidyverse)
iris %>%
group_by(Species) %>%
slice_max(Petal.Width, n = 1, with_ties=FALSE) %>%
rbind(
iris %>%
group_by(Species) %>%
slice_min(Petal.Width, n = 1, with_ties=FALSE))
Approach 2
iris %>%
group_by(Species) %>%
arrange(Petal.Width) %>%
filter(row_number() %in% c(1,n()))
Here is a the way to do it with summarise(across()):
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(across(.cols = Petal.Width,
.fns = list(min = min, max = max),
.names = "{col}_{fn}"))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Species Petal.Width_min Petal.Width_max
<fct> <dbl> <dbl>
1 setosa 0.1 0.6
2 versicolor 1 1.8
3 virginica 1.4 2.5
You could easily find the min and max of every numerical variable in a data set this way:
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric),
.fns = list(min = min, max = max),
.names = "{col}_{fn}"))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 9
Species Sepal.Length_min Sepal.Length_max Sepal.Width_min Sepal.Width_max Petal.Length_min Petal.Length_max Petal.Width_min Petal.Width_max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 5.8 2.3 4.4 1 1.9 0.1 0.6
2 versicolor 4.9 7 2 3.4 3 5.1 1 1.8
3 virginica 4.9 7.9 2.2 3.8 4.5 6.9 1.4 2.5
You could also use slice like below:
iris %>%
group_by(Species) %>%
slice(which.min(Petal.Width),
which.max(Petal.Width))
Output:
# A tibble: 6 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5 3.5 1.6 0.6 setosa
2 5.9 3.2 4.8 1.8 versicolor
3 6.3 3.3 6 2.5 virginica
4 4.9 3.1 1.5 0.1 setosa
5 4.9 2.4 3.3 1 versicolor
6 6.1 2.6 5.6 1.4 virginica
Using aggregate.
aggregate(Petal.Width ~ Species, iris, function(x) c(min=min(x), max=max(x)))
# Species Petal.Width.min Petal.Width.max
# 1 setosa 0.1 0.6
# 2 versicolor 1.0 1.8
# 3 virginica 1.4 2.5

Is there a way to use a lookup value from a table in a mutate column?

library(tidyverse)
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
filter(rank == 4) %>%
select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
When ever I think "now I need to use VLOOKUP like I did in the past in Excel" I find the left_join() function helpful. It's also part of the dplyr package. Instead of "looking up" values in one table in another table, it's easier for R to just make one bigger table where one table remains unchanged (here the "left" one or the first term you put in the function) and the other is added using a column or columns they have in common as an index.
In your specific example, I can't entirely understand what you want new_col to have in it. If you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
The question is not clear since it does not mention the purpose of a Vlookup or Index/Match like operation from Excel.
Also, you don't mention what value should "new_col" have if rank is not equal to 4.
Assuming the value is NA, the below solution with a simple ifelse would work:
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
ungroup() %>%
mutate(new_col = ifelse(rank == 4, Sepal.Width,NA))
df

Mutate new variable based on group-level statistic

I want to append the group-maximum to table of observations, e.g:
iris %>% split(iris$Species) %>%
lapply(function(l) mutate(l, species_max = max(Sepal.Width))) %>%
bind_rows() %>% .[c(1,51,101),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_max
1 5.1 3.5 1.4 0.2 setosa 4.4
51 7.0 3.2 4.7 1.4 versicolor 3.4
101 6.3 3.3 6.0 2.5 virginica 3.8
Is there a more elegant dplyr::group_by solution to achieve this?
How about this:
group_by(iris, Species) %>%
mutate(species_max = max(Sepal.Width)) %>%
slice(1)
# Source: local data frame [3 x 6]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_max
# <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 4.4
# 2 7.0 3.2 4.7 1.4 versicolor 3.4
# 3 6.3 3.3 6.0 2.5 virginica 3.8
The difficulty here is that you need to summarise multiple columns (for which summarise_all would be great) but at the same time you need to add a new column (for which you either need a simple summarise or mutate call).
In this regard data.table allows greater flexibility since it only relies on a list in its j-argument. So you can do it as follows with data.table, just as a comparison:
library(data.table)
dt <- as.data.table(iris)
dt[, c(lapply(.SD, first), species_max = max(Sepal.Width)), by = Species]

Add row in each group using dplyr and add_row()

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do like JasonWang described in his comment, as other functions like mutate or summarise expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, in general do can be slow and should be a last resort if you cannot achieve your result in another way. Your task is quite simple because it only involves adding extra rows in your data frame, which can be done by simple indexing, e.g. look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(iris, Species)) %>%
map(~c(NA, .x)) %>%
unlist
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
A more recent version would be using group_modify() instead of do().
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This also can be used for grouped data frame, however, it's a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))

Resources