(Using iris for reproducibility)
I want to get the rows with the minimum and maximum Petal.Width within each Species in R. I have done this with the two approaches below and would like to know whether there is a better approach (preferably tidyverse). Note that because of ties, the two approaches may return different rows. Please also point out anything that is wrong with either approach.
Approach 1
library(tidyverse)
iris %>%
  group_by(Species) %>%
  slice_max(Petal.Width, n = 1, with_ties = FALSE) %>%
  rbind(
    iris %>%
      group_by(Species) %>%
      slice_min(Petal.Width, n = 1, with_ties = FALSE))
Approach 2
iris %>%
  group_by(Species) %>%
  arrange(Petal.Width) %>%
  filter(row_number() %in% c(1, n()))
Here is a way to do it with summarise(across()). Note that it returns the minimum and maximum values rather than the full rows:
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(across(.cols = Petal.Width,
                   .fns = list(min = min, max = max),
                   .names = "{.col}_{.fn}"))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Species Petal.Width_min Petal.Width_max
<fct> <dbl> <dbl>
1 setosa 0.1 0.6
2 versicolor 1 1.8
3 virginica 1.4 2.5
You could easily find the min and max of every numerical variable in a data set this way:
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric),
                   .fns = list(min = min, max = max),
                   .names = "{col}_{fn}"))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 9
Species Sepal.Length_min Sepal.Length_max Sepal.Width_min Sepal.Width_max Petal.Length_min Petal.Length_max Petal.Width_min Petal.Width_max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 5.8 2.3 4.4 1 1.9 0.1 0.6
2 versicolor 4.9 7 2 3.4 3 5.1 1 1.8
3 virginica 4.9 7.9 2.2 3.8 4.5 6.9 1.4 2.5
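The `summarise()` ungrouping line in the outputs above is only an informational message. A minimal sketch of how it can be silenced with the `.groups` argument that the message mentions, assuming dplyr 1.0.0 or later (the object name width_range is only illustrative):

```r
library(dplyr)

# Min and max Petal.Width per species. .groups = "drop" returns an
# ungrouped tibble and suppresses the informational message.
width_range <- iris %>%
  group_by(Species) %>%
  summarise(Petal.Width_min = min(Petal.Width),
            Petal.Width_max = max(Petal.Width),
            .groups = "drop")
```

As with the across() version, this gives the extreme values only, not the rows they come from.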
You could also use slice like below:
iris %>%
  group_by(Species) %>%
  slice(which.min(Petal.Width),
        which.max(Petal.Width))
Output:
# A tibble: 6 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5 3.5 1.6 0.6 setosa
2 5.9 3.2 4.8 1.8 versicolor
3 6.3 3.3 6 2.5 virginica
4 4.9 3.1 1.5 0.1 setosa
5 4.9 2.4 3.3 1 versicolor
6 6.1 2.6 5.6 1.4 virginica
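If it helps to see which returned row is the minimum and which is the maximum, the slice() idea can be extended with a label column. This is only a sketch: the column name which_extreme is made up for illustration, and ties are resolved by which.min()/which.max(), which return the first matching row in each group.

```r
library(dplyr)

extremes <- iris %>%
  group_by(Species) %>%
  # Within each group, keep the first minimum row, then the first maximum row.
  slice(c(which.min(Petal.Width), which.max(Petal.Width))) %>%
  # Each group now has exactly two rows, in the order min, max.
  mutate(which_extreme = c("min", "max")) %>%
  ungroup()
```

Filtering on which_extreme afterwards gives either set of rows on its own.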
Using aggregate.
aggregate(Petal.Width ~ Species, iris, function(x) c(min=min(x), max=max(x)))
# Species Petal.Width.min Petal.Width.max
# 1 setosa 0.1 0.6
# 2 versicolor 1.0 1.8
# 3 virginica 1.4 2.5
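One thing to be aware of with the aggregate() call above: when the function returns a vector, the result holds a single matrix column rather than two separate columns, even though it prints as if there were two. A small sketch of flattening it with do.call(data.frame, ...) (the object names res and flat are illustrative):

```r
res <- aggregate(Petal.Width ~ Species, iris,
                 function(x) c(min = min(x), max = max(x)))

# res has two columns: Species and a two-column matrix named Petal.Width.
# do.call(data.frame, ...) expands the matrix into ordinary columns.
flat <- do.call(data.frame, res)
```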
If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
iris %>%
  add_row(.before = 0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
  group_by(Species) %>%
  add_row(.before = 0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do(), as JasonWang described in his comment, because other functions such as mutate() or summarise() expect a result with the same number of rows as the grouped data frame (in your case, 50) or with exactly one row (e.g. when summarising).
As you probably know, do() can be slow in general and should be a last resort when you cannot achieve your result any other way. Your task is quite simple because it only involves adding extra rows to your data frame, which can be done by simple indexing; for example, look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
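The indexing idea can be checked directly. The sketch below uses the base data frame under the illustrative name iris_df so that it does not depend on the tibble conversion from the question, and it relies on iris being sorted by Species, as stated above:

```r
iris_df <- datasets::iris

# An NA index yields a row of NAs, so each NA below inserts an
# empty row ahead of the 50 rows of one species.
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
padded <- iris_df[indices, ]
```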
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
  split(group_indices(iris, Species)) %>%
  map(~ c(NA, .x)) %>%
  unlist()
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
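For readers who would rather not load purrr, the same index vector can be sketched in base R with split() and lapply(). The names iris_df and rows are illustrative, and splitting on the Species column is my substitution for group_indices():

```r
iris_df <- datasets::iris

# Split the row numbers by species, prepend NA to each piece,
# then flatten back into a single integer vector.
indices <- unlist(
  lapply(split(seq_len(nrow(iris_df)), iris_df$Species),
         function(rows) c(NA, rows)),
  use.names = FALSE
)
padded <- iris_df[indices, ]
```

Unlike the hard-coded vector, this does not require the groups to have 50 rows each.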
A more recent approach uses group_modify() instead of do():
iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  group_modify(~ add_row(.x, .before = 0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(dplyr)   # for %>% and group_split()
library(purrr)
library(tibble)
iris %>%
  group_split(Species) %>%
  map_dfr(~ .x %>%
            add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This can also be done on a grouped data frame, although it is a bit verbose:
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(Sepal.Length = c(NA, Sepal.Length),
            Sepal.Width = c(NA, Sepal.Width),
            Petal.Length = c(NA, Petal.Length),
            Petal.Width = c(NA, Petal.Width),
            Species = c(NA, Species))
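In dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). A shorter sketch under that version assumption is shown below; note that, unlike the code above, it leaves Species filled in on the added rows, because the grouping column is not touched by across():

```r
library(dplyr)

# Prepend NA to every non-grouping column within each species.
# .by groups for this call only, so the result is ungrouped.
padded <- iris %>%
  reframe(across(everything(), ~ c(NA, .x)), .by = Species)
```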
iris %>% mutate(subgroup=rep(c('A','B'),75)) %>% group_by(Species) %>% summarise(SLmin=min(Sepal.Length))
Species SLmin
<fct> <dbl>
1 setosa 4.3
2 versicolor 4.9
3 virginica 4.9
I want to keep the original subgroup name. However, if I group by both columns:
iris %>% mutate(subgroup=rep(c('A','B'),75)) %>% group_by(Species,subgroup) %>% summarise(SLmin=min(Sepal.Length))
Species subgroup SLmin
<fct> <chr> <dbl>
1 setosa A 4.4
2 setosa B 4.3
3 versicolor A 5
4 versicolor B 4.9
5 virginica A 4.9
6 virginica B 5.6
this code no longer gives the minimum for each species.
Do you have any ideas?
PS:
That was hard to explain, so let me clarify.
I need the subgroup that the minimum row belongs to.
This is the result I want after summarising:
setosa B 4.3
versicolor B 4.9
virginica A 4.9
You can use which.min() to get the index of the minimum Sepal.Length within each group, and then use that index to pick the corresponding subgroup value.
library(dplyr)
iris %>%
  mutate(subgroup = rep(c('A', 'B'), 75)) %>%
  group_by(Species) %>%
  summarise(SLmin = min(Sepal.Length),
            subgroup = subgroup[which.min(Sepal.Length)])
# Species SLmin subgroup
# <fct> <dbl> <chr>
#1 setosa 4.3 B
#2 versicolor 4.9 B
#3 virginica 4.9 A
Alternatively, select the minimum row for each Species and then keep only the columns needed in the final output.
iris %>%
  mutate(subgroup = rep(c('A', 'B'), 75)) %>%
  group_by(Species) %>%
  slice(which.min(Sepal.Length)) %>%
  select(Species, SLmin = Sepal.Length, subgroup)
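The same row selection can also be sketched with slice_min(), assuming dplyr 1.0.0 or later. With with_ties = FALSE it keeps a single row per group, like which.min(); the object name min_rows is illustrative:

```r
library(dplyr)

min_rows <- iris %>%
  mutate(subgroup = rep(c("A", "B"), 75)) %>%
  group_by(Species) %>%
  # One row per species: the smallest Sepal.Length, ties broken by row order.
  slice_min(Sepal.Length, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Species, SLmin = Sepal.Length, subgroup)
```

Setting with_ties = TRUE instead would return every row that shares the minimum, which may be more than one row per species in other data.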
This is about the latest tidyr release (1.0.0). I am trying the pivot_wider() and pivot_longer() functions from library(tidyr).
I expected to get the normal iris dataset back when I run the code below, but instead I get a nested-looking 3 x 5 tibble. I am not sure what is happening (I read https://tidyr.tidyverse.org/articles/pivot.html) or how to avoid it.
library(tidyr)
iris %>%
  pivot_longer(-Species, values_to = "count") %>%
  pivot_wider(names_from = name, values_from = count)
Expected output: the normal iris dataset (150 x 5)
Edit: I read below that if I wrap the result in unnest() I get the expected output. I do not understand why I need to unnest when I never nested anything. Any basic help would be appreciated; I want to understand the concept of what went wrong.
As I learnt from Akrun, other helpful friends, and related posts (this is not a bug):
spread(., name, count) throws an error because there are multiple rows for each Species x name combination. pivot_wider() handles this more gracefully by returning list-columns instead. If we add a unique ID to each row, it works fine.
library(tidyverse)
iris %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, Species), values_to = "count") %>%
  pivot_wider(names_from = name, values_from = count) %>%
  select(-rowid)
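To see the duplication that the explanation above refers to, one can count the rows per Species and name in the long data. A small sketch (the object names long and dup_counts are illustrative):

```r
library(dplyr)
library(tidyr)

long <- iris %>%
  pivot_longer(-Species, values_to = "count")

# Each Species x name pair occurs 50 times, so pivot_wider() has no
# way to place a single value in each cell without a row identifier.
dup_counts <- long %>%
  count(Species, name)
```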
pivot_wider(), unlike nest(), allows us to aggregate multiple values when the rows are not given a unique identifier.
By default it aggregates with list() and issues a warning about it.
To expand the output we could use unnest(), as already suggested, but it is more idiomatic to use unchop() because the nested values are plain vectors with no columns of their own to expand.
To sum up, to get back your initial data (except that it will be a tibble), you can do:
library(tidyr)
iris %>%
  pivot_longer(-Species, values_to = "count") %>%
  print() %>%
  pivot_wider(names_from = name,
              values_from = count,
              values_fn = list(count = list)) %>%
  print() %>%
  unchop(everything()) %>%
  print() %>%
  all.equal(iris)
#> # A tibble: 600 x 3
#> Species name count
#> <fct> <chr> <dbl>
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Width 3.5
#> 3 setosa Petal.Length 1.4
#> 4 setosa Petal.Width 0.2
#> 5 setosa Sepal.Length 4.9
#> 6 setosa Sepal.Width 3
#> 7 setosa Petal.Length 1.4
#> 8 setosa Petal.Width 0.2
#> 9 setosa Sepal.Length 4.7
#> 10 setosa Sepal.Width 3.2
#> # ... with 590 more rows
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <list<dbl>> <list<dbl>> <list<dbl>> <list<dbl>>
#> 1 setosa [50] [50] [50] [50]
#> 2 versicolor [50] [50] [50] [50]
#> 3 virginica [50] [50] [50] [50]
#> # A tibble: 150 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4
#> 7 setosa 4.6 3.4 1.4 0.3
#> 8 setosa 5 3.4 1.5 0.2
#> 9 setosa 4.4 2.9 1.4 0.2
#> 10 setosa 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
#> [1] TRUE
Created on 2019-09-15 by the reprex package (v0.3.0)
Could someone explain how I'd use something from the apply family to carry this out across every data frame in a list?
list1[[1]][1:31,] %>% arrange(vuln)
Essentially all I need to do is select rows 1:31 and then arrange the dataset using vuln. The above achieves this but does it on the first data frame in the list. I was guessing something similar to this:
apply(list1,2,function(x)list[x][1:31] %>% arrange(vuln))
but the above doesn't seem to work. Also, just for comparison, could I see a loop that would achieve the same?
Thanks!
This would be the tidyverse way:
library(dplyr)
library(purrr)
your_list <- list(head(iris),tail(iris))
your_list %>%
  modify(. %>% slice(1:3) %>% arrange(Sepal.Length))
# [[1]]
# # A tibble: 3 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 4.7 3.2 1.3 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 5.1 3.5 1.4 0.2 setosa
#
# [[2]]
# # A tibble: 3 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 6.3 2.5 5.0 1.9 virginica
# 2 6.7 3.3 5.7 2.5 virginica
# 3 6.7 3.0 5.2 2.3 virginica
And this is how to make your own attempt work, with minor corrections:
lapply(your_list, function(x) x[1:3, ] %>% arrange(Sepal.Length))
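Since the question also asks for a loop for comparison, here is a sketch of an equivalent for loop on the same toy list. The name result is illustrative, and 1:3 stands in for the 1:31 of the real data:

```r
library(dplyr)

your_list <- list(head(iris), tail(iris))

# Pre-allocate the output list, then fill it one data frame at a time.
result <- vector("list", length(your_list))
for (i in seq_along(your_list)) {
  result[[i]] <- your_list[[i]][1:3, ] %>% arrange(Sepal.Length)
}
```

The loop does the same work as lapply(); the lapply() version simply handles the pre-allocation and indexing for you.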