I want to get a summary of multiple columns in a data frame, group-wise. I'm using dplyr::group_by and dplyr::summarise_if to get the results, but I'm unable to name the resulting columns after the columns being summarised.
The following example illustrates this:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
iris %>%
group_by(Species) %>%
summarise_if(.predicate = is.numeric,
.funs = ~ list(enframe(x = summary(object = .)))) %>%
unnest() %>%
select(which(x = !duplicated(x = lapply(X = .,
FUN = summary))))
#> # A tibble: 18 x 6
#> Species name value value1 value2 value3
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa Min. 4.3 2.3 1 0.1
#> 2 setosa 1st Qu. 4.8 3.2 1.4 0.2
#> 3 setosa Median 5 3.4 1.5 0.2
#> 4 setosa Mean 5.01 3.43 1.46 0.246
#> 5 setosa 3rd Qu. 5.2 3.68 1.58 0.3
#> 6 setosa Max. 5.8 4.4 1.9 0.6
#> 7 versicolor Min. 4.9 2 3 1
#> 8 versicolor 1st Qu. 5.6 2.52 4 1.2
#> 9 versicolor Median 5.9 2.8 4.35 1.3
#> 10 versicolor Mean 5.94 2.77 4.26 1.33
#> 11 versicolor 3rd Qu. 6.3 3 4.6 1.5
#> 12 versicolor Max. 7 3.4 5.1 1.8
#> 13 virginica Min. 4.9 2.2 4.5 1.4
#> 14 virginica 1st Qu. 6.22 2.8 5.1 1.8
#> 15 virginica Median 6.5 3 5.55 2
#> 16 virginica Mean 6.59 2.97 5.55 2.03
#> 17 virginica 3rd Qu. 6.9 3.18 5.88 2.3
#> 18 virginica Max. 7.9 3.8 6.9 2.5
Created on 2019-05-15 by the reprex package (v0.2.1)
As you can see, the columns are named value, value1, etc., whereas I'd like them to be Sepal.Length, Sepal.Width, etc. Of course, I could rename the columns manually afterwards, but I guess there's a better way to do it using the value argument of tibble::enframe.
As an alternative, I'm currently using the following method. It requires fake data (a call to rnorm just to obtain the statistic names), which is also not ideal.
iris %>%
group_by(Species) %>%
summarise_if(.predicate = is.numeric,
.funs = ~ list(summary(object = .))) %>%
unnest() %>%
group_by(Species) %>%
mutate(Statistic = names(x = summary(object = rnorm(n = 1)))) %>%
ungroup() %>%
select(Species, Statistic, everything())
Any help will be appreciated.
Maybe this way? The statistics within each Species end up in alphabetical rather than summary() order, but I don't think that matters.
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise_if(is.numeric, ~ list(enframe(summary(.)))) %>%
gather('key', 'value', -Species) %>%
unnest() %>%
spread(key, value)
## A tibble: 18 x 6
# Species name Petal.Length Petal.Width Sepal.Length Sepal.Width
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 1st Qu. 1.4 0.2 4.8 3.2
# 2 setosa 3rd Qu. 1.58 0.3 5.2 3.68
# 3 setosa Max. 1.9 0.6 5.8 4.4
# 4 setosa Mean 1.46 0.246 5.01 3.43
# 5 setosa Median 1.5 0.2 5 3.4
# 6 setosa Min. 1 0.1 4.3 2.3
# 7 versicolor 1st Qu. 4 1.2 5.6 2.52
# 8 versicolor 3rd Qu. 4.6 1.5 6.3 3
# 9 versicolor Max. 5.1 1.8 7 3.4
#10 versicolor Mean 4.26 1.33 5.94 2.77
#11 versicolor Median 4.35 1.3 5.9 2.8
#12 versicolor Min. 3 1 4.9 2
#13 virginica 1st Qu. 5.1 1.8 6.22 2.8
#14 virginica 3rd Qu. 5.88 2.3 6.9 3.18
#15 virginica Max. 6.9 2.5 7.9 3.8
#16 virginica Mean 5.55 2.03 6.59 2.97
#17 virginica Median 5.55 2 6.5 3
#18 virginica Min. 4.5 1.4 4.9 2.2
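For readers on a newer dplyr, here is a minimal sketch of another way to keep the original column names. It assumes dplyr 1.1.0 or later (for reframe() together with across() and where()), which postdates the code above; the Statistic column name is only illustrative.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  reframe(
    # Labels of the six statistics, taken from one of the summarised columns
    Statistic = names(summary(Sepal.Length)),
    # as.numeric() drops the summary class so the columns stay plain doubles
    across(where(is.numeric), ~ as.numeric(summary(.x)))
  )

res
```

Because across() keeps the input column names, no renaming or reshaping step is needed, and the statistics stay in summary() order.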
I use dplyr quite a lot for data wrangling, but I never figured out dplyr's filter behaviour when using filter(df, variable == c(value1, value2)).
Let's use the iris data set as an example.
library(dplyr)
data(iris)
# I want to filter by Species 'setosa' and 'versicolor'
# Solution 1
filter1 <- filter(iris, Species == 'setosa' | Species == 'versicolor')
nrow(filter1)
[1] 100 # expected result
# Solution 2
filter2 <- filter(iris, Species %in% c('setosa', 'versicolor'))
nrow(filter2)
[1] 100 # expected result
filter1 == filter2 # both solutions return the exact same result
#Solution 3
filter3 <- filter(iris, Species == c('setosa', 'versicolor'))
nrow(filter3)
[1] 50 # unexpected result
unique(filter3$Species)
[1] setosa versicolor
Levels: setosa versicolor virginica
Although Solution 3 is filtering for the intended species, as shown by unique(filter3$Species), it only returns half of the occurrences (50 compared to 100 in Solution 1 and Solution 2). I would appreciate some guidance on what is actually going on in Solution 3.
filter(iris, Species == c("versicolor", "setosa")) does not make sense intuitively, because a single Species value is not a 2-tuple:
> "setosa" == c("setosa", "versicolor")
[1] TRUE FALSE
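Interestingly, filter(iris, Species == c("setosa", "versicolor")) also returns 50 rows. == compares element by element and recycles the two-element vector along the 150 rows, so each row is tested against only one of the two values and only half of the rows of each matching species are kept. The printed preview shows whichever matching species comes first in the data frame, so sorting in descending order gives you versicolor: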
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
as_tibble()
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
iris %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(Species) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.4 3.2 4.5 1.5 versicolor
#> 2 5.5 2.3 4 1.3 versicolor
#> 3 5.7 2.8 4.5 1.3 versicolor
#> 4 4.9 2.4 3.3 1 versicolor
#> 5 5.2 2.7 3.9 1.4 versicolor
#> 6 5.9 3 4.2 1.5 versicolor
#> 7 6.1 2.9 4.7 1.4 versicolor
#> 8 6.7 3.1 4.4 1.4 versicolor
#> 9 5.8 2.7 4.1 1 versicolor
#> 10 5.6 2.5 3.9 1.1 versicolor
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.9 3.1 4.9 1.5 versicolor
#> 3 6.5 2.8 4.6 1.5 versicolor
#> 4 6.3 3.3 4.7 1.6 versicolor
#> 5 6.6 2.9 4.6 1.3 versicolor
#> 6 5 2 3.5 1 versicolor
#> 7 6 2.2 4 1 versicolor
#> 8 5.6 2.9 3.6 1.3 versicolor
#> 9 5.6 3 4.5 1.5 versicolor
#> 10 6.2 2.2 4.5 1.5 versicolor
#> # … with 40 more rows
Created on 2022-02-11 by the reprex package (v2.0.0)
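The recycling can be checked directly in base R. The following is a minimal sketch; the variable names (species, recycled, keep_eq, keep_in) are only illustrative.

```r
# Compare the Species column with a two-element vector, outside of dplyr
species <- as.character(iris$Species)

# What == effectively compares against: the short vector repeated to length 150
recycled <- rep(c("setosa", "versicolor"), length.out = length(species))

keep_eq <- species == c("setosa", "versicolor")
keep_in <- species %in% c("setosa", "versicolor")

identical(keep_eq, species == recycled)  # the two comparisons agree
sum(keep_eq)                             # 50: every other row of each species
sum(keep_in)                             # 100: every row of both species
```

%in% asks "is this value a member of the set?" for every row, which is why Solution 2 returns all 100 rows.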
I have this example list that contains 3 dataframes:
library(tidyverse)
list_df <- iris %>%
group_by(Species) %>%
slice(1:3) %>%
ungroup() %>%
group_split(Species)
I want to add a new row at the end of each dataframe that shows the column medians.
My attempt so far is not working (although I am sure it worked earlier today):
list_df %>%
map_dfr([,1:4], ~ .x %>%
add_row(!!!map(., median)))
I want to learn why my code is not working and what exactly !!! is for in this situation.
The [, 1:4] on its own is not attached to any data: it is only an index with nothing to subset, and thus the call fails. Subset inside the lambda instead:
list_df %>%
map_dfr(~ .x %>%
add_row(!!! map(.[1:4], median)))
-output
# A tibble: 12 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.9 3.2 1.4 0.2 <NA>
5 7 3.2 4.7 1.4 versicolor
6 6.4 3.2 4.5 1.5 versicolor
7 6.9 3.1 4.9 1.5 versicolor
8 6.9 3.2 4.7 1.5 <NA>
9 6.3 3.3 6 2.5 virginica
10 5.8 2.7 5.1 1.9 virginica
11 7.1 3 5.9 2.1 virginica
12 6.3 3 5.9 2.1 <NA>
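As for what !!! does here: it splices a named list into the call, so each list element becomes its own named argument. A minimal sketch on a small made-up tibble (df and meds are illustrative names, not from the question):

```r
library(tibble)
library(purrr)

df <- tibble(a = c(1, 2, 3), b = c(10, 20, 60))

# A named list with one median per column: list(a = 2, b = 20)
meds <- map(df, median)

# !!! unpacks the list, so these two calls are equivalent
spliced <- add_row(df, !!!meds)
written_out <- add_row(df, a = 2, b = 20)

identical(spliced, written_out)
```

Without !!!, add_row() would receive a single unnamed list instead of one name-value pair per column.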
If we want to add a row with the group information, another option is group_modify (without splitting)
iris %>%
group_by(Species) %>%
slice(1:3) %>%
group_modify(~ .x %>%
add_row(!!! map(.x, median))) %>%
ungroup
-output
# A tibble: 12 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 1.4 0.2
2 setosa 4.9 3 1.4 0.2
3 setosa 4.7 3.2 1.3 0.2
4 setosa 4.9 3.2 1.4 0.2
5 versicolor 7 3.2 4.7 1.4
6 versicolor 6.4 3.2 4.5 1.5
7 versicolor 6.9 3.1 4.9 1.5
8 versicolor 6.9 3.2 4.7 1.5
9 virginica 6.3 3.3 6 2.5
10 virginica 5.8 2.7 5.1 1.9
11 virginica 7.1 3 5.9 2.1
12 virginica 6.3 3 5.9 2.1
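If we want to add up the median rows into a single "Total" row (note that Species then shows its integer codes, because c() combines the factor with a character string),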
iris %>%
group_by(Species) %>%
slice(1:3) %>%
group_modify(~ .x %>%
add_row(!!! map(.x, median))) %>%
mutate(rn = row_number()) %>%
ungroup %>%
summarise(across(2:5, ~ c(.[rn < max(rn)],
sum(.[rn == max(rn)]))), Species = c(Species[rn != max(rn)],
"Total"))
-output
# A tibble: 10 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 1
2 4.9 3 1.4 0.2 1
3 4.7 3.2 1.3 0.2 1
4 7 3.2 4.7 1.4 2
5 6.4 3.2 4.5 1.5 2
6 6.9 3.1 4.9 1.5 2
7 6.3 3.3 6 2.5 3
8 5.8 2.7 5.1 1.9 3
9 7.1 3 5.9 2.1 3
10 18.1 9.4 12 3.8 Total
I'm trying to calculate the 25th, 50th and 75th percentiles of all quantitative variables in the iris dataset, grouped by species. With dplyr::summarize_at it should be possible to do this in a single call. I use the following code, but I always get an error:
iris %>%
group_by(Species) %>%
summarize_at(dplyr::vars(c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")),
.funs=c("25%"=quantile(0.25),
"50%"=quantile(0.50),
"75%"=quantile(0.75)))
This is the error I get: "Error: expecting a one sided formula, a function, or a function name."
Thank you for your help.
I can offer a data.table solution. Unfortunately, I don't have a dplyr solution in mind.
dt <- data.table::as.data.table(iris)
dt <- dt[,lapply(.SD, quantile, probs = c(.25,.5,.75)),
.SDcols = c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"),
by = "Species"]
dt[,'quantile' := c("25%","50%","75%")]
#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width quantile
# 1: setosa 4.800 3.200 1.400 0.2 25%
# 2: setosa 5.000 3.400 1.500 0.2 50%
# 3: setosa 5.200 3.675 1.575 0.3 75%
# 4: versicolor 5.600 2.525 4.000 1.2 25%
# 5: versicolor 5.900 2.800 4.350 1.3 50%
# 6: versicolor 6.300 3.000 4.600 1.5 75%
# 7: virginica 6.225 2.800 5.100 1.8 25%
# 8: virginica 6.500 3.000 5.550 2.0 50%
# 9: virginica 6.900 3.175 5.875 2.3 75%
Hope that helps!
Using the development version of dplyr (0.8.9), we can use summarise with across. One drawback is that the names of the quantiles are not returned, although we know which is which because the quantiles are computed in the order we specify:
iris %>%
group_by(Species) %>%
summarise(across(is.numeric,~c(`25%`=quantile(.x,0.25), `50%`=
quantile(.x,0.5),
`75%`= quantile(.x,0.75))))
The above is equivalent to:
iris %>%
group_by(Species) %>%
summarise_if(is.numeric,~c(`25%`=quantile(.x,0.25), `50%`=
quantile(.x,0.5),
`75%`= quantile(.x,0.75)))
Result:
# A tibble: 9 x 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.8 3.2 1.4 0.2
2 setosa 5 3.4 1.5 0.2
3 setosa 5.2 3.68 1.58 0.3
4 versicolor 5.6 2.52 4 1.2
5 versicolor 5.9 2.8 4.35 1.3
6 versicolor 6.3 3 4.6 1.5
7 virginica 6.22 2.8 5.1 1.8
8 virginica 6.5 3 5.55 2
9 virginica 6.9 3.18 5.88 2.3
One way to add the names of the quantiles. Note, however, that dplyr only recycles vectors of length 1, which means we have to repeat the labels explicitly:
iris %>%
group_by(Species) %>%
summarise_if(is.numeric,~c(`25%`=quantile(.x,0.25), `50%`=
quantile(.x,0.5),
`75%`= quantile(.x,0.75))) %>%
mutate(quant= rep(c("25%","50%","75%"),nrow(.) / 3))
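You can also save the summarise result (res here) and resort to good ol' base R for the recycling: res$quant <- c("25%","50%","75%"). This relies on base data frame recycling; recent tibble versions may refuse to recycle a length-3 vector, in which case convert with as.data.frame() first.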
# A tibble: 9 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width quant
<fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 setosa 4.8 3.2 1.4 0.2 25%
2 setosa 5 3.4 1.5 0.2 50%
3 setosa 5.2 3.68 1.58 0.3 75%
4 versicolor 5.6 2.52 4 1.2 25%
5 versicolor 5.9 2.8 4.35 1.3 50%
6 versicolor 6.3 3 4.6 1.5 75%
7 virginica 6.22 2.8 5.1 1.8 25%
8 virginica 6.5 3 5.55 2 50%
9 virginica 6.9 3.18 5.88 2.3 75%
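On newer dplyr versions, multi-row summaries like this are usually written with reframe(). The following is a minimal sketch, assuming dplyr 1.1.0 or later; probs and quant are illustrative names, and the labels are derived from probs instead of being hardcoded.

```r
library(dplyr)

probs <- c(0.25, 0.5, 0.75)

res <- iris %>%
  group_by(Species) %>%
  reframe(
    # One row per quantile and group; names = FALSE returns a plain numeric vector
    across(where(is.numeric), ~ quantile(.x, probs, names = FALSE)),
    # Label column built from the same probs, so labels and values stay in sync
    quant = paste0(probs * 100, "%")
  )

res
```

Since quant has the same length as each quantile vector within a group, no manual rep() over the number of groups is needed.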
When using dplyr to create a table of summary statistics organized by the levels of a variable, I cannot figure out the syntax for calculating quartiles without having to repeat the column name. That is, calls such as vars() and list() work with other functions, such as mean() and median(), but not with quantile().
Searches have produced antiquated solutions that no longer work because they use deprecated calls, such as do() and/or funs().
data(iris)
library(tidyverse)
#This works: Notice I have not attempted to calculate quartiles yet
summary_stat <- iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length),
list(min=min, median=median, max=max,
mean=mean, sd=sd)
)
# A tibble: 3 x 6
Species min median max mean sd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 5 5.8 5.01 0.352
2 versicolor 4.9 5.9 7 5.94 0.516
3 virginica 4.9 6.5 7.9 6.59 0.636
##########################################################################
#Does NOT work:
five_number_summary <- iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length),
list(min=min, Q1=quantile(.,probs = 0.25),
median=median, Q3=quantile(., probs = 0.75),
max=max))
Error: Must use a vector in `[`, not an object of class matrix.
Call `rlang::last_error()` to see a backtrace
###########################################################################
#This works: Remove the vars() argument, remove the list() argument,
#replace summarise_at() with summarise()
#but the code requires repeating the column name (Sepal.Length)
five_number_summary <- iris %>%
group_by(Species) %>%
summarise(min=min(Sepal.Length),
Q1=quantile(Sepal.Length,probs = 0.25),
median=median(Sepal.Length),
Q3=quantile(Sepal.Length, probs = 0.75),
max=max(Sepal.Length))
# A tibble: 3 x 6
Species min Q1 median Q3 max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 4.8 5 5.2 5.8
2 versicolor 4.9 5.6 5.9 6.3 7
3 virginica 4.9 6.22 6.5 6.9 7.9
This last piece of code produces exactly what I am looking for, but I am wondering why there isn't a shorter syntax that doesn't force me to repeat the variable.
You're missing the ~ in front of the quantile function in the summarise_at call that failed. Try the following:
five_number_summary <- iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length),
list(min=min, Q1=~quantile(., probs = 0.25),
median=median, Q3=~quantile(., probs = 0.75),
max=max))
five_number_summary
# A tibble: 3 x 6
Species min Q1 median Q3 max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 4.3 4.8 5 5.2 5.8
2 versicolor 4.9 5.6 5.9 6.3 7
3 virginica 4.9 6.22 6.5 6.9 7.9
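To see why the ~ matters: without it, quantile(., probs = 0.25) is a call that R evaluates right away while building the list, whereas the formula is only a recipe that gets turned into a function later. A minimal sketch of that conversion using rlang::as_function() (that dplyr does something equivalent internally is my assumption; q1_formula and q1 are illustrative names):

```r
library(rlang)

# A one-sided formula is only a recipe; nothing is computed yet
q1_formula <- ~ quantile(., probs = 0.25)

# as_function() turns the recipe into a function whose first argument is `.`
q1 <- as_function(q1_formula)

# Now it can be applied to any column, just like min or median
q1(iris$Sepal.Length)
```

Bare names such as min and median are already functions, which is why they work in the list without a ~.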
You can create a list column and then use unnest_wider, which requires tidyr 1.0.0
library(tidyverse)
iris %>%
group_by(Species) %>%
summarise(q = list(quantile(Sepal.Length))) %>%
unnest_wider(q)
# # A tibble: 3 x 6
# Species `0%` `25%` `50%` `75%` `100%`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 4.3 4.8 5 5.2 5.8
# 2 versicolor 4.9 5.6 5.9 6.3 7
# 3 virginica 4.9 6.22 6.5 6.9 7.9
There's a names_repair argument, but it applies to the names of all columns in the result, not just the ones being unnested:
iris %>%
group_by(Species) %>%
summarise(q = list(quantile(Sepal.Length))) %>%
unnest_wider(q, names_repair = ~paste0('Q_', sub('%', '', .)))
# # A tibble: 3 x 6
# Q_Species Q_0 Q_25 Q_50 Q_75 Q_100
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 4.3 4.8 5 5.2 5.8
# 2 versicolor 4.9 5.6 5.9 6.3 7
# 3 virginica 4.9 6.22 6.5 6.9 7.9
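If the goal is only to prefix the unnested columns, the names_sep argument of unnest_wider() may be a better fit than names_repair. A minimal sketch, assuming tidyr 1.0.0 or later; the resulting prefix is simply the name of the list column (q here).

```r
library(dplyr)
library(tidyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(q = list(quantile(Sepal.Length))) %>%
  # names_sep pastes the outer name (q) onto each inner name; Species is untouched
  unnest_wider(q, names_sep = "_")

names(res)
```

Renaming the list column before unnesting (for example to Q) changes the prefix without touching the other columns.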
Another option is group_modify
iris %>%
group_by(Species) %>%
group_modify(~as.data.frame(t(quantile(.$Sepal.Length))))
# # A tibble: 3 x 6
# # Groups: Species [3]
# Species `0%` `25%` `50%` `75%` `100%`
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 4.3 4.8 5 5.2 5.8
# 2 versicolor 4.9 5.6 5.9 6.3 7
# 3 virginica 4.9 6.22 6.5 6.9 7.9
Or you could use data.table
library(data.table)
irisdt <- as.data.table(iris)
irisdt[, as.list(quantile(Sepal.Length)), Species]
# Species 0% 25% 50% 75% 100%
# 1: setosa 4.3 4.800 5.0 5.2 5.8
# 2: versicolor 4.9 5.600 5.9 6.3 7.0
# 3: virginica 4.9 6.225 6.5 6.9 7.9
A more up-to-date version of @arienrhod's answer, using across():
library(dplyr,quietly = TRUE,verbose = FALSE, warn.conflicts = FALSE)
five_number_summary <- iris %>%
group_by(Species) %>%
summarise(across(Sepal.Length, list(min=min, Q1=~quantile(., probs = 0.25),
median=median, Q3=~quantile(., probs = 0.75),
max=max), .names = "{.fn}"))
five_number_summary
#> # A tibble: 3 x 6
#> Species min Q1 median Q3 max
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 4.3 4.8 5 5.2 5.8
#> 2 versicolor 4.9 5.6 5.9 6.3 7
#> 3 virginica 4.9 6.22 6.5 6.9 7.9
Created on 2022-02-21 by the reprex package (v2.0.1)
I am using group_split from dplyr, but I need every data frame in the list to keep the name of its group.
Example from the dplyr documentation (notice that the data frames are numbered; ideally every data frame would carry the name of its group: setosa, versicolor, ...):
ir <- iris %>%
group_by(Species)
group_split(ir)
#> [[1]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 40 more rows
#>
#> [[2]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 40 more rows
#>
#> [[3]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.3 3.3 6 2.5 virginica
#> 2 5.8 2.7 5.1 1.9 virginica
#> 3 7.1 3 5.9 2.1 virginica
#> 4 6.3 2.9 5.6 1.8 virginica
#> 5 6.5 3 5.8 2.2 virginica
#> 6 7.6 3 6.6 2.1 virginica
#> 7 4.9 2.5 4.5 1.7 virginica
#> 8 7.3 2.9 6.3 1.8 virginica
#> 9 6.7 2.5 5.8 1.8 virginica
#> 10 7.2 3.6 6.1 2.5 virginica
#> # … with 40 more rows
#>
#> attr(,"ptype")
#> # A tibble: 0 x 5
#> # … with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
#> # Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>
group_split does not preserve names. From ?group_split:
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
You could use base::split for that:
split(iris, iris$Species)
Or name the list of tibbles separately using setNames.
library(dplyr)
group_split(ir) %>% setNames(unique(iris$Species))
group_split splits based on the factor levels of the data, so if we want to split in the order in which the groups occur in the data, we might have to rearrange the factor levels. In the iris dataset the factor levels are in the same order as they occur in the data, hence the above works.
More generally, we should use:
iris %>%
mutate(Species= factor(Species, levels = unique(Species))) %>%
group_split(Species) %>%
setNames(unique(iris$Species))
We can use set_names from purrr (part of the tidyverse):
library(tidyverse)
ir %>%
group_split() %>%
set_names(levels(iris$Species))
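A variation that takes the names from the grouping itself instead of from the original data is to use group_keys(), which returns one row per group in the same order as group_split(). A minimal sketch for a single grouping variable (named is an illustrative name):

```r
library(dplyr)

ir <- iris %>%
  group_by(Species)

# group_keys() lists the groups in the order group_split() returns them
keys <- group_keys(ir)

named <- setNames(group_split(ir), as.character(keys$Species))

names(named)
```

With several grouping variables, the key columns would first have to be pasted together into a single label per group.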