dplyr summarise (collapse) dataset by different functions for multiple columns

I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1 5.1 3.5 1.4 0.2 setosa 2000 red A
2 4.9 3.0 1.4 0.2 setosa 2000 red A
3 4.7 3.2 1.3 0.2 setosa 2000 red A
The resulting dataset should look like this:
full
Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 setosa 2000 87 6.2 5.8 1.9 A red
2 setosa 3000 84.4 6.1 5.5 1.9 A red
3 versicolor 2000 69.4 33.6 7 4.9 B green
4 versicolor 3000 69.1 32.7 6.8 5.1 B green
5 virginica 2000 73.2 51.1 7.7 6.9 C blue
6 virginica 3000 75.5 50.2 7.9 6.4 C blue
I can achieve this by doing the following which is a bit repetitive:
sums <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Width")), list(sum))
max <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Length")), list(max))
last <- iris %>%
group_by(Species, year) %>%
summarise_if(is.factor, list(last))
first <- iris %>%
group_by(Species, year) %>%
summarise_if(is.character, list(first))
full <- full_join(sums, max) %>% full_join(last) %>% full_join(first)
I have found similar approaches (linked below) but can't get the attempt below to work. I would prefer not to write my own function, as I think passing everything through a pipe and joining is cleaner:
test <- iris %>%
#group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
group_by_at(.vars = vars(Species, year)) %>% #doesnt work
{left_join(
summarise_at(., vars(matches("Width")), list(sum)),
summarise_at(., vars(matches("Length")), list(max)),
summarise_if(., is.factor, list(last)),
summarise_if(., is.character, list(first))
)
} #doesnt work
This doesn't work. Any suggestions or other approaches?
Helpful:
How can I use summarise_at to apply different functions to different columns?
Summarize different Columns with different Functions
Using dplyr summarize with different operations for multiple columns

The dplyr::left_join() function joins exactly two data frames at a time. To join more than two, you can apply it iteratively with the base R function Reduce():
iris %>%
group_by(Species, year) %>%
{
Reduce(
function(x, y) left_join(x, y),
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
))
}
# Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 setosa 2000 87 6.2 5.8 1.9 A red
# 2 setosa 3000 84.4 6.1 5.5 1.9 A red
# 3 versicolor 2000 69.4 33.6 7 4.9 B green
# 4 versicolor 3000 69.1 32.7 6.8 5.1 B green
# 5 virginica 2000 73.2 51.1 7.7 6.9 C blue
# 6 virginica 3000 75.5 50.2 7.9 6.4 C blue
Furthermore, notice that I called the functions with their package prefix (::). The question's code created data frames named max, last and first, and those names would otherwise be picked up instead of the functions.
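To see why the prefix matters, here is a minimal sketch of the name clash in isolation. The object created here is only an illustration and not part of the question's data: once a data frame is stored under the name max, passing max as a value (as in list(max)) hands over the data frame, while calling max(...) still finds the function.

```r
# Hypothetical illustration: shadow the name `max` with a data frame,
# much like the question's `max <- iris %>% ...` assignment does.
max <- data.frame(x = 1)

shadowed  <- is.function(max)        # FALSE: used as a value, `max` is the data frame
call_ok   <- max(3, 7)               # 7: in call position R skips non-function objects
qualified <- is.function(base::max)  # TRUE: `::` always reaches the package's function

rm(max)  # remove the shadowing object again
```

So summarise_at(., vars(...), base::max) is safe regardless of what else is called max in the workspace.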

Borrowing @Ulises's idea, an alternative is to use purrr::reduce instead of Reduce:
iris %>%
group_by(Species, year) %>%
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
) %>%
.[2:5] %>% # drop element 1, the grouped data that %>% inserts as the first argument
reduce(left_join)
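The subsetting step that keeps elements 2 to 5 is needed because of how the magrittr pipe places its left-hand side: when the dot only appears inside nested calls, the left-hand side is still inserted as the first argument. A small sketch with a plain vector (the variable names are just for illustration):

```r
library(magrittr)

# The dot appears only inside a nested call, so the left-hand side is
# still inserted first: this evaluates list(1:3, length(1:3)).
with_lhs <- 1:3 %>% list(length(.))

# Wrapping the right-hand side in braces turns that insertion off.
without_lhs <- 1:3 %>% { list(length(.)) }

length(with_lhs)     # 2
length(without_lhs)  # 1
```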
Or wrap the call in curly brackets so that the pipe does not insert the data as the first argument:
iris %>%
group_by(Species, year) %>%
{
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
)
} %>%
reduce(left_join)
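If you are on dplyr 1.0.0 or later, the joins can probably be avoided altogether, since across() lets a single summarise() apply a different function to each set of columns. This is a sketch of that alternative rather than a fix of the joins above. It rebuilds the example data under the name df (my choice) so the original iris stays untouched:

```r
library(dplyr)

# Rebuild the example data from the question
df <- iris
df$year   <- rep(c(2000, 3000), each = 25)              # for grouping
df$color  <- rep(c("red", "green", "blue"), each = 50)  # character column
df$letter <- factor(rep(c("A", "B", "C"), each = 50))   # factor column

full <- df %>%
  group_by(Species, year) %>%
  summarise(
    across(matches("Width"), sum),              # sum of the width columns
    across(matches("Length"), max),             # max of the length columns
    across(where(is.factor), dplyr::last),      # last value of factor columns
    across(where(is.character), dplyr::first),  # first value of character columns
    .groups = "drop"
  )

full
```

Grouping columns are excluded from across(), so Species is left alone even though it is a factor.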

Related

select n-th largest row per group

I am trying to select the n-th largest row per group in a dataset. For example, with the iris dataset, I found this code online that returns the second largest value of Sepal.Length for each species:
library(dplyr)
myfun <- function(x) {
u <- unique(x)
sort(u, decreasing = TRUE)[2L]
}
iris %>%
group_by(Species) %>%
summarise(result = myfun(Sepal.Length))
I just want to check whether I have understood this correctly. If I want the 3rd largest, do I just make the change below? And how can I select all rows from the original data?
library(dplyr)
myfun <- function(x) {
u <- unique(x)
sort(u, decreasing = TRUE)[3L]
}
iris %>%
group_by(Species) %>%
summarise(result = myfun(Sepal.Length))
Just modify the function to have an extra argument n to make it dynamic
myfun <- function(x, n) {
u <- unique(x)
sort(u, decreasing = TRUE)[n]
}
and then call as
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(result = myfun(Sepal.Length, 3))
-output
# A tibble: 3 × 2
Species result
<fct> <dbl>
1 setosa 5.5
2 versicolor 6.8
3 virginica 7.6
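One edge case worth knowing about: if a group has fewer than n distinct values, indexing past the end of the sorted vector gives NA rather than an error. A quick check on a toy vector (the input values are made up):

```r
myfun <- function(x, n) {
  u <- unique(x)
  sort(u, decreasing = TRUE)[n]
}

second <- myfun(c(5, 5, 3), 2)  # 3: second largest distinct value
third  <- myfun(c(5, 5, 3), 3)  # NA: only two distinct values exist
```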
To get all the numeric columns, loop across the numeric columns
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), ~ myfun(.x, 3)))
# or use nth
# summarise(across(where(is.numeric), ~ nth(unique(.x),
# order_by = -unique(.x), 3)))
-output
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.5 4.1 1.6 0.4
2 versicolor 6.8 3.2 4.9 1.6
3 virginica 7.6 3.4 6.6 2.3
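We could use nth from the dplyr package after grouping and arranging. Note that only Sepal.Length is sorted here, so the other columns return their third distinct value in that row order, not their own third largest value:
library(dplyr)
iris %>%
group_by(Species) %>%
arrange(-Sepal.Length, .by_group = TRUE) %>%
summarise(across(everything(), ~nth(unique(.x), 3)))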
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.5 3.8 1.7 0.3
2 versicolor 6.8 2.8 4.8 1.7
3 virginica 7.6 2.8 6.9 2.3
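For the second part of the question (keeping all rows of the original data that hold the n-th largest value), one option is to filter on a dense rank instead of summarising. This is a sketch of that idea, assuming "n-th largest" means the n-th largest distinct value, as in myfun above:

```r
library(dplyr)

n <- 3  # which distinct value to keep, counted from the largest

nth_largest_rows <- iris %>%
  group_by(Species) %>%
  # dense_rank() gives tied values the same rank and leaves no gaps,
  # so rank n corresponds to the n-th largest distinct value
  filter(dense_rank(desc(Sepal.Length)) == n) %>%
  ungroup()

nth_largest_rows
```

All columns are kept, and a species contributes several rows when its n-th largest value occurs more than once.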

R dplyr with seq()

I need to mutate a column in a dataframe, with the seq of another column.
For example with iris, I would like to add a new column for each Species, with
seq(min(Sepal.Length),max(Sepal.Length),length=100)
I tried (with no success):
iris %>%
group_by(Species) %>%
mutate(seqq = seq(min(Sepal.Length),max(Sepal.Length), 100))
Any ideas?
thank you!
mutate must return either a single value or as many values as there are rows in each group (here 50), so a sequence of length 100 does not fit. We may use summarise instead:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(seq = seq(min(Sepal.Length),max(Sepal.Length),
length.out = 100), .groups = 'drop')
-output
# A tibble: 300 x 2
Species seq
<fct> <dbl>
1 setosa 4.3
2 setosa 4.32
3 setosa 4.33
4 setosa 4.35
5 setosa 4.36
6 setosa 4.38
7 setosa 4.39
8 setosa 4.41
9 setosa 4.42
10 setosa 4.44
# … with 290 more rows
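Two side notes, sketched below. First, in the question's attempt the third positional argument of seq() is by, not the length, so seq(min, max, 100) yields a single value. Second, as far as I know dplyr 1.1.0 and later deprecate summarise() results with more than one row per group in favour of reframe(), so this assumes dplyr 1.1.0 or later:

```r
library(dplyr)

# Third positional argument is `by`, so a step of 100 gives one value
len_by  <- length(seq(4.3, 5.8, 100))               # 1
len_out <- length(seq(4.3, 5.8, length.out = 100))  # 100

# reframe() allows any number of rows per group and returns ungrouped data
grid <- iris %>%
  group_by(Species) %>%
  reframe(seq = seq(min(Sepal.Length), max(Sepal.Length), length.out = 100))

grid
```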

dplyr-summarise, keep the original group name

iris %>% mutate(subgroup=rep(c('A','B'),75)) %>% group_by(Species) %>% summarise(SLmin=min(Sepal.Length))
Species SLmin
<fct> <dbl>
1 setosa 4.3
2 versicolor 4.9
3 virginica 4.9
I want to keep the original subgroup name, but when I add subgroup to the grouping:
iris %>% mutate(subgroup=rep(c('A','B'),75)) %>% group_by(Species,subgroup) %>% summarise(SLmin=min(Sepal.Length))
Species subgroup SLmin
<fct> <chr> <dbl>
1 setosa A 4.4
2 setosa B 4.3
3 versicolor A 5
4 versicolor B 4.9
5 virginica A 4.9
6 virginica B 5.6
this code no longer returns the minimum for each species.
Do you have any idea?
PS:
It was hard to explain, so let me clarify.
I need the subgroup that the minimum belongs to.
The summarised result should look like this:
setosa B 4.3
versicolor B 4.9
virginica A 4.9
You can use which.min to get the index of the minimum value of Sepal.Length; this index can then be used to pick the corresponding subgroup value.
library(dplyr)
iris %>%
mutate(subgroup=rep(c('A','B'),75)) %>%
group_by(Species) %>%
summarise(SLmin=min(Sepal.Length),
subgroup = subgroup[which.min(Sepal.Length)])
# Species SLmin subgroup
# <fct> <dbl> <chr>
#1 setosa 4.3 B
#2 versicolor 4.9 B
#3 virginica 4.9 A
An alternative is to select the minimum row for each Species and then keep only the columns that we need in the final output.
iris %>%
mutate(subgroup=rep(c('A','B'),75)) %>%
group_by(Species) %>%
slice(which.min(Sepal.Length)) %>%
select(Species, SLmin = Sepal.Length, subgroup)
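With dplyr 1.0.0 or later, the same row selection can also be written with slice_min(). A sketch, using with_ties = FALSE so that exactly one row per species is kept even if the minimum is tied:

```r
library(dplyr)

res <- iris %>%
  mutate(subgroup = rep(c("A", "B"), 75)) %>%
  group_by(Species) %>%
  slice_min(Sepal.Length, n = 1, with_ties = FALSE) %>%  # one minimum row per group
  ungroup() %>%
  select(Species, SLmin = Sepal.Length, subgroup)        # keep only the needed columns

res
```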

Create unique random group id in R [duplicate]

I am trying to create a unique, randomly assigned (without replacement) group id without using a for loop. This is as far as I got:
library(datasets)
library(dplyr)
data(iris)
iris <- iris %>% group_by(Species) %>% mutate(id = cur_group_id())
This gives me a group id for each iris$Species; however, I would like the group id to be randomly assigned from c(1,2,3) rather than based on the order of the dataset.
Any help creating this would be very helpful! I am sure there is a way to do this with dplyr but I am stumped...
Maybe you can play a trick on group_by by shuffling the factor levels with sample, e.g.,
iris <- iris %>%
group_by(factor(Species, levels = sample(levels(Species)))) %>%
mutate(id = cur_group_id())
Here's a sample answer that draws one random number per group and then ranks those numbers.
library(datasets)
library(dplyr)
data(iris)
df <- iris %>%
group_by(Species) %>%
mutate(id = runif(1,0,1)) %>%
ungroup() %>%
mutate(id = dense_rank(id))
df %>% sample_n(10)
#> # A tibble: 10 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 4.4 3 1.3 0.2 setosa 3
#> 2 6.5 3 5.5 1.8 virginica 2
#> 3 6.3 2.7 4.9 1.8 virginica 2
#> 4 5 3.6 1.4 0.2 setosa 3
#> 5 6.3 2.3 4.4 1.3 versicolor 1
#> 6 7.9 3.8 6.4 2 virginica 2
#> 7 5.4 3.9 1.7 0.4 setosa 3
#> 8 5.7 4.4 1.5 0.4 setosa 3
#> 9 6.4 2.8 5.6 2.2 virginica 2
#> 10 5.2 3.4 1.4 0.2 setosa 3
Created on 2020-07-29 by the reprex package (v0.3.0)
Here's an approach with sample and recode:
Use seq_along(unique(id)) to create a vector of integer values to recode to.
Use sample to sample the appropriate number of random values.
Use setNames to name the ids with their new random values.
Use !!! to splice that named vector into recode as separate arguments.
Use recode to change the values.
iris %>%
group_by(Species) %>%
mutate(id = cur_group_id()) %>%
mutate(id = recode(id, !!!setNames(unique(id),
sample(seq_along(unique(id))))))
I think the other answers are better approaches, but having recode with !!! in your toolkit is helpful in other situations.
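The steps above boil down to building a lookup from each group to a randomly permuted id. Here is a minimal sketch of the same idea using plain named-vector indexing instead of recode and !!!. The names species, lookup and with_id are mine, and set.seed is only there to make the illustration reproducible:

```r
library(dplyr)

set.seed(1)  # only so the illustration is reproducible

species <- levels(iris$Species)
# Named vector: one randomly permuted id per species
lookup <- setNames(sample(seq_along(species)), species)

with_id <- iris %>%
  mutate(id = unname(lookup[as.character(Species)]))

distinct(with_id, Species, id)
```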
Randomise the rows and then assign id based on the occurrence of Species :
library(dplyr)
iris %>%
slice_sample(n = nrow(.)) %>%
#sample_n for dplyr < 1.0.0
#sample_n(n()) %>%
mutate(id = match(Species, unique(Species)))
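The match(x, unique(x)) idiom numbers values by order of first appearance, which is why shuffling the rows first makes the ids random. A toy example with a made-up vector:

```r
x <- c("b", "a", "b", "c")

# unique(x) is c("b", "a", "c"), so "b" maps to 1, "a" to 2, "c" to 3
ids <- match(x, unique(x))
ids  # 1 2 1 3
```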

Grouped time series lag on selected variables using dplyr

I am trying to use dplyr to lag some variables (all of which have a common naming convention) for each group in my data set.
I thought mutate_if would work, but I get an error (below). mutate_each works, but for all columns rather than the select few.
For example, I am looking to lag only the Sepal measurements:
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
# mutate_each(funs(lag(.)))
mutate_if(contains("Sepal"), funs(lag(.)))
#> Error in get(as.character(FUN), mode = "function", envir = envir) : object 'p' of mode 'function' was not found
to get a final data set like:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 NA NA 1.4 0.2 setosa
# 2 5.1 3.5 1.4 0.2 setosa
# 3 4.9 3.0 1.3 0.2 setosa
# 4 NA NA 4.7 1.4 versicolor
# 5 7.0 3.2 4.5 1.5 versicolor
# 6 6.4 3.2 4.9 1.5 versicolor
# 7 NA NA 6.0 2.5 virginica
# 8 6.3 3.3 5.1 1.9 virginica
# 9 5.8 2.7 5.9 2.1 virginica
This seems to work,
library(dplyr)
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
mutate_if(grepl('Sepal', names(.)), funs(lag(.)))
As @aosmith explains, contains returns the indices of the columns that match the string, whereas mutate_if expects a predicate function or a logical vector, which is why the grepl option works.
In addition, as @StevenBeaupre mentions, mutate_at with vars() accepts select helpers such as contains directly:
iris %>%
tbl_df() %>%
group_by(Species) %>%
slice(1:3) %>%
mutate_at(vars(contains('Sepal')), lag)
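On dplyr 1.0.0 or later, the same thing can be expressed with across(), which takes select helpers directly and replaces the scoped mutate_if/mutate_at verbs. A sketch of that version, using as_tibble() in place of tbl_df():

```r
library(dplyr)

lagged <- iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  slice(1:3) %>%
  # lag only the columns whose names contain "Sepal", within each group
  mutate(across(contains("Sepal"), dplyr::lag)) %>%
  ungroup()

lagged
```

The first row of each species gets NA in the Sepal columns, while the Petal columns are unchanged.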
