dplyr conditional summarise function - r

I have this situation where I need a different summary function based on a condition.
For example, using iris, say for some reason I wanted the sum of the petal width if the species was setosa, otherwise I wanted the mean of the petal width.
Naively, I wrote this using case_when, which does not work:
iris <- tibble::as_tibble(iris)
iris %>%
group_by(Species) %>%
summarise(pwz = case_when(
Species == "setosa" ~ sum(Petal.Width, na.rm = TRUE),
TRUE ~ mean(Petal.Width, na.rm = TRUE)))
Error in summarise_impl(.data, dots) :
Column pwz must be length 1 (a summary value), not 50
I eventually found something like this, summarizing using each method, and then in a mutate picking which one I actually wanted:
iris %>%
group_by(Species) %>%
summarise(pws = sum(Petal.Width, na.rm = TRUE),
pwm = mean(Petal.Width, na.rm = TRUE)) %>%
mutate(pwz = case_when(
Species == "setosa" ~ pws,
TRUE ~ pwm)) %>%
select(-pws, -pwm)
But that seems more than a bit awkward with creating all these summarized values and only picking one at the end, especially when my real case_when is a lot more complicated. Can I not use case_when inside of summarise? Do I have my syntax wrong? Any help is appreciated!
Edit: I suppose I should have pointed out that I have multiple conditions/functions (just assume I've got, depending on the variable, some that need mean, sum, max, min, or other summary).

This is pretty easy with data.table
library(data.table)
iris2 <- as.data.table(iris)
iris2[, if(Species == 'setosa') sum(Petal.Width)
else mean(Petal.Width)
, by = Species]
More concisely, but maybe not as clear
iris2[, ifelse(Species == 'setosa', sum, mean)(Petal.Width)
, by = Species]
With dplyr you can do
iris %>%
group_by(Species) %>%
summarise(pwz = if_else(first(Species == "setosa")
, sum(Petal.Width)
, mean(Petal.Width)))
Note:
I'm thinking it probably makes more sense to "spread" your data with tidyr::spread so that each day has a column for temperature, rainfall, etc. Then you can use summarise in the usual way.
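For the multiple-function case raised in the question's edit, the same first(Species) idea can be extended with a lookup of functions. case_when() fails inside summarise() because it is vectorised over all 50 rows of Species and so returns 50 values, whereas picking one function per group returns a single value. The following is only a sketch: summary_fns and pick_fn are hypothetical names introduced here, and the mapping of species to functions is made up for illustration.

```r
library(dplyr)

# Hypothetical lookup: which summary function to use for which species.
summary_fns <- list(setosa = sum, versicolor = max)

# Hypothetical helper: return the function registered for a group key,
# falling back to a default when the key has no entry.
pick_fn <- function(key, default = mean) {
  fn <- summary_fns[[as.character(key)]]
  if (is.null(fn)) default else fn
}

result <- iris %>%
  group_by(Species) %>%
  # first(Species) is the single key of the current group
  summarise(pwz = pick_fn(first(Species))(Petal.Width, na.rm = TRUE))

result
```

Here setosa gets the sum, versicolor the max, and virginica falls through to the default mean. Adding another condition means adding one entry to the list rather than another branch.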

Why not calculate at the row level first, then summarize?
iris %>% group_by(Species) %>% mutate(pwz = case_when(
Species == "setosa" ~ sum(Petal.Width, na.rm = TRUE),
TRUE ~ mean(Petal.Width, na.rm = TRUE))) %>%
summarize(pwz= first(pwz))
# A tibble: 3 x 2
Species pwz
<fctr> <dbl>
1 setosa 12.300
2 versicolor 1.326
3 virginica 2.026

data(iris)
library(dplyr)
sum_species <- c('setosa')
iris %>%
group_by(Species) %>%
summarise(pwz_sum = sum(Petal.Width, na.rm=T),
pwz_mean= mean(Petal.Width, na.rm=T)) %>%
ungroup() %>%
mutate(pwz = if_else(Species %in% sum_species, pwz_sum, pwz_mean))

You could always do something like this if you want to put everything in the summarise function. But it's no less complicated than your original workaround:
iris %>%
group_by(Species) %>%
summarise(pwz =
sum(Petal.Width, na.rm = TRUE)*
(1/n()*mean(Species != "setosa") +
mean(Species == "setosa")))

You could split your data.frame and then use map2_dfr to apply a different function on each part and stitch the results back together:
library(tidyverse) # purrr & dplyr
iris %>%
arrange(Species=="setosa") %>%
split(.,.$Species=="setosa") %>%
map2_dfr(c(mean,sum),~.x %>% group_by(Species) %>% summarize_at("Petal.Width",.y))
# # A tibble: 3 x 2
# Species Petal.Width
# <fctr> <dbl>
# 1 versicolor 1.326
# 2 virginica 2.026
# 3 setosa 12.300

Related

filter in one group only instead of all

I was wondering if there is a way to filter/exclude rows inside only one group after using the group_by function in dplyr. At the moment the filter function is run on all groups, although I only want to exclude certain values inside one group.
This keeps only rows for which Petal.Length > 1.5 for the setosa Species and keeps all rows for the other Species.
1) cur_group() returns a one-row tibble with a column for each grouping variable giving its current value.
library(dplyr)
iris %>%
group_by(Species) %>%
filter(Petal.Length > 1.5 | cur_group()$Species != "setosa") %>%
ungroup
2) Within group_modify, when a formula argument is used, .x refers to the non-group columns of the current group and .y is similar to cur_group() in (1).
iris %>%
group_by(Species) %>%
group_modify(
~ if (.y$Species == "setosa") filter(.x, Petal.Length > 1.5) else .x
) %>%
ungroup
Maybe in this case you can formulate it without group_by()?
iris |> filter(Species != 'setosa' | Petal.Length > 1.5)
# Perhaps clearer:
iris |> filter(!(Species == 'setosa' & Petal.Length <= 1.5))
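As a quick sanity check, the grouped version from (1) and the ungrouped version can be compared directly. This is a minimal sketch, assuming dplyr 1.0.0 or later (for cur_group()); the names grouped and plain are introduced here only for the comparison.

```r
library(dplyr)

# Grouped approach: the condition is relaxed for every group except setosa.
grouped <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length > 1.5 | cur_group()$Species != "setosa") %>%
  ungroup()

# Ungrouped approach: the same condition expressed row by row.
plain <- iris %>%
  filter(Species != "setosa" | Petal.Length > 1.5)

# Both keep the same rows in the same order.
identical(grouped$Petal.Length, plain$Petal.Length)
nrow(grouped) == nrow(plain)
```

The ungrouped form is enough whenever the condition depends only on row values. The grouped forms become necessary once the condition needs a per-group quantity such as min() or n().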

R 3.5.2: Pipe inside custom function - object 'column' not found [duplicate]

I am having issues with pipes inside a custom function. Based on the previous posts, I understand that a pipe inside a function creates another level(?) which results in the error I'm getting (see below).
I'm hoping to write a summary function for a large data set with hundreds of numeric and categorical variables. I would like to have the option to use this on different data frames (with similar structure), always group by a certain factor variable and get summaries for multiple columns.
library(tidyverse)
data(iris)
iris %>% group_by(Species) %>% summarise(count = n(), mean = mean(Sepal.Length, na.rm = T))
# A tibble: 3 x 3
Species count mean
<fct> <int> <dbl>
1 setosa 50 5.01
2 versicolor 50 5.94
3 virginica 50 6.59
I'm hoping to create a function like this:
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(),
mean = mean(col, na.rm = T))
}
And this is the error I'm getting:
sum_cols(iris, Sepal.Length)
Error in mean(col, na.rm = T) : object 'Sepal.Length' not found
Called from: mean(col, na.rm = T)
I have had this problem for a while and even though I tried to get answers in a few previous posts, I haven't quite grasped why the problem occurs and how to get around it.
Any help would be greatly appreciated, thanks!
Try searching for non-standard evaluation (NSE).
Here you can use {{ }} (curly-curly) to tell dplyr that col refers to a column in df rather than to an object in the calling environment.
library(dplyr)
library(rlang)
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(), mean = mean({{col}}, na.rm = T))
}
sum_cols(iris, Sepal.Length)
# A tibble: 3 x 3
# Species count mean
# <fct> <int> <dbl>
#1 setosa 50 5.01
#2 versicolor 50 5.94
#3 virginica 50 6.59
If the installed rlang is older than 0.4.0, which introduced {{ }}, we can use the older method of enquo and !!
sum_cols <- function (df, col) {
df %>%
group_by(Species) %>%
summarise(count = n(), mean = mean(!!enquo(col), na.rm = T))
}
sum_cols(iris, Sepal.Length)
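If the column name arrives as a character string instead, for example when looping over many columns as the question describes, the .data pronoun is another option. This is a sketch only: sum_cols_chr is a hypothetical name introduced here, and it keeps the question's assumption that every data frame has a Species column.

```r
library(dplyr)

# Hypothetical variant of sum_cols that takes the column name as a string.
sum_cols_chr <- function(df, col) {
  df %>%
    group_by(Species) %>%
    # .data[[col]] looks up the column whose name is stored in col
    summarise(count = n(), mean = mean(.data[[col]], na.rm = TRUE))
}

res <- sum_cols_chr(iris, "Sepal.Length")

# One summary table per column name
all_res <- lapply(c("Sepal.Length", "Petal.Width"), sum_cols_chr, df = iris)
```

The string form is convenient for programmatic use, while the {{ }} form keeps the interactive feel of writing sum_cols(iris, Sepal.Length).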

Repetitive filtering with multiple conditions without a loop

I have a large dataset of around 35000 observations and 24 variables (one of which is a time-series), but I can summarise what I want to achieve using iris.
library(tidyverse)
iris.new <- iris %>%
arrange(Species, Sepal.Length, Sepal.Width) %>%
group_by(Species)
unwanted <- iris.new %>%
filter(Sepal.Length > 5 & Sepal.Width==min(Sepal.Width))
while(nrow(unwanted)!=0) {
iris.new <- iris.new %>%
arrange(Species, Sepal.Length, Sepal.Width) %>%
group_by(Species) %>%
filter(!(Sepal.Length > 5 & Sepal.Width == min(Sepal.Width)))
unwanted <- iris.new %>%
filter(Sepal.Length > 5 & Sepal.Width==min(Sepal.Width))
}
I want to filter out only the observations with Sepal.Length > 5 that have the minimum Sepal.Width within each Species (setosa and versicolor have none). After getting rid of the first ones, I repeat the filter to see if there are any more, and I finally used a 'while' loop to do that for me.
Is there a way to filter them without using a loop?
I think this does the trick:
# get minimum Sepal.Width without Sepal.Length > 5
iris_min <- iris %>%
group_by(Species) %>%
filter(Sepal.Length <= 5) %>%
summarize(min_sep_width = min(Sepal.Width))
# check to see that nothing is below our minimum
# or equal to it with a Sepal.Length that's too long
iris_new <- iris %>%
left_join(iris_min, by = c('Species')) %>%
filter(min_sep_width < Sepal.Width |
(min_sep_width == Sepal.Width & Sepal.Length <= 5)) %>%
select(-min_sep_width)
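One way to check the loop-free result is to re-run the question's "unwanted" filter on it and confirm that nothing is left. The sketch below repeats the answer's code so that it runs on its own; leftover is a name introduced here for the check.

```r
library(dplyr)

# Minimum Sepal.Width per species among rows with Sepal.Length <= 5
iris_min <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length <= 5) %>%
  summarize(min_sep_width = min(Sepal.Width))

# Keep rows above that minimum, or at the minimum with Sepal.Length <= 5
iris_new <- iris %>%
  left_join(iris_min, by = "Species") %>%
  filter(min_sep_width < Sepal.Width |
           (min_sep_width == Sepal.Width & Sepal.Length <= 5)) %>%
  select(-min_sep_width)

# The loop's stopping condition: no row with Sepal.Length > 5 sits at the
# group minimum of Sepal.Width any more.
leftover <- iris_new %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5 & Sepal.Width == min(Sepal.Width))

nrow(leftover)
```

Because the per-group minimum is computed only from rows with Sepal.Length <= 5, the smallest remaining width in each group always belongs to at least one such row, so the while loop would stop immediately on iris_new.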

How can I get the overall stats when using group_by in dplyr?

I am using dplyr to calculate some summary statistics across groups, but I would also like to get the same stats for all the data (in the same line of code)
So far I can only think of:
aux.1 <- iris %>%
group_by(Species) %>%
summarise("stat1" = mean(Sepal.Length),
"stat2" = sum(Petal.Length) )
aux.2 <- iris %>%
summarise("stat1" = mean(Sepal.Length),
"stat2" = sum(Petal.Length) )
Is there any way I can get all the stats in one line of code?
You need two separate dplyr chains, but you can put it all together with bind_rows:
aux <- bind_rows(
iris %>%
group_by(Species) %>%
summarise("stat1" = mean(Sepal.Length),
"stat2" = sum(Petal.Length)),
iris %>%
summarise("stat1" = mean(Sepal.Length),
"stat2" = sum(Petal.Length)) %>%
mutate(Species = "All")
)
aux
Species stat1 stat2
1 setosa 5.006000 73.1
2 versicolor 5.936000 213.0
3 virginica 6.588000 277.6
4 All 5.843333 563.7
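To avoid writing the summarise() call twice, the statistics can be wrapped in a small function and applied to both the grouped and the ungrouped data. This is a sketch under that assumption; stats is a hypothetical helper name, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper holding the summary definitions in one place.
stats <- function(df) {
  summarise(df,
            stat1 = mean(Sepal.Length),
            stat2 = sum(Petal.Length))
}

aux <- bind_rows(
  iris %>% group_by(Species) %>% stats(),        # one row per species
  iris %>% stats() %>% mutate(Species = "All")   # one row for all the data
)

aux
```

Since summarise() respects the grouping of its input, the same helper yields per-group rows for the grouped data and a single overall row for the ungrouped data.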
In case you are interested in taking a look at the data.table package, this is easy to achieve:
library(data.table)
# have to make a copy of the internal data.frame for testing
irisTemp <- iris
setDT(irisTemp)
# calculate group statistics
irisTemp[, c("meanVal", "sumVal") := .(mean(Sepal.Length), sum(Petal.Length)),
by="Species"]
This can be a quick and efficient library for large data sets.

Residualize an observation after fitting a model in group_by

I'd like to find the residual of observations after fitting a model per group. I would have thought the code looks something like
library(dplyr)
df %>%
group_by(group) %>%
do(residual=resid(lm(y~x, data=.))) %>%
ungroup()
but this collapses df and leaves no trace of the x variable. What I want is a data frame return that is something like
group | y | x | residual
1) dplyr For purposes of example, this uses the iris data frame that comes with R. I noticed that the code below chokes on the formula if we remove the double quotes but it works OK if the formula is passed as a character string as shown:
iris %>%
group_by(Species) %>%
do(mutate(., resid = resid(lm("Sepal.Length ~ Sepal.Width", .)))) %>%
ungroup()
1a) This variation also works even without a character string formula:
iris %>%
group_by(Species) %>%
do(cbind(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
1b) and this variation also works:
iris %>%
group_by(Species) %>%
do(transform(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
2) Base R We could also consider not using dplyr and just base R like this:
f <- function(ix) resid(lm(Sepal.Length ~ Sepal.Width, iris, subset = ix))
transform(iris, resid = ave(seq_along(Species), Species, FUN = f))
3) data.table If speed is of concern you might want to try data.table which is often the fastest approach and is also quite compact here:
library(data.table)
dt <- as.data.table(iris)
dt[, resid := resid(lm(Sepal.Length ~ Sepal.Width, .SD)), by = Species]
3a) Interestingly this variation of (1) works with data.table input and an actual formula (not character string). Also, do() is not needed:
data.table(iris) %>%
group_by(Species) %>%
mutate(resid = resid(lm(Sepal.Length ~ Sepal.Width, .))) %>%
ungroup()
Note: I have added dplyr issue 1648.
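In dplyr versions that provide group_modify() (0.8.1 and later), the per-group fit can also be written without do(). This is a sketch rather than a replacement for the variations above; with_resid is a name introduced here.

```r
library(dplyr)

with_resid <- iris %>%
  group_by(Species) %>%
  # .x is the data for the current group, so lm() is fitted once per species
  group_modify(~ mutate(.x, resid = resid(lm(Sepal.Length ~ Sepal.Width, data = .x)))) %>%
  ungroup()

head(with_resid)
```

All original columns are kept alongside the new resid column, which is the shape the question asks for (group, y, x, residual).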
