dplyr summarise and then summarise_at in the same pipe - r

This question has come up before, and there are some solutions, but none that I could find for this specific case. For example:
my_diamonds <- diamonds %>%
mutate(blah_var1 = rnorm(n()),
blah_var2 = rnorm(n()),
blah_var3 = rnorm(n()),
blah_var4 = rnorm(n()),
blah_var5 = rnorm(n()))
my_diamonds %>%
group_by(cut) %>%
summarise(MaxClarity = max(clarity),
MinTable = min(table), .groups = 'drop') %>%
summarise_at(vars(contains('blah')), mean)
I want a new data frame showing the max clarity, the min table and the mean of each of the blah variables. The above returned an empty tibble. Based on some other SO posts, I tried using mutate and then summarise_at:
my_diamonds %>%
group_by(cut) %>%
mutate(MaxClarity = max(clarity),
MinTable = min(table)) %>%
summarise_at(vars(contains('blah')), mean)
This returns a tibble, but only for the blah variables; MaxClarity and MinTable are missing.
Is there a way to combine summarise and summarise_at in the same dplyr chain?

The problem is that after the first call to summarise, only the grouping column ('cut') and the summarised columns ('MaxClarity' and 'MinTable') remain, so there are no 'blah' columns left for summarise_at to select. In addition, after the first summarise step, the grouping is removed by .groups = 'drop'. Instead, compute everything in a single summarise call, using across for the 'blah' columns:
library(dplyr) # version >= 1.0
my_diamonds %>%
group_by(cut) %>%
summarise(MaxClarity = max(clarity),
MinTable = min(table),
across(contains('blah'), ~ mean(.x, na.rm = TRUE)), .groups = 'drop')
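To see why the two-step chain loses the 'blah' columns, it can help to inspect the intermediate result. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the diamonds data) purely for illustration:

```r
library(dplyr)

# Hypothetical stand-in for my_diamonds, small enough to inspect by eye
toy <- data.frame(
  cut       = c("Fair", "Fair", "Good"),
  table     = c(55, 60, 58),
  blah_var1 = c(0.1, 0.3, 0.5)
)

# After the first summarise(), only the grouping column and the new summary remain
step1 <- toy %>%
  group_by(cut) %>%
  summarise(MinTable = min(table), .groups = "drop")
names(step1)

# Doing everything in one summarise() keeps access to the original columns
combined <- toy %>%
  group_by(cut) %>%
  summarise(MinTable = min(table),
            across(contains("blah"), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
names(combined)
```

Comparing the two `names()` results shows that `blah_var1` is no longer available once the first summarise has run.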

Related

How to combine summarize and summarize_if in dplyr

I would like to combine a summarize statement (to count the number of observations) with a summarise_if statement (to summarise all numeric variables).
Using data("iris"), I would like to:
Count the number of observations per Species and add this count as a column in the new table.
Summarise all numeric variables (Sepal.Length,Sepal.Width, Petal.Length, Petal.Width) by Species.
I can do these steps separately with the code below:
Number 1.
iris %>%
group_by(Species)%>%
summarise(n = n())
Number 2.
iris %>%
group_by(Species)%>%
summarise_if(is.numeric, median, na.rm = TRUE)
Q: How to combine these calculations into one step?
Just piping one after the other gives me a different result. My desired output is this:
Use across:
iris %>%
group_by(Species) %>%
summarise(n = n(), across(where(is.numeric), ~ median(.x, na.rm = TRUE)))
For those interested, the same thing in data.table:
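library(data.table)
iris_dt <- as.data.table(iris) # work on a copy and leave the built-in iris data frame untouched
iris_dt[, j = data.frame(n = .N, lapply(.SD, median, na.rm = TRUE)),
.SDcols = names(iris_dt)[sapply(iris_dt, is.numeric)],
by = Species]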
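One caveat with the dplyr answer above, noted in dplyr's column-wise operations vignette: a column created earlier in the same summarise() call is itself numeric, so across(where(is.numeric), ...) will pick it up too. With median that is harmless, because the median of a single count is the count itself, but with a function such as sd the count would turn into NA. A minimal sketch of one way around this, assuming you are free to compute n() last and then move it (the name `counts_last` is only illustrative):

```r
library(dplyr)

# n() is computed after across(), so across() never sees the new n column
counts_last <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)),
            n = n(),
            .groups = "drop") %>%
  relocate(n, .after = Species) # put the count next to the grouping column

counts_last
```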

How do you efficiently group by multiple columns in dplyr

With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have examples where I want to group by 10+ columns, and it is pretty horrible to write it out if you can just parse their names.
across with curly-curly is the answer (even though it rarely makes sense to group by all of your columns):
cols = colnames(df)
df %>%
group_by(across({{cols}})) %>%
summarise(count = n())
You can use across with any of the tidy selectors. For example, if you want all columns:
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or, if you want to pass a character vector of column names:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
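Curly-curly is mainly intended for use inside functions, where it forwards a column selection supplied by the caller. A minimal sketch, assuming a hypothetical helper named `count_by` (not part of dplyr):

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 1, 3, 1, 4, 1, 5), b = c(2, 3, 4, 1, 2, 3, 4, 5))

# Hypothetical helper: group by whatever selection the caller passes, then count rows
count_by <- function(data, cols) {
  data %>%
    group_by(across({{ cols }})) %>%
    summarise(count = n(), .groups = "drop")
}

# Any tidyselect expression can be passed through
by_ab  <- count_by(df, c(a, b))
by_all <- count_by(df, everything())

# A character vector still goes through all_of()
group_cols <- c("a", "b")
by_names <- count_by(df, all_of(group_cols))
```

All three calls group by the same two columns here, so they should produce the same counts.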

Is there a way to combine across() and mutate() if I am referencing column names from a list?

The dataset below has columns with very similar names and some values which are NA.
library(tidyverse)
dat <- data.frame(
v1_min = c(1,2,4,1,NA,4,2,2),
v1_max = c(1,NA,5,4,5,4,6,NA),
other_v1_min = c(1,1,NA,3,4,4,3,2),
other_v1_max = c(1,5,5,6,6,4,3,NA),
y1_min = c(3,NA,2,1,2,NA,1,2),
y1_max = c(6,2,5,6,2,5,3,3),
other_y1_min = c(2,3,NA,1,1,1,NA,2),
other_y1_max = c(5,6,4,2,NA,2,NA,NA)
)
head(dat)
In this example, v1 and y1 would be what I would consider the common "categories" among the columns. In order to get something similar with my current dataset, I had to use gsub to tease these out:
cats<-dat %>%
names() %>%
gsub("^(.*)_(min|max)", "\\1",.) %>%
gsub("^(.*)_(.*)", "\\2",.) %>%
unique()
Now, my goal is to mutate a new min and a new max column for each of those categories. So far the code below works just fine.
dat %>%
rowwise() %>%
mutate(min_v1 = min(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(max_v1 = max(c_across(contains(cats[1])), na.rm=T)) %>%
mutate(min_y1 = min(c_across(contains(cats[2])), na.rm=T)) %>%
mutate(max_y1 = max(c_across(contains(cats[2])), na.rm=T))
However, the number of categories in my current dataset is quite a bit bigger than 2. Is there a way to implement this without writing a mutate call for every category?
I've tried a few of the suggestions on this post but haven't quite been able to extend them to this problem.
You can use one of the purrr map functions here to repeat the calculation for each common category:
library(dplyr)
library(purrr)
result <- bind_cols(dat, map_dfc(cats,
~dat %>%
rowwise() %>%
transmute(!!paste('min', .x, sep = '_') := min(c_across(matches(.x)), na.rm = TRUE),
!!paste('max', .x, sep = '_') := max(c_across(matches(.x)), na.rm = TRUE))))
result
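As a possible alternative to rowwise(), the same row-wise minimum and maximum can be sketched with the vectorised base functions pmin() and pmax(). The data below is a cut-down, made-up version of dat, and the names `new_cols` and `result_vec` are only illustrative. One behavioural difference to be aware of: for a row where every value is NA, pmin()/pmax() with na.rm = TRUE return NA, whereas min()/max() return Inf/-Inf with a warning.

```r
library(dplyr)

# Cut-down, made-up version of dat for illustration
dat <- data.frame(
  v1_min       = c(1, NA, 4),
  v1_max       = c(3, 5, 6),
  other_v1_min = c(2, 1, NA),
  y1_min       = c(3, 2, 1),
  y1_max       = c(6, NA, 5)
)
cats <- c("v1", "y1")

# For each category, take the parallel min/max across all matching columns
new_cols <- lapply(cats, function(category) {
  cols <- as.list(select(dat, matches(category)))
  out <- data.frame(
    do.call(pmin, c(cols, list(na.rm = TRUE))),
    do.call(pmax, c(cols, list(na.rm = TRUE)))
  )
  names(out) <- paste(c("min", "max"), category, sep = "_")
  out
})

result_vec <- bind_cols(c(list(dat), new_cols))
result_vec
```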

Can't combine <character> and <double> on adding sort function

I am trying to sort the data by the median price, i.e. m, but when I added the sort function it threw this error:
Error: Can't combine locationName <character> and m <double>
How can I sort the data based on a newly mutated column, in my case m, which is the median price?
df %>%
filter_at(.vars= vars(area), all_vars(grepl('10 Marla',.))) %>%
group_by(locationName,area,city) %>%
mutate(m = median(price)) %>%
select(locationName,area,city,m) %>%
sort(m,decreasing = TRUE)
We can use sort within mutate, which sorts the values of m within each group:
library(dplyr)
df %>%
filter_at(.vars= vars(area), all_vars(grepl('10 Marla',.))) %>%
group_by(locationName,area,city) %>%
mutate(m = median(price)) %>%
select(locationName,area,city,m) %>%
mutate(m = sort(m,decreasing = TRUE))
If the intention is to order the rows based on 'm', use arrange
df %>%
filter_at(.vars= vars(area), all_vars(grepl('10 Marla',.))) %>%
group_by(locationName,area,city) %>%
mutate(m = median(price)) %>%
select(locationName,area,city,m) %>%
arrange(desc(m))
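If one row per group is enough, a variant of the arrange approach is to use summarise instead of mutate, so the median appears once per location rather than on every listing. The sketch below assumes a made-up data frame (`listings` and its values are hypothetical) with the same column names as the question:

```r
library(dplyr)

# Hypothetical data with the same columns as in the question
listings <- data.frame(
  locationName = c("A", "A", "B", "B", "C"),
  area         = c("10 Marla", "10 Marla", "10 Marla", "10 Marla", "5 Marla"),
  city         = "X",
  price        = c(100, 200, 500, 700, 50)
)

median_by_location <- listings %>%
  filter(grepl("10 Marla", area)) %>%            # keep only the 10 Marla listings
  group_by(locationName, area, city) %>%
  summarise(m = median(price), .groups = "drop") %>%
  arrange(desc(m))                               # order rows by median price, highest first

median_by_location
```

The general point is that sort() works on vectors, while arrange() is the dplyr verb for reordering the rows of a data frame.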

Incorporating na.rm=TRUE into Summarise_Each for Multiple Functions in dplyr

So I have a dplyr table movie_info_comb from which I am calculating various statistics on one column metascore. Here is the code:
summarise_each_(movie_info_comb, funs(min,max,mean,sum,sd,median,IQR),"metascore")
How do I incorporate na.rm=TRUE? I've only seen examples in which one statistic is being calculated, and I'd hate to have to repeat this once for each function.
Thanks in advance.
You can do this with lazy evaluation, using the lazyeval package:
library(dplyr)
library(lazyeval)
na.rm = function(FUN_string)
lazy(FUN(., na.rm = TRUE)) %>%
interp(FUN = FUN_string %>% as.name)
na.rm.apply = function(FUN_strings)
FUN_strings %>%
lapply(na.rm) %>%
setNames(FUN_strings)
mtcars %>%
select(mpg) %>%
summarize_each(
c("min","max","mean","sum","sd","median","IQR") %>%
na.rm.apply)
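In current dplyr (version 1.0 or later), summarise_each is deprecated, and the same idea can be sketched with across() and a named list of functions. This is one possible translation rather than the original answer's method; the names `stats`, `with_na_rm` and `mpg_stats` are made up for the example:

```r
library(dplyr)

# The statistics to compute, as a named list of ordinary functions
stats <- list(min = min, max = max, mean = mean, sum = sum,
              sd = sd, median = median, IQR = IQR)

# Wrap a function so that it is always called with na.rm = TRUE
with_na_rm <- function(f) {
  force(f)
  function(x) f(x, na.rm = TRUE)
}

mpg_stats <- mtcars %>%
  summarise(across(mpg, lapply(stats, with_na_rm)))

mpg_stats
```

With the default naming scheme, across() labels each result as column name, underscore, function name, for example mpg_mean.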

Resources