How do you efficiently group by multiple columns in dplyr


With dplyr you can group by columns like this:
library(dplyr)
df <- data.frame(a=c(1,2,1,3,1,4,1,5), b=c(2,3,4,1,2,3,4,5))
df %>%
group_by(a) %>%
summarise(count = n())
If I want to group by two columns all the guides say:
df %>%
group_by(a,b) %>%
summarise(count = n())
But can I not feed the group_by() parameters more efficiently somehow, rather than having to type them in explicitly, e.g. like:
cols = colnames(df)
df %>%
group_by(cols) %>%
summarise(count = n())
I have cases where I want to group by 10+ columns, and it is tedious to write them all out when you could just pass their names.

across() combined with curly-curly ({{ }}) is one answer (even though grouping by all of your columns rarely makes sense):
cols = colnames(df)
df %>%
group_by(across({{cols}})) %>%
summarise(count = n())

You can use across() with any of the tidyselect helpers. For example, if you want all columns:
df %>%
group_by(across(everything())) %>%
summarise(count = n())
Or, if you want to group by a character vector of column names:
cols <- c("a","b")
df %>%
group_by(across(all_of(cols))) %>%
summarise(count = n())
See help("language", package="tidyselect") for all the selection options.
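As a small sketch of the same idea with a different selector, the example below groups by every column whose name starts with a given prefix. The data frame `sales` and its column names are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical data: two grouping columns sharing the "grp_" prefix
sales <- data.frame(
  grp_region = c("N", "N", "S", "S", "S"),
  grp_year   = c(2020, 2020, 2020, 2021, 2021),
  amount     = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Group by all columns matching the prefix without naming each one
counts <- sales %>%
  group_by(across(starts_with("grp_"))) %>%
  summarise(count = n(), .groups = "drop")

counts
```

Any other tidyselect helper (for instance all_of() with a character vector) should slot into the same place inside across().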

Related

Unique rows based on two logical conditions

I want my dataframe to return unique rows based on two logical conditions (OR not AND).
But when I ran this, df %>% group_by(sex) %>% distinct(state, education) %>% summarise(n=n()) I got deduplicated rows based on the two conditions joined by AND not OR.
Is there a way to get something like this df %>% group_by(sex) %>% distinct(state | education) %>% summarise(n=n()) so that the deduplicated rows will be joined by OR not AND?
Thank you.

You can use tidyr::pivot_longer and then distinct afterwards:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(c(state, education), names_to = "type", values_to = "value") %>%
group_by(sex) %>%
distinct(value) %>%
summarise(n = n())
In this case, pivot_longer simply puts state and education into one column called value.
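To see what this pipeline counts, here is a minimal sketch on a made-up data frame (`people` and its values are assumptions, not the asker's data). Note that state and education need compatible types to share one column, and a value that appears in both columns within a group is counted once.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two women in NY with different degrees,
# two men in CA with the same degree
people <- data.frame(
  sex       = c("F", "F", "M", "M"),
  state     = c("NY", "NY", "CA", "CA"),
  education = c("BA", "MS", "BA", "BA"),
  stringsAsFactors = FALSE
)

result <- people %>%
  # Stack state and education into a single "value" column
  pivot_longer(c(state, education), names_to = "type", values_to = "value") %>%
  group_by(sex) %>%
  # Keep each distinct value once per sex
  distinct(value) %>%
  summarise(n = n())

result
```

Here the F group keeps NY, BA and MS, and the M group keeps CA and BA.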

How to reuse parts of long chain of pipe operators in R?

I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...

This is similar in syntax to the desired pseudocode:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
Starting a pipe sequence with a dot (.) creates what magrittr calls a functional sequence, which is in effect close to wrapping the steps in a function. See ?functions and try magrittr::functions(subchain).
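A stripped-down sketch of the same mechanism, using a plain numeric vector instead of starwars (the name `first_three_sum` is made up for illustration):

```r
library(magrittr)

# A functional sequence: the leading dot stands for the future input
first_three_sum <- . %>%
  head(3) %>%
  sum()

# The sequence behaves like an ordinary one-argument function
is.function(first_three_sum)

# Keeps the first three elements (1, 2, 3) and sums them
first_three_sum(c(1, 2, 3, 4))
```

Because the result is a function, it can be dropped into any longer pipe as `... %>% first_three_sum()`.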

summing up the first 5 elements of a list

I have a data frame that contains a column with varying numbers of integer values. I need to take the first five of these values and sum them up. I found a way to do it for one, but can't seem to generalize it to loop through all:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()

In base R, you can do:
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
Or, using dplyr and purrr:
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise:
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))
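A quick check of the base R version on made-up input (the `results` data frame below is an assumption standing in for the asker's data). It also shows that rows with fewer than five values are handled, since head() simply returns what is there.

```r
# Hypothetical data: one row with six values, one with only two
results <- data.frame(
  occupied = c("1,2,3,4,5,6", "10,20"),
  stringsAsFactors = FALSE
)

# Split each string on commas, keep at most five pieces, convert and sum
totals <- sapply(
  strsplit(results$occupied, ","),
  function(x) sum(as.numeric(head(x, 5)))
)

totals
```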

We can use separate_rows to split the 'occupied' column and expand the rows, then group by row number and take the sum of the first five elements:
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)

get sequence of group in R

So I have already done what I need but I am sure that there is a better way to do that
library(tidyverse)
library(schrute)
office <- schrute::theoffice
top_3_lines_per_episode <- office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>% ungroup()
epi_num<-top_3_lines_per_episode %>%
select(episode_name) %>%
unique() %>%
mutate(episode_num=row_number())
top_3_lines_per_episode %>%
inner_join(epi_num)
I want to generate the episode_num column, which the code above does, but I would like a simpler way. I have looked at group_indices on SO, but I think that's deprecated. I need a better way, preferably in the tidyverse.

I am not aware of group_indices being deprecated; in fact, it seems to be a perfect option here.
library(dplyr)
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup() %>%
mutate(episode_num = group_indices(., season,episode,episode_name,imdb_rating))
Another option is to combine the columns with unite and then match to get episode_num.
office %>%
group_by(season,episode,episode_name,imdb_rating) %>%
count(character) %>%
top_n(3, n) %>%
ungroup() %>%
tidyr::unite(temp, season,episode,episode_name,imdb_rating, remove = FALSE) %>%
mutate(episode_num = match(temp, unique(temp))) %>%
select(-temp)
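As far as I know, from dplyr 1.0.0 onwards passing grouping columns directly to group_indices() is deprecated, and cur_group_id() inside a grouped mutate() is the suggested replacement. A minimal sketch on made-up data (the `episodes` data frame is an assumption, not the schrute dataset):

```r
library(dplyr)

# Hypothetical data: three episodes, the first with two rows
episodes <- data.frame(
  season    = c(1, 1, 1, 2),
  episode   = c(1, 1, 2, 1),
  character = c("Jim", "Pam", "Jim", "Dwight"),
  stringsAsFactors = FALSE
)

numbered <- episodes %>%
  group_by(season, episode) %>%
  # cur_group_id() returns the integer id of the current group
  mutate(episode_num = cur_group_id()) %>%
  ungroup()

numbered
```

Group ids follow the sorted order of the grouping keys, so rows from the same season and episode share one number.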

Summarise with multiple conditions based on years

I would like to create a set of columns holding the paper count for each year, which means applying multiple conditions inside summarise in dplyr.
This is my code:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(papers_count = n()) %>%
arrange(desc(papers_count))
The code above gives me two columns, 'word' and 'papers_count', I would like to create more columns like papers_count (papers_count1990, papers_count1991, etc..) based on each year between 1990 and 2017.
I am looking for something like this:
words_list <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(word) %>%
summarise(tot_papers_count = n(), papers_count_1991 = n()year="1991", ...) %>%
arrange(desc(papers_count))
Does anybody have any suggestions?

I would suggest adding year to the group_by, and then using spread to create multiple summary columns.
library(tidyr)
words_list_by_year <- data %>%
select(Keywords, year) %>%
unnest_tokens(word, Keywords) %>%
filter(between(year,1990,2017)) %>%
group_by(year,word) %>%
summarise(papers_count = n()) %>%
spread(year,papers_count,fill=0)
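In current tidyr, spread() is superseded by pivot_wider(), which can also add a prefix to the new column names. A sketch of the same reshaping on made-up data (the `keywords` data frame and its words are assumptions, and the tokenising step is left out):

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per (word, year) occurrence
keywords <- data.frame(
  word = c("model", "model", "graph", "model"),
  year = c(1990, 1991, 1990, 1990),
  stringsAsFactors = FALSE
)

wide <- keywords %>%
  # Count occurrences per word and year
  count(word, year, name = "papers_count") %>%
  # One column per year, zero where a word did not occur
  pivot_wider(
    names_from   = year,
    values_from  = papers_count,
    names_prefix = "papers_count_",
    values_fill  = 0
  )

wide
```

This yields columns such as papers_count_1990 and papers_count_1991, close to the naming the question asks for.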
