How can I combine these two pieces of code into one instead of keeping them separate? They work well, but I would like to have them in a single pipeline, i.e. get the total production and the average hours by task and department.
df %>%
  group_by(task, department) %>%
  summarise(across(.cols = c(production),
                   .fns = sum,
                   na.rm = TRUE))

df %>%
  group_by(task, department) %>%
  summarise(across(.cols = c(hours),
                   .fns = mean,
                   na.rm = TRUE))
You can compute both in a single summarise call:
df %>%
  group_by(task, department) %>%
  summarise(mean = mean(hours, na.rm = TRUE), sum = sum(production, na.rm = TRUE))
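If you prefer to keep the across() style from your two original calls, here is a minimal sketch (assuming dplyr >= 1.0; the total_/mean_ column labels are only illustrative, not from the original post):
df %>%
  group_by(task, department) %>%
  summarise(
    across(production, ~ sum(.x, na.rm = TRUE), .names = "total_{.col}"),
    across(hours,      ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    .groups = "drop"
  )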
I want my data frame to return unique rows based on two logical conditions (OR, not AND).
But when I run this, df %>% group_by(sex) %>% distinct(state, education) %>% summarise(n = n()), I get rows deduplicated on the two conditions joined by AND, not OR.
Is there a way to write something like df %>% group_by(sex) %>% distinct(state | education) %>% summarise(n = n()) so that the deduplication is joined by OR, not AND?
Thank you.
You can use tidyr::pivot_longer and then distinct afterwards:
df %>%
  pivot_longer(c(state, education), names_to = "type", values_to = "value") %>%
  group_by(sex) %>%
  distinct(value) %>%
  summarise(n = n())
In this case, pivot_longer simply puts state and education into one column called value.
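To see what that first step does, here is a toy illustration (made-up data, not from the question):
library(dplyr)
library(tidyr)

toy <- tibble(
  sex       = c("F", "M"),
  state     = c("OH", "TX"),
  education = c("BA", "MA")
)

toy %>%
  pivot_longer(c(state, education), names_to = "type", values_to = "value")
# Every original row becomes two rows: one carrying its state, one its education,
# so distinct(value) then deduplicates across either column (an OR, not an AND).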
I have the following question that I am trying to solve with R:
"For each year, first calculate the mean observed value for each country (to allow for settings where countries may have more than 1 value per year, note that this is true in this data set). Then rank countries by increasing MMR for each year.
Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table."
This is what I have so far:
dput(mmr)
tib2 <- mmr %>%
  group_by(country, year) %>%
  summarise(mean = mean(mmr)) %>%
  arrange(mean) %>%
  group_by(country)

tib2
My output is so close to where I need it to be; I just need each country to have only one row (containing its mean ranking).
Here is the result: [output screenshot]
Thank you!
Just repeat the same analysis, but instead of grouping by (country, year), just group by country:
tib2 <- mmr %>%
  group_by(country, year) %>%
  summarise(mean_mmr = mean(mmr)) %>%
  arrange(mean_mmr) %>%
  group_by(country) %>%
  summarise(mean_mmr = mean(mean_mmr)) %>%
  arrange(mean_mmr) %>%
  ungroup() %>%
  slice_min(mean_mmr, n = 10)

tib2
Not sure without the data, but does this work?
tib2 <- mmr %>%
  group_by(country, year) %>%
  summarise(mean1 = mean(mmr)) %>%
  ungroup() %>%
  group_by(year) %>%
  mutate(rank1 = rank(mean1)) %>%
  ungroup() %>%
  group_by(country) %>%
  summarise(rank = mean(rank1)) %>%
  ungroup() %>%
  arrange(rank) %>%
  slice_head(n = 10)
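One aside that is not part of the answer above: base rank() averages ties by default, while dplyr's min_rank() and dense_rank() offer different tie rules for the per-year ranking if needed:
library(dplyr)

x <- c(10, 20, 20, 30)
rank(x)        # 1.0 2.5 2.5 4.0  (ties share the average rank)
min_rank(x)    # 1 2 2 4          (ties share the lowest rank)
dense_rank(x)  # 1 2 2 3          (no gaps after ties)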
I am trying to sort the data based on the median price, i.e. m, but when I added the sort function it threw this error:
Error: Can't combine `locationName` <character> and `m` <double>
How can I sort the data based on a newly mutated column, in my case m, which is the median price?
df %>%
  filter_at(.vars = vars(area), all_vars(grepl('10 Marla', .))) %>%
  group_by(locationName, area, city) %>%
  mutate(m = median(price)) %>%
  select(locationName, area, city, m) %>%
  sort(m, decreasing = TRUE)
We can use sort within mutate
library(dplyr)
df %>%
  filter_at(.vars = vars(area), all_vars(grepl('10 Marla', .))) %>%
  group_by(locationName, area, city) %>%
  mutate(m = median(price)) %>%
  select(locationName, area, city, m) %>%
  mutate(m = sort(m, decreasing = TRUE))
If the intention is to order the rows based on 'm', use arrange:
df %>%
  filter_at(.vars = vars(area), all_vars(grepl('10 Marla', .))) %>%
  group_by(locationName, area, city) %>%
  mutate(m = median(price)) %>%
  select(locationName, area, city, m) %>%
  arrange(desc(m))
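If only one row per group is wanted (rather than every original row carrying a repeated median), a summarise-based variant, sketched here as an alternative rather than part of the answer above, could look like:
df %>%
  filter(grepl('10 Marla', area)) %>%
  group_by(locationName, area, city) %>%
  summarise(m = median(price), .groups = "drop") %>%
  arrange(desc(m))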
This question has come up before and there are some solutions, but none that I could find for this specific case. For example:
library(dplyr)
library(ggplot2) # for the diamonds data set

my_diamonds <- diamonds %>%
  mutate(blah_var1 = rnorm(n()),
         blah_var2 = rnorm(n()),
         blah_var3 = rnorm(n()),
         blah_var4 = rnorm(n()),
         blah_var5 = rnorm(n()))
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table), .groups = 'drop') %>%
  summarise_at(vars(contains('blah')), mean)
I want a new data frame showing the max clarity, min table, and the mean of each of the blah variables. The above returned an empty tibble. Based on some other SO posts, I tried using mutate and then summarise_at:
my_diamonds %>%
  group_by(cut) %>%
  mutate(MaxClarity = max(clarity),
         MinTable = min(table)) %>%
  summarise_at(vars(contains('blah')), mean)
This returns a tibble, but only for the blah variables; MaxClarity and MinTable are missing.
Is there a way to combine summarise and summarise_at in the same dplyr chain?
One issue with summarise is that after the first call of summarise we keep only the grouping column (i.e. 'cut') along with the summarised columns (i.e. 'MaxClarity' and 'MinTable'). In addition, after that first summarise step the grouping is removed by .groups = 'drop'.
library(dplyr) # version >= 1.0
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table),
            across(contains('blah'), mean, na.rm = TRUE), .groups = 'drop')
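As an aside not in the original answer: since dplyr 1.1.0, passing extra arguments such as na.rm through across()'s ... is deprecated, so on newer versions the same call is usually written with a lambda:
my_diamonds %>%
  group_by(cut) %>%
  summarise(MaxClarity = max(clarity),
            MinTable = min(table),
            across(contains('blah'), ~ mean(.x, na.rm = TRUE)),
            .groups = 'drop')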
This question is a follow-up to this thread.
I'd like to perform three actions on a disk.frame:
1. Count the distinct values of the field id grouped by two columns (key_a and key_b).
2. Count the distinct values of the field id grouped by the first of the two columns (key_a).
3. Add a column with the distinct values for the first column divided by the distinct values across both columns.
This is my code:
my_df <-
  data.frame(
    key_a = rep(letters, 384),
    key_b = rep(rev(letters), 384),
    id = sample(1:10^6, 9984)
  )
my_df %>%
  select(key_a, key_b, id) %>%
  chunk_group_by(key_a, key_b) %>%
  # stage one
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
My data is a disk.frame, not a data frame, and it has 100M rows and 8 columns.
I'm following the two-step instructions described in this documentation.
I'm concerned that the collect will crash my machine since it brings everything into RAM.
Do I have to use collect in order to use dplyr group_bys in disk.frame?
You should always use srckeep to load only those columns you need into memory.
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>% # no need if you use srckeep
  chunk_group_by(key_a, key_b) %>%
  # stage one
  chunk_summarize(count = n_distinct(id)) %>%
  collect %>%
  group_by(key_a, key_b) %>%
  # stage two
  mutate(count_summed = sum(count)) %>%
  group_by(key_a) %>%
  mutate(count_all = sum(count)) %>%
  ungroup() %>%
  mutate(percent_of_total = count_summed / count_all)
collect will only bring the results of computing chunk_group_by and chunk_summarize into RAM; it shouldn't crash your machine.
You must use collect, just like in other systems such as Spark.
But if you are computing n_distinct, that can be done in one stage anyway:
my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # stage one
  summarize(count = n_distinct(id)) %>%
  collect
If you are really concerned about RAM usage, you can reduce the number of workers to 1:
setup_disk.frame(workers = 1)  # limit disk.frame to a single worker

my_df %>%
  srckeep(c("key_a", "key_b", "id")) %>%
  # select(key_a, key_b, id) %>%
  group_by(key_a, key_b) %>%
  # stage one
  summarize(count = n_distinct(id)) %>%
  collect

setup_disk.frame()  # go back to the default worker setup