Obtain more variables after grouping, summarising with select (dplyr) - r

My data frame:
date | weekday | price
-----|---------|------
2018 |       1 |    25
2018 |       1 |    35
2019 |       2 |    40
I try to run this code under dplyr:
pi %>%
  group_by(date) %>%
  summarise(price = sum(price, na.rm = TRUE)) %>%
  select(price, date, weekday) %>%
  print()
It doesn't work: select() fails because weekday no longer exists after summarise(). Any solution? Thanks in advance.

Follow the order select --> group_by --> summarise, and include weekday in the grouping so it survives the summarise:
df %>%
  select(price, date, weekday) %>%
  group_by(date, weekday) %>%
  summarise(sum(price, na.rm = TRUE))
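For the toy data at the top, a minimal sketch of what this produces (assuming the data frame is named df):
library(dplyr)

df <- data.frame(date = c(2018, 2018, 2019),
                 weekday = c(1, 1, 2),
                 price = c(25, 35, 40))

df %>%
  select(price, date, weekday) %>%
  group_by(date, weekday) %>%
  summarise(price = sum(price, na.rm = TRUE))
# date weekday price
# 2018       1    60
# 2019       2    40
Because weekday is part of the grouping, it is kept in the result and no extra select is needed afterwards.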

People are correctly suggesting to group_by date and weekday, but if you have a lot of columns, that could be a pain to write out. Here's another idiom I frequently use for data.frames with lots of columns:
pi %>%
  group_by(date) %>%
  mutate(price = sum(price, na.rm = TRUE)) %>%
  filter(row_number() == 1)
This keeps the first row of each group, preserving every column without having to write them all out explicitly.
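As a small variant (not from the original answer), slice(1) expresses the same "first row per group" step:
pi %>%
  group_by(date) %>%
  mutate(price = sum(price, na.rm = TRUE)) %>%
  slice(1)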

Related

Reclassify attributes that are less than x% of total as 'other'

Okay, so I have data like so:
ID Name Job
001 Bill Carpenter
002 Wilma Lawyer
003 Greyson Lawyer
004 Eddie Janitor
I want to group these together for analysis, so any job that appears less than x percent of the whole will be grouped into "Other".
How can I do this? Here is what I tried:
df %>%
  group_by(Job) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(count)) %>%
  drop_na()
And now I know what the percentages are, but how do I integrate this into the original data to make everything below X "Other"? (Let's say less than or equal to 25% is "Other".)
Maybe there's a more straightforward way...
You can try this:
library(dplyr)
df %>%
  count(Job) %>%
  mutate(n = n/sum(n)) %>%
  left_join(df, by = 'Job') %>%
  mutate(Job = replace(Job, n <= 0.25, 'Other'))
To integrate the calculation into the original data, we do a left_join and then replace the values.
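As an alternative sketch (not used in the answer above), forcats can lump rare levels directly. Note that fct_lump_prop() keeps levels that appear at least prop of the time, so a level at exactly 25% is kept; nudge prop up slightly if you need the <= 0.25 rule:
library(dplyr)
library(forcats)

df %>%
  mutate(Job = fct_lump_prop(as.factor(Job), prop = 0.25, other_level = 'Other'))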

How to get percentage value of each column across all rows in R

Using R's tidyverse, how do I get the percentage value of each column across rows? Using the mpg dataset as an example, I've tried the following code:
new_mpg <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(n = n()) %>%
  spread(model, n) %>%
  mutate_if(is.integer, as.numeric)

new_mpg[,-1] %>%
  mutate(sum = rowSums(.))
I'm looking to create the following output:
manufacturer | 4runner4wd | a4        | a4 quattro | a6 quattro | altima
-------------------------------------------------------------------------
audi         | NA         | 0.3888889 | 0.444444   | 0.166667   | NA
However, when I get to
new_mpg[,-1] %>%
  mutate(sum = rowSums(.))
the sum column returns NA, so I'm unable to calculate n()/sum; I just get NA. Any ideas how to fix this?
As #camille mentioned in the comments, you need na.rm = TRUE in the rowSums call. To get the percentage of each model within a manufacturer, you need to first count the number of each model grouped by manufacturer and model, and then get the percentage grouped only by manufacturer. dplyr is smart in this way: it removes one layer of grouping after the summarise, so you just need to add a mutate:
library(dplyr)
library(tidyr)
library(ggplot2)
new_mpg <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(n = n()) %>%
  mutate(n = n/sum(n)) %>%
  spread(model, n) %>%
  mutate_if(is.integer, as.numeric)

new_mpg[,-1] %>%
  mutate(sum = rowSums(., na.rm = TRUE))
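For newer tidyverse versions, here is a sketch of the same computation (assuming tidyr >= 1.0, where spread() is superseded by pivot_wider()):
library(dplyr)
library(tidyr)
library(ggplot2)  # for the mpg dataset

new_mpg <- mpg %>%
  count(manufacturer, model) %>%    # rows per manufacturer/model pair
  group_by(manufacturer) %>%
  mutate(n = n / sum(n)) %>%        # share within each manufacturer
  ungroup() %>%
  pivot_wider(names_from = model, values_from = n)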

computing differences between groups: alternative to spread for multiple computations

I commonly need to compute differences between groups, nested by some interval and/or additional grouping. For computing a single variable, this is easy to accomplish with spread and mutate. Here's a reproducible example with the dataset ChickWeight; don't get distracted by the calculation itself (this is just a toy example), my question is about how to handle a dataset structured like the data frame ChickSum created below.
# reproducible dataset
data(ChickWeight)
ChickSum = ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(mean.weight = mean(weight)) %>%
  ungroup()
Here is how I might go about calculating the change in average chick weight between the first and last time, stratified by diet:
# Compute change in mean weight between first and last time
ChickSum %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)
However, this doesn't work so well with multiple variables:
ChickSum2 = ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()
I can't spread by Time and both count and mean.weight; my current solution is to do two spread-mutate operations (once for count and again for mean.weight) and then join the results.
ChickCountChange = ChickSum2 %>%
  select(-mean.weight) %>%
  spread(Time, count) %>%
  mutate(count.change = `21` - `0`)

ChickWeightChange = ChickSum2 %>%
  select(-count) %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)

full_join(
  select(ChickWeightChange, Diet, weight.change),
  select(ChickCountChange, Diet, count.change),
  by = "Diet")
Is there another approach to these types of computation? I've been trying to conceive of a strategy that combines group_by and purrr::pmap in order to avoid spread but still maintain the advantages of the above approach (such as spread's fill argument for choosing how to handle missing group combinations), but I haven't figured it out. I'm open to suggestions or alternative data structures/ways of thinking about the problem.
You might try re-grouping, then using lag() to calculate the differences. Works for your toy example, but it may be better to see some of your real dataset:
ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup() %>%
  group_by(Diet) %>%
  mutate(count.change = count - lag(count),
         weight.change = mean.weight - lag(mean.weight)) %>%
  filter(Time == max(Time))
Result:
  Diet   Time count mean.weight count.change weight.change
  <fct> <dbl> <int>       <dbl>        <int>         <dbl>
1 1        21    16        178.           -4          136.
2 2        21    10        215.            0          174
3 3        21    10        270.            0          230.
4 4        21     9        239.           -1          198.
So I came up with a potential/partial solution in the process of writing up a reproducible example. Essentially, we use gather to group by the variables themselves:
ChickSum2 %>%
  gather(variable, value, count, mean.weight) %>%
  spread(Time, value) %>%
  mutate(Change = `21` - `0`) %>%
  select(Diet, variable, Change) %>%
  spread(variable, Change)
This works only if the following two conditions are true:
1. All variables are the same type (e.g. both mean.weight and count are numeric).
2. The difference calculation is the same for all variables (e.g. I want to compute last - first for all variables).
I guess the second condition could be relaxed by using e.g. case_when.
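With tidyr >= 1.0 (a sketch, not part of the original answers), pivot_wider() can spread several value columns at once, which sidesteps the two-spread-then-join workaround entirely:
library(dplyr)
library(tidyr)

ChickSum2 %>%
  pivot_wider(names_from = Time,
              values_from = c(count, mean.weight)) %>%
  mutate(count.change = count_21 - count_0,
         weight.change = mean.weight_21 - mean.weight_0)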

Use dplyr to substitute apply

I have a table like this (but the number of columns can differ; I have a number of ref_* + alt_* pairs):
+--------+-------+-------+-------+-------+
| GeneID | ref_a | alt_a | ref_b | alt_b |
+--------+-------+-------+-------+-------+
| a1     | 0     | 1     | 1     | 3     |
| a2     | 1     | 1     | 7     | 8     |
| a3     | 0     | 1     | 1     | 3     |
| a4     | 0     | 1     | 1     | 3     |
+--------+-------+-------+-------+-------+
and I need to filter out rows that have ref_a + alt_a < 10 and ref_b + alt_b < 10. It's easy to do with apply, creating additional columns and filtering, but I'm learning to keep my data tidy, so I'm trying to do it with dplyr.
I would use mutate first to create columns with the sums and then filter by these sums, but I can't figure out how to use mutate in this case.
Edit: the number of columns is not fixed!
You do not need to mutate here. Just do the following:
require(tidyverse)
df %>%
  filter(ref_a + alt_a < 10 & ref_b + alt_b < 10)
If you want to use mutate first you could go with:
df %>%
  mutate(sum1 = ref_a + alt_a, sum2 = ref_b + alt_b) %>%
  filter(sum1 < 10 & sum2 < 10)
Edit: The fact that we don't know the number of variables in advance makes it a bit more complicated. However, I think you could use the following code to perform this task (assuming that the variable names are all formatted with "_a", "_b" and so on). I hope there is a shorter way to perform this task :)
df$GeneID <- as.character(df$GeneID)

df %>%
  gather(variable, value, -GeneID) %>%
  rowwise() %>%
  mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
  ungroup() %>%
  group_by(GeneID, variable) %>%
  summarise(sum = sum(value)) %>%
  filter(sum < 10) %>%
  summarise(keepGeneID = ifelse(n() == (ncol(df) - 1)/2, TRUE, FALSE)) %>%
  filter(keepGeneID == TRUE) %>%
  select(GeneID) -> ids

df %>%
  filter(GeneID %in% ids$GeneID)
Edit 2: After some rework I was able to improve the code a bit:
df$GeneID <- as.character(df$GeneID)

df %>%
  gather(variable, value, -GeneID) %>%
  rowwise() %>%
  mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
  ungroup() %>%
  group_by(GeneID, variable) %>%
  summarise(sum = sum(value)) %>%
  group_by(GeneID) %>%
  summarise(max = max(sum)) %>%
  filter(max < 10) -> ids

df %>%
  filter(GeneID %in% ids$GeneID)
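As a shorter sketch (assuming every ref_x column has a matching alt_x column, as in the table above):
library(dplyr)
library(purrr)

# derive the suffixes ("a", "b", ...) from the column names
suffixes <- unique(sub("^(ref|alt)_", "", setdiff(names(df), "GeneID")))

# one logical vector per pair (TRUE where ref + alt < 10), combined with &
keep <- suffixes %>%
  map(~ df[[paste0("ref_", .x)]] + df[[paste0("alt_", .x)]] < 10) %>%
  reduce(`&`)

df %>%
  filter(keep)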

Finding the largest year interval in R

Suppose I have 10 years of data with a name associated with each year, like the following:
Name Year
A 1990
B 1991
C 1992
A 1993
A 1994
.
.
.
I want to find the name that has been out of use for the longest time.
Can anybody tell me how to do this?
Using dplyr:
library(dplyr)

mutate(your_data, max_year = max(Year)) %>%
  group_by(Name) %>%
  summarize(most_recent = max(Year),
            unused_length = first(max_year) - most_recent) %>%
  ungroup() %>%
  arrange(most_recent)
This will order the names by their most recent use, with the oldest most recent use first.
If you only care about getting that one most out-of-use name, you just need the first row of the result. Add slice(1) to the chain, like so:
mutate(your_data, max_year = max(Year)) %>%
  group_by(Name) %>%
  summarize(most_recent = max(Year),
            unused_length = first(max_year) - most_recent) %>%
  ungroup() %>%
  arrange(most_recent) %>%
  slice(1)
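With dplyr >= 1.0, a sketch of an equivalent shortcut: slice_min() picks the row with the oldest most recent use directly, keeping ties by default:
library(dplyr)

your_data %>%
  group_by(Name) %>%
  summarize(most_recent = max(Year)) %>%
  slice_min(most_recent, n = 1)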
