Arrange a grouped_df by group variable not working - r

I have a data.frame that contains client names, years, and several revenue numbers from each year.
df <- data.frame(client = rep(c("Client A", "Client B", "Client C"), 3),
                 year = rep(c(2014, 2013, 2012), each = 3),
                 rev = rep(c(10, 20, 30), 3))
I want to end up with a data.frame that aggregates the revenue by client and year. I then want to sort the data.frame by year then by descending revenue.
library(dplyr)
df1 <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))
However, when using the code above, the arrange() function doesn't change the order of the grouped data.frame at all. When I run the code below and coerce to a normal data.frame first, it works.
library(dplyr)
df1 <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  data.frame() %>%
  arrange(year, desc(tot))
Am I missing something, or will I need to do this every time I try to arrange a grouped_df by a grouping variable?
R Version: 3.1.1
dplyr package version: 0.3.0.2
EDIT 11/13/2017:
As noted by lucacerone, beginning with dplyr 0.5, arrange once again ignores groups when sorting. So my original code now works in the way I initially expected it would.
arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.

Try switching the order of your group_by statement:
df %>%
  group_by(year, client) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))
I think arrange() is ordering within groups; after summarize, the last grouping variable is dropped, so in your first example it is arranging rows within each client group. Switching the order to group_by(year, client) seems to fix it because the client grouping gets dropped after summarize, leaving the result grouped by year, so arranging within groups gives the order you want.
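A quick way to check this (a hedged illustration; group_vars() is available in dplyr releases newer than the one in the question) is to inspect which grouping survives summarise():
df %>% group_by(client, year) %>% summarise(tot = sum(rev)) %>% group_vars()
# "client" remains as a group, so arrange() sorts within each client
df %>% group_by(year, client) %>% summarise(tot = sum(rev)) %>% group_vars()
# "year" remains as a group, so arrange(year, desc(tot)) orders as intended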
Alternatively, there is the ungroup() function
df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  ungroup() %>%
  arrange(year, desc(tot))
Edit (@lucacerone): since dplyr 0.5 this does not work anymore; from the dplyr release notes:
Breaking changes: arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it's not going to change again, as more changes will just cause more confusion.

Recent versions of dplyr (at least from dplyr_0.7.4) allow you to arrange within groups: you just have to set .by_group = TRUE in the arrange() call. More information is available here
In your example, try:
library(dplyr)
df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  arrange(desc(tot), .by_group = TRUE)
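If you want the result sorted by year first while keeping the pipeline grouped, a hedged variation (assuming dplyr >= 0.7.4 as above) is to group by year and client so that year is the grouping variable left after summarise(); .by_group = TRUE then sorts by year before descending total:
df %>%
  group_by(year, client) %>%
  summarise(tot = sum(rev)) %>%          # result stays grouped by year
  arrange(desc(tot), .by_group = TRUE)   # sorts by year, then by descending tot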

Related

Filtering by Numerical Variable but Need to Satisfy Multiple Categorical Groups

I'm working with a modified version of the babynames dataset, which can be obtained by installing the babynames package and calling:
# to install the package
install.packages('babynames')
# to load the package
library(babynames)
# to get the only one dataframe of interest from the package
babynames <- babynames::babynames
# the modified data that I'm working with
babynames_prop_sexes <- babynames %>%
  select(-prop, -year) %>%
  group_by(name, sex) %>%
  mutate(total_occurence = sum(n))
I need to pick out the names that have more than 10000 occurrences for both sexes. How can I approach this? (Preferably using dplyr, but any method is welcome.)
Thanks in advance for any help!
There might be a more elegant solution, but this should get you a list of names that appear with more than 10000 entries as both an M and an F.
For the method, I just kept going with dplyr verbs. After using filter to drop the entries that appear 10000 times or fewer, I could then group_by the name and use tally(), knowing that n == 2 when a name appears twice, once for M and once for F.
large_total_both_genders_same_name <- babynames %>%
  group_by(name, sex) %>%
  summarize(total = sum(n)) %>%
  filter(total > 10000) %>%
  arrange(name) %>%
  group_by(name) %>%
  tally() %>%
  arrange(desc(n)) %>%
  filter(n == 2) %>%
  dplyr::select(name)
And if you want to filter your original data by that shortlist of names, you can use a semi_join on the table we created (the code below uses the equivalent %in% filter) to shorten up the list. In this case it wouldn't be obvious what you are looking at unless you also included the year column, which you removed.
original_babynames_shortened <- babynames_prop_sexes %>%
  filter(name %in% large_total_both_genders_same_name$name)
But anyway, this is a common process: create a summary table of some kind that is saved as its own 'intermediary' table, so to speak, then join that to your original as a filter. Sometimes this can all be done in one go, but in my opinion it's often easier to break it into two pieces.
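For completeness, a hedged sketch of the semi_join() variant mentioned above (it assumes the same babynames_prop_sexes and large_total_both_genders_same_name objects and should give the same rows as the %in% filter):
original_babynames_shortened <- babynames_prop_sexes %>%
  semi_join(large_total_both_genders_same_name, by = "name")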

Tidyverse: Unnest tibble by group into separate df/tibble

I am currently trying to find a short and tidy way to unnest a nested tibble that has two grouping variables and a tibble/df as data for each observation, turning it into a tibble that keeps only one of the grouping variables and holds the respective data in a df (or tibble). I will illustrate this with the starwars dataset provided by tidyverse and show the three solutions I have come up with so far.
library(tidyverse)
#Set example data: 2 grouping variables, name & sex, and one data column with a tibble/df for each observation
tbl_1 <- starwars %>% group_by(name, sex) %>% nest() %>% ungroup()
#1st Solution: Neither short nor tidy but gets me the result I would like to have in the end
tbl_2 <- tbl_1 %>%
  group_by(sex) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map(.$data, ~ .x %>% group_by(name) %>% unnest(c(data))))
#2nd Solution: A lot shorter and more neat but still not what I have in mind
tbl_2 <- tbl_1 %>%
  nest(-sex) %>%
  mutate(data = map(.$data, ~ .x %>% unnest(cols = c(data))))
#3rd Solution: The best so far, short and readable
tbl_2 <- tbl_1 %>%
  unnest(data) %>%
  group_by(name) %>%
  nest(-sex)
##Solution as I have it in mind / I think should be somehow possible.
tbl_2 <- tbl_1 %>% group_by(sex) %>% unnest() #This however gives one large tibble grouped by sex, not two separate tibbles in a nested tibble
Is the solution I have in mind even possible in the first place, or is the 3rd solution as close as it gets in terms of being short, readable and tidy?
In terms of my actual workflow, tbl_1 is the "work horse" of my analysis and not subject to change; I use it to apply analyses or ggplot via map for figures etc., sometimes at the level of "name" and sometimes at the level of "sex".
I appreciate any input!
Update:
User @caldwellst has given an answer sufficient for me to mark this question as answered, unfortunately only as a comment. After waiting a bit, I would now accept any other answer with the same suggestion in order to mark this question as solved.
As @caldwellst has pointed out in a comment, the group_by is unnecessary; the provided solution is short and tidy enough for me in that case:
tbl_1 %>% unnest(data) %>% nest(data = -sex)
I will remove my answer and accept a different one if @caldwellst posts the comment as an answer or somebody else provides a different but equally suitable one.
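For reference, a minimal sketch of the suggested one-liner plus a typical use of the per-sex data frames (it assumes tbl_1 as defined earlier; the n_characters count is only an illustrative example of the map-based workflow described above):
library(tidyverse)
tbl_2 <- tbl_1 %>% unnest(data) %>% nest(data = -sex)
# e.g. iterate over the nested data frames, here just counting rows per sex
tbl_2 %>% mutate(n_characters = map_int(data, nrow))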

Using summarise alters the result of my computation

I have the following code, where I don't pipe into summarise:
library(tidyverse)
library(nycflights13)
depArrDelay <- flights %>%
  filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
  group_by(dep_delay, arr_delay)
Now doing
cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9148028 which is the correct value for my calculation
Now I add the %>% summarise (...) as seen below
depArrDelay <- flights %>%
  filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
  group_by(dep_delay, arr_delay) %>%
  summarise(count = n())
Now doing: cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9260394
So now the correlation is altered. Why is this happening? From what I know, summarise should only throw away all other columns that are not mentioned, not alter values. Have I missed something, and how can I avoid summarise altering the correlation?
As already mentioned in the comments, summarise reduces the number of rows. If you need the count without changing the number of rows, you can use add_count.
library(nycflights13)
library(dplyr)
temp <- flights %>%
  filter_at(vars(c(dep_delay, arr_delay, distance)), all_vars(!is.na(.))) %>%
  add_count(dep_delay, arr_delay)
If you then check for correlation you get the same value as earlier.
cor(temp$dep_delay, temp$arr_delay)
#[1] 0.9148027589
If there are more columns and you need only a limited set of them for your analysis, you can select the relevant columns using select.
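A hedged sketch of that last suggestion (the column choice is illustrative; it keeps only the columns needed for the correlation plus the count, and mirrors the cor() call from the question):
temp_small <- temp %>% select(dep_delay, arr_delay, n)
cor(temp_small$dep_delay, temp_small$arr_delay)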

R group_by %>% full_join losing NA records

Consider these two data frames:
t1 <- data.frame(Time = 1:3, Cat = rep("A", 3), SomeValue = rep("t1", 3))
t2 <- data.frame(Time = c(1, 2, 3, 1, 3), Cat = rep("A", 5),
                 Id = c(1, 1, 1, 2, 2), SomeOtherValue = c(1, 2, 3, 4, 5))
In my application, I need to do a full join and work with missing records/values. Doing a partial full_join on subsets (by the grouping variable) works, but I lose my missing values when I try the unfiltered approach.
These two calls combined give me 6 records:
t2 %>% group_by(Id) %>% filter(Id==2) %>% full_join(t1,by=c("Time","Cat"))
t2 %>% group_by(Id) %>% filter(Id==1) %>% full_join(t1,by=c("Time","Cat"))
This will give me 5, where the missing entry (NA values) of Id==2 and Time==2 is gone:
t2 %>% group_by(Id) %>% full_join(t1,by=c("Time","Cat"))
My understanding of group_by is that it groups by variable(s) and then carries out all my following mutating, mapping, etc. on each group. Is it supposed to behave this way?
After reading the documentation properly, I finally found the section that states that groups are ignored for the purpose of joining: see ?full_join.
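If a per-group join is what's needed (an assumption about the desired output), one hedged workaround is to split by the grouping variable first and join each piece separately, which reproduces the 6-row result of the two filtered joins above (group_split() requires dplyr >= 0.8; map_dfr() comes from purrr):
library(purrr)
t2 %>%
  group_split(Id) %>%
  map_dfr(~ full_join(.x, t1, by = c("Time", "Cat")))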

Better output with dplyr -- breaking functions and results

This is a long-standing question, but now I really want to solve this puzzle. I'm using dplyr all the time and I think it is great for summarising variables. However, I'm trying to display a pivot table with only partial success. dplyr always reports one single row with all results, which is annoying. I have to copy-paste the results to Excel to organize everything...
I got the code here and it is almost working. The result it produces should instead look like the table layout I always use to report my results (the screenshots of the current and the desired output are omitted here). Use this code to reproduce what I have so far:
library(tidyverse)
set.seed(123)
ds <- data.frame(group = c("american", "canadian"),
                 iq = rnorm(n = 50, mean = 100, sd = 15),
                 income = rnorm(n = 50, mean = 1500, sd = 300),
                 math = rnorm(n = 50, mean = 5, sd = 2))
ds %>%
  group_by(group) %>%
  summarise_at(vars(iq, income, math), funs(mean, sd)) %>%
  t %>%
  as.data.frame %>%
  rownames_to_column %>%
  separate(rowname, into = c("feature", "fun"), sep = "_")
To clarify, I've tried this code, but spread works with only one summary (mean or sd, etc.). Some people use gather(), but it's complicated to make gather() work together with group_by().
Thanks for any help.
Instead of transposing (t) and changing the class types, after the summarise step do a gather to change it to 'long' format, do some modifications with separate and unite, and then spread it back.
library(tidyverse)
ds %>%
  group_by(group) %>%
  summarise_at(vars(iq, income, math), funs(mean, sd)) %>%
  gather(key, val, iq_mean:math_sd) %>%
  separate(key, into = c('key1', 'key2')) %>%
  unite(group, group, key2) %>%
  spread(group, val)
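As a side note, gather()/spread() have since been superseded; a hedged sketch of the same reshaping with the newer tidyverse verbs (assumes dplyr >= 1.0 and tidyr >= 1.0):
ds %>%
  group_by(group) %>%
  summarise(across(c(iq, income, math), list(mean = mean, sd = sd))) %>%
  pivot_longer(-group, names_to = c("feature", "fun"), names_sep = "_") %>%
  pivot_wider(names_from = c(group, fun), values_from = value)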
