R group_by %>% full_join losing NA records

Consider these two data frames:
t1<-data.frame(Time=1:3,Cat=rep("A",3),SomeValue=rep("t1",3))
t2<-data.frame(Time=c(1,2,3,1,3),Cat=rep("A",5),Id=c(1,1,1,2,2),SomeOtherValue=c(1,2,3,4,5))
In my application, I need to do a full join and work with missing records/values. Doing partial full_join on subsets (grouping var) works, but I lose my missing values when I try the unfiltered approach.
These two calls together give me 6 records:
t2 %>% group_by(Id) %>% filter(Id==2) %>% full_join(t1,by=c("Time","Cat"))
t2 %>% group_by(Id) %>% filter(Id==1) %>% full_join(t1,by=c("Time","Cat"))
This, however, gives me only 5 records, where the missing entry (the NA values for Id == 2 and Time == 2) is gone:
t2 %>% group_by(Id) %>% full_join(t1,by=c("Time","Cat"))
My understanding of group_by() is that it groups by the given variable(s) and applies all my subsequent mutating, mapping, etc. to each group. Is it supposed to behave this way?

After reading the documentation properly, I finally found the section in ?full_join that states that groups are ignored for the purpose of joining.
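If you still want the per-group join, one option (a minimal sketch, not from the original answer; it assumes purrr is available alongside dplyr) is to split on the grouping variable, join each piece separately, and row-bind the results:
library(dplyr)
library(purrr)

t2 %>%
  group_split(Id) %>%                                   # one data frame per Id
  map_dfr(~ full_join(.x, t1, by = c("Time", "Cat")))   # join each piece, then row-bind
This reproduces the 6 records the two filtered joins produce, including the row of NA values for the missing Time == 2 entry of Id == 2.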

Related

Filter the first group after group_by

Sometimes it is handy to take a test case out of your data when working with group_by() from the dplyr library. I was wondering if there is any fast way to just grab the first group of a grouped dataframe and cast it to a new dataframe.
All I could come up with was this workaround:
library(dplyr)
smalldf <- mtcars %>% group_by(gear) %>% group_split(.) %>% .[[1]]
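A possible alternative (not from the original post; it assumes dplyr >= 1.0.0, where cur_group_id() is available) is to filter on the group index directly:
library(dplyr)

smalldf <- mtcars %>%
  group_by(gear) %>%
  filter(cur_group_id() == 1) %>%   # keep only the first group
  ungroup()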

Collapsing scattered information across multiple variables into 1 in R

I have some table data that has been scattered across around 1000 variables in a dataset. Most are split across 2 variables, and I can piece the data together using coalesce(); however, this is pretty inefficient for some variables that are instead spread across more than 10. Is there a better/more efficient way?
The syntax I have written so far is:
scattered_data <- df %>%
  select(id, contains("MASS9A_E2")) %>%   # brings in all the variables for this one question that start with this string
  mutate(speciality = coalesce(
    MASS9A_E2_C4_1, MASS9A_E2_C4_2, MASS9A_E2_C4_3, MASS9A_E2_C4_4, MASS9A_E2_C4_5, MASS9A_E2_C4_6,
    MASS9A_E2_C4_7, MASS9A_E2_C4_8, MASS9A_E2_C4_9, MASS9A_E2_C5_1, MASS9A_E2_C5_2, MASS9A_E2_C5_3,
    MASS9A_E2_C5_4, MASS9A_E2_C5_5, MASS9A_E2_C5_6, MASS9A_E2_C5_7, MASS9A_E2_C5_8, MASS9A_E2_C5_9
  ))
I have 28 of these MASS questions, so I would really love to be able to collapse them down a bit more quickly.
You can use do.call() to pass all columns except id as the input to coalesce().
library(dplyr)
df %>%
  select(id, contains("MASS9A_E2")) %>%
  # note: select(df, -id) refers back to the original df, so this assumes df
  # contains only id and the MASS9A_E2 columns
  mutate(speciality = do.call(coalesce, select(df, -id)))
In addition, you can apply coalesce() iteratively via Reduce():
df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = Reduce(coalesce, select(df, -id)))
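To cover all 28 MASS questions without writing each coalesce() call by hand, one option (a rough sketch, not from the original answer; the prefixes vector is hypothetical and must be replaced with your real column-name prefixes) is to loop over the question prefixes and coalesce each set of matching columns:
library(dplyr)

# hypothetical: the 28 question prefixes used in your data
prefixes <- c("MASS9A_E2", "MASS10A_E2")

collapsed <- select(df, id)
for (p in prefixes) {
  # coalesce every column whose name starts with this prefix into a single column
  collapsed[[p]] <- do.call(coalesce, select(df, starts_with(p)))
}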

Select last row within each group with dplyr is slow

I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.
I am not sure what I am doing wrong with my code since there is no for loop here.
epc2 is a data frame with 324,368 rows.
epc3 <- epc2 %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  do(tail(., 1))
Thank you for any and all of your help.
How about:
mtcars %>%
  arrange(cyl) %>%
  group_by(cyl) %>%
  slice(n())
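Applied to the question's data, that would be something like the following (a sketch, assuming epc2 has the columns shown in the question); replacing do(tail(., 1)) with slice() avoids the slow per-group do() call:
epc3 <- epc2 %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  slice(n())   # keep only the last row of each group
On recent dplyr versions (>= 1.0.0), slice_tail(n = 1) does the same thing.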

Trying to understand dplyr function - group_by

I am trying to understand the way the group_by() function works in dplyr. I am using the airquality data set that comes with the datasets package.
My understanding is that if I do the following, it should arrange the records in increasing order of the Temp variable:
airquality_max1 <- airquality %>% arrange(Temp)
I see that this is the case in airquality_max1. I now want to arrange the records in increasing order of Temp, but grouped by Month. So the end result should first have all the records for Month == 5 in increasing order of Temp, then all records for Month == 6 in increasing order of Temp, and so on. So I use the following command:
airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)
However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.
I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?
More than solving the problem of sorting the data frame by columns, I am trying to understand the behaviour of group_by(), since I want to use this example to explain its application to someone.
arrange() ignores group_by(); see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:
airquality %>% arrange(Month, Temp)
For a grouped data frame, you can also set .by_group = TRUE to sort by the grouping variable first:
airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
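Both approaches sort by Month and then Temp (and arrange() uses a stable sort), so they should give the same row order; a quick sanity check, not from the original answer:
library(dplyr)

a <- airquality %>% arrange(Month, Temp)
b <- airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE) %>% ungroup()
identical(a$Temp, b$Temp) && identical(a$Month, b$Month)   # expected to be TRUE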

Arrange a grouped_df by group variable not working

I have a data.frame that contains client names, years, and several revenue numbers from each year.
df <- data.frame(
  client = rep(c("Client A", "Client B", "Client C"), 3),
  year   = rep(c(2014, 2013, 2012), each = 3),
  rev    = rep(c(10, 20, 30), 3)
)
I want to end up with a data.frame that aggregates the revenue by client and year. I then want to sort the data.frame by year then by descending revenue.
library(dplyr)
df1 <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))
However, when using the code above, the arrange() function doesn't change the order of the grouped data.frame at all. When I run the code below and coerce to a normal data.frame, it works.
library(dplyr)
df1 <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  data.frame() %>%
  arrange(year, desc(tot))
Am I missing something or will I need to do this every time when trying to arrange a grouped_df by a grouped variable?
R Version: 3.1.1
dplyr package version: 0.3.0.2
EDIT 11/13/2017:
As noted by lucacerone, beginning with dplyr 0.5, arrange once again ignores groups when sorting. So my original code now works in the way I initially expected it would.
arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
Try switching the order of your group_by statement:
df %>%
  group_by(year, client) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))
I think arrange() is ordering within groups; after summarize(), the last grouping level is dropped, so in your first example it's arranging rows within the client groups. Switching the order to group_by(year, client) seems to fix it because the client grouping gets dropped after summarize().
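You can see which grouping survives summarise() with group_vars() (a small check, not from the original answer; group_vars() assumes a reasonably recent dplyr):
library(dplyr)

df %>% group_by(client, year) %>% summarise(tot = sum(rev)) %>% group_vars()
# returns "client": the last grouping level (year) has been dropped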
Alternatively, there is the ungroup() function:
df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  ungroup() %>%
  arrange(year, desc(tot))
Edit by @lucacerone: since dplyr 0.5 this is no longer necessary:
Breaking changes: arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it's not going to change again, as more changes will just cause more confusion.
The latest versions of dplyr (at least from dplyr_0.7.4) allow you to arrange within groups. You just have to set .by_group = TRUE in the arrange() call. More information is available in ?arrange.
In your example, try:
library(dplyr)
df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  arrange(desc(tot), .by_group = TRUE)
