Select last row within each group with dplyr is slow - r

I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.
I am not sure what I am doing wrong with my code since there is no for loop here.
epc2 is a data frame with 324,368 rows.
epc3 <- epc2 %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  do(tail(., 1))
Thank you for any and all of your help.

How about slice(n()), which keeps the last row of each group:
mtcars %>%
  arrange(cyl) %>%
  group_by(cyl) %>%
  slice(n())
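A likely reason the original is slow is that do() calls tail() on a separate data frame for every group, which adds up when there are many distinct ids. The sketch below uses mtcars as a stand-in for epc2 (sorting by mpg and grouping by cyl are illustrative choices, not the asker's columns) and shows slice(n()) next to slice_tail(), which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sort first, then keep the last row of each group with slice(n()).
last_by_slice <- mtcars %>%
  arrange(mpg) %>%
  group_by(cyl) %>%
  slice(n()) %>%
  ungroup()

# slice_tail() (assumes dplyr >= 1.0.0) states the same intent directly.
last_by_tail <- mtcars %>%
  arrange(mpg) %>%
  group_by(cyl) %>%
  slice_tail(n = 1) %>%
  ungroup()
```

Both results have one row per cyl value. Because the data are sorted by mpg before grouping, that row is the one with the highest mpg in its group.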

Related

Filter the first group after group_by

Sometimes it is handy to take a test case out of your data when working with group_by() from the dplyr library. I was wondering if there is any fast way to just grab the first group of a grouped dataframe and cast it to a new dataframe.
All I could come up with was this workaround:
library(dplyr)
smalldf <- mtcars %>% group_by(gear) %>% group_split(.) %>% .[[1]]
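One possible alternative to the workaround, assuming dplyr 1.0.0 or later, is to filter on cur_group_id(), which numbers the groups in their sorted order. This is only a sketch; the name first_group is made up for illustration.

```r
library(dplyr)

# Keep only the rows that belong to the first group, then drop the grouping.
first_group <- mtcars %>%
  group_by(gear) %>%
  filter(cur_group_id() == 1) %>%
  ungroup()
```

For mtcars grouped by gear, the first group is gear == 3, so first_group contains the same rows as the group_split() workaround.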

Tidyverse: Unnest tibble by group into separate df/tibble

I am currently trying to find a short & tidy way to unnest a nested tibble with 2 grouping variables and a tibble/df as data for each observation into a tibble having only one of the grouping variables and the respective data in a df (or tibble). I will illustrate my sample by using the starwars dataset provided by tidyverse and show the 3 solutions I came up with so far.
library(tidyverse)
#Set example data: 2 grouping variables, name & sex, and one data column with a tibble/df for each observation
tbl_1 <- starwars %>% group_by(name, sex) %>% nest() %>% ungroup()
#1st Solution: Neither short nor tidy but gets me the result I would like to have in the end
tbl_2 <- tbl_1 %>%
  group_by(sex) %>%
  nest() %>%
  ungroup() %>%
  mutate(data = map(.$data, ~.x %>% group_by(name) %>% unnest(c(data))))
#2nd Solution: A lot shorter and more neat but still not what I have in mind
tbl_2 <- tbl_1 %>%
  nest(-sex) %>%
  mutate(data = map(.$data, ~.x %>% unnest(cols = c(data))))
#3rd Solution: The best so far, short and readable
tbl_2 <- tbl_1 %>%
  unnest(data) %>%
  group_by(name) %>%
  nest(-sex)
##Solution as I have it in mind / I think should be somehow possible.
tbl_2 <- tbl_1 %>% group_by(sex) %>% unnest() #This however gives one large tibble grouped by sex, not two separate tibbles in a nested tibble
Is the solution I am looking for even possible in the first place, or is the 3rd solution as close as it gets to being short, readable and tidy?
In terms of my actual workflow, tbl_1 is the "workhorse" of my analysis and not subject to change. I use it to apply analyses or ggplot via map for figures etc., which are sometimes at the level of "name" and sometimes at the level of "sex".
I appreciate any input!
Update:
User @caldwellst has given an answer good enough for me to mark this question as answered, unfortunately only as a comment. After waiting a bit, I would now accept any other answer with the same suggestion as the solution to mark this question as solved.
As @caldwellst has pointed out in a comment, the group_by is unnecessary. Without it, the solution is short and tidy enough for me:
tbl_1 %>% unnest(data) %>% nest(data = -sex)
I will remove my answer and accept a different one if @caldwellst posts the comment as an answer or somebody else provides a different but equally suitable one.
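The accepted suggestion can be run end to end as below. This is a minimal sketch that assumes tidyr 1.0.0 or later for the nest(data = -sex) syntax; tbl_1 is built exactly as in the question.

```r
library(dplyr)
library(tidyr)

# Example data from the question: one nested tibble per (name, sex) pair.
tbl_1 <- starwars %>%
  group_by(name, sex) %>%
  nest() %>%
  ungroup()

# Flatten completely, then re-nest everything except sex.
tbl_2 <- tbl_1 %>%
  unnest(data) %>%
  nest(data = -sex)
```

tbl_2 has one row per value of sex (NA counts as its own value), and each element of its data column holds all remaining columns, including name.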

R group_by %>% full_join losing NA records

Consider these two data frames:
t1<-data.frame(Time=1:3,Cat=rep("A",3),SomeValue=rep("t1",3))
t2<-data.frame(Time=c(1,2,3,1,3),Cat=rep("A",5),Id=c(1,1,1,2,2),SomeOtherValue=c(1,2,3,4,5))
In my application, I need to do a full join and work with missing records/values. Doing partial full_join on subsets (grouping var) works, but I lose my missing values when I try the unfiltered approach.
This will give me 6 records
t2 %>% group_by(Id) %>% filter(Id==2) %>% full_join(t1,by=c("Time","Cat"))
t2 %>% group_by(Id) %>% filter(Id==1) %>% full_join(t1,by=c("Time","Cat"))
This will give me 5, where the missing entry (NA values) of Id==2 and Time==2 is gone:
t2 %>% group_by(Id) %>% full_join(t1,by=c("Time","Cat"))
My understanding of group_by is that it groups by the given variable(s) and then applies all my subsequent mutating, mapping, etc. to each group. Is it supposed to behave this way?
After reading the documentation properly, I finally found the section of ?full_join that states that groups are ignored for the purpose of joining.
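Since the join ignores groups, one way to get a join per group is to run it inside group_modify(). The sketch below is one possible approach rather than the documented remedy; it reuses t1 and t2 from the question, and the name joined is chosen for illustration.

```r
library(dplyr)

t1 <- data.frame(Time = 1:3, Cat = rep("A", 3), SomeValue = rep("t1", 3))
t2 <- data.frame(Time = c(1, 2, 3, 1, 3), Cat = rep("A", 5),
                 Id = c(1, 1, 1, 2, 2), SomeOtherValue = c(1, 2, 3, 4, 5))

# Run the full join separately within each Id group.
# .x is the group's data without the grouping column Id.
joined <- t2 %>%
  group_by(Id) %>%
  group_modify(~ full_join(.x, t1, by = c("Time", "Cat"))) %>%
  ungroup()
```

This yields 6 rows: three for each Id, with SomeOtherValue missing for Id 2 at Time 2.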

counting grouped missing values in R

Sorry if this question is fairly simple.
I am new to R and I want to count, by group, the number of missing values in the column some_column (in my dataset they have been replaced by 0), and then find the group with the most 0 values. So I did this (using the dplyr package):
missing_data <- group_by(some_data,some_group, count=sum(some_column==0))
But what is weird is that the count column contains the same number throughout the dataset, as if the dataset were not grouped. Does anyone have an idea?
OK, I got it:
some_data %>% group_by(some_group) %>% summarise(count=sum(some_column==0))
Keeping with dplyr verbs:
missing_data <- filter(some_data, some_column == 0) %>%
  group_by(some_group) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
Here is an example using the mtcars data frame:
count_zero <- function(x) {
  sum(x == 0, na.rm = TRUE)
}
aggregate(mtcars, list(cyl = mtcars$cyl), count_zero)
Here is the final answer:
some_data %>% group_by(some_group) %>% summarise(count=sum(some_column==0)) %>% arrange(desc(count))

What does n=n( ) mean in R?

The other day I was reading the following lines of R and I don't understand what %>%, summarise(n=n()) and summarise(total=n()) mean. I do understand the group_by and ungroup methods, though.
Can someone help out? There isn't any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>% summarise(n=n()) %>%
ungroup() %>% group_by(n) %>% summarise(total=n())
This is from the dplyr package. n=n() means that a variable named n will be assigned the number of rows (think number of observations) in each group of the summarized data.
The %>% is read as "and then" and is a way of listing your functions sequentially rather than nesting them. So that command says: do the grouping, then summarize each group by its number of rows, then ungroup that result, then group the ungrouped data by n, and then summarize that by the number of rows in each of the new groups.
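The two-step count can be reproduced on a built-in dataset. In this sketch, mtcars stands in for net, and cyl and gear stand in for nodeid and epoch; the names multiplicity and nested are illustrative only.

```r
library(dplyr)

# Step 1: n = number of rows in each (cyl, gear) group.
# Step 2: total = number of groups that share the same n.
multiplicity <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  group_by(n) %>%
  summarise(total = n())

# The first step written without the pipe, as a nested call.
nested <- summarise(group_by(mtcars, cyl, gear), n = n())
```

Each row of multiplicity says how many (cyl, gear) combinations contain exactly n cars, so multiplying n by total and summing recovers the number of rows in mtcars.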
