Get last row of each group in R [duplicate] - r

I have some data similar in structure to:
a <- data.frame("ID" = c("A", "A", "B", "B", "C", "C"),
"NUM" = c(1, 2, 4, 3, 6, 9),
"VAL" = c(1, 0, 1, 0, 1, 0))
I am trying to sort it by ID and NUM and then get the last row of each ID group.
The code below collapses the data to one row per unique ID, but it returns only the maximum NUM rather than the full last row that I want.
a <- a %>% arrange(ID, NUM) %>%
group_by(ID) %>%
summarise(max(NUM))
I understand why this code doesn't work, but I am looking for the dplyr way of getting the last row for each unique ID.
Expected Results:
ID NUM VAL
<fct> <dbl> <dbl>
1 A 2 0
2 B 4 1
3 C 9 0
Note: I will admit that though it is nearly a duplicate of Select first and last row from grouped data, the answers on that thread were not quite what I was looking for.

You might try:
a %>%
group_by(ID) %>%
arrange(NUM) %>%
slice(n())

One dplyr option could be:
a %>%
arrange(ID, NUM) %>%
group_by(ID) %>%
summarise_all(last)
ID NUM VAL
<fct> <dbl> <dbl>
1 A 2. 0.
2 B 4. 1.
3 C 9. 0.
Or since dplyr 1.0.0:
a %>%
arrange(ID, NUM) %>%
group_by(ID) %>%
summarise(across(everything(), last))
Or using slice_max():
a %>%
group_by(ID) %>%
slice_max(order_by = NUM, n = 1)
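One caveat with slice_max(): by default it keeps every row tied for the maximum, so a group can return more than one row. A minimal sketch, assuming dplyr 1.0.0 or later and a small made-up data frame (tied is a hypothetical name, not from the question), showing with_ties = FALSE to force exactly one row per group:

```r
library(dplyr)

# Hypothetical data: group A has two rows tied on the maximum NUM
tied <- data.frame(ID  = c("A", "A", "B"),
                   NUM = c(2, 2, 5),
                   VAL = c(1, 0, 1))

# Default behaviour keeps both tied rows for group A
with_ties_kept <- tied %>%
  group_by(ID) %>%
  slice_max(order_by = NUM, n = 1) %>%
  ungroup()

# with_ties = FALSE keeps a single row per group
one_per_group <- tied %>%
  group_by(ID) %>%
  slice_max(order_by = NUM, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Which of the tied rows survives depends on row order, so arrange() first if that matters.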

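By default, tail() returns the last 6 items of a subsettable object. With aggregate(), extra arguments for FUN are supplied after the function, separated by commas; here 1 is passed as n = 1, which tells tail() to return only the last item. Note that a is not sorted by NUM here, so group B yields NUM 3 rather than the maximum 4; sort by ID and NUM first to match the expected result.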
aggregate(a[, c('NUM', 'VAL')], list(a$ID), tail, 1)
# Group.1 NUM VAL
# 1 A 2 0
# 2 B 3 0
# 3 C 9 0
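The aggregate() output above follows the existing row order, which is why group B shows NUM 3. A base R sketch that sorts first and then keeps the last row per ID; this swaps in order() and duplicated(fromLast = TRUE) rather than aggregate(), and a_sorted and last_rows are illustrative names:

```r
a <- data.frame(ID  = c("A", "A", "B", "B", "C", "C"),
                NUM = c(1, 2, 4, 3, 6, 9),
                VAL = c(1, 0, 1, 0, 1, 0))

# Sort by ID, then NUM, so the last row of each group has the largest NUM
a_sorted <- a[order(a$ID, a$NUM), ]

# duplicated(..., fromLast = TRUE) is FALSE only for the final row of each ID
last_rows <- a_sorted[!duplicated(a_sorted$ID, fromLast = TRUE), ]
```

This keeps whole rows together, whereas aggregate() applies the function to each column separately.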

You can use top_n (superseded by slice_max() since dplyr 1.0.0). Grouping already sorts the result by ID, and sorting by NUM isn't necessary since only 1 value per group is kept.
library(dplyr)
a %>%
group_by(ID) %>%
top_n(1, NUM)
# # A tibble: 3 x 3
# # Groups: ID [3]
# ID NUM VAL
# <fct> <dbl> <dbl>
# 1 A 2 0
# 2 B 4 1
# 3 C 9 0

Related

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count which simplifies the group_by + summarise step
library(dplyr)
myData %>%
filter(Element == 'A') %>%
count(Element, name = 'counted')
Or with just summarise and sum
myData %>%
summarise(counted = sum(Element == 'A'), Element = 'A') %>%
relocate(Element, .before = 1)
Element counted
1 A 4
Another option using tally like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
filter(Element == "A") %>%
group_by(Element) %>%
tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)
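The answers above all return a data frame, while the question asks for the bare number. A short sketch of two ways to get a single value, assuming the same myData; n_a and n_a_dplyr are illustrative names:

```r
library(dplyr)

myData <- data.frame(Element = c("A", "A", "C", "A", "B", "B", "A"),
                     Class   = c(0, 0, 0, 0, 1, 1, 2))

# Base R: a logical vector sums to the number of TRUE values
n_a <- sum(myData$Element == "A")

# dplyr: summarise to one row, then pull() the column out as a plain vector
n_a_dplyr <- myData %>%
  summarise(counted = sum(Element == "A")) %>%
  pull(counted)
```

pull() is what turns the one-cell data frame into a value that can be passed on to further calculations.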

Interpolation of values from list

I have a dataframe containing the results of a competition. In this example competitors b and c have tied for second place. The actual dataframe is very large and could contain multiple ties.
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
I also have point values for the respective places, where first place gets 4 points, 2nd gets 3, 3rd gets 1 and 4th gets 0.
points <- c(4, 3, 1, 0)
names(points) <- 1:4
I can match points to place to get each competitor's score
df %>%
mutate(score = points[place])
name place score
1 a 1 4
2 b 2 3
3 c 2 3
4 d 4 0
What I would like to do though is award points to b and c that are the mean of the point values for 2nd and 3rd, such that each receives 2 points like this:
name place score
1 a 1 4
2 b 2 2
3 c 2 2
4 d 4 0
How can I accomplish this programmatically?
A solution using nested data frames and purrr.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)
names(points) <- 1:4
# a function to help expand the dataframe based on the number of ties
expand_all <- function(x,n){
x:(x+n-1)
}
df %>%
group_by(place) %>%
tally() %>%
mutate(new_place = purrr::map2(place,n, expand_all)) %>%
unnest(new_place) %>%
mutate(score = points[new_place]) %>%
group_by(place) %>%
summarize(score = mean(score)) %>%
inner_join(df)
Robert Wilson's answer gave me an idea. Rather than mapping over nested dataframes the rank function from base can get to the same result
df %>%
mutate(new_place = rank(place, ties.method = "first")) %>%
mutate(score = points[new_place]) %>%
group_by(place) %>%
summarize(score = mean(score)) %>%
inner_join(df)
place score name
<dbl> <dbl> <chr>
1 1 4 a
2 2 2 b
3 2 2 c
4 4 0 d
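This can be accomplished in a few lines with an ifelse() call inside of a mutate(). Note that the hard-coded + 1 only reproduces the example's two-way tie for 2nd place (it stands in for the 3rd-place value of 1) and does not generalize to other ties: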
df %>%
group_by(place) %>%
mutate(n_ties = n()) %>%
ungroup %>%
mutate(score = (points[place] + ifelse(n_ties > 1, 1, 0))/ n_ties)
# A tibble: 4 x 4
name place n_ties score
<chr> <dbl> <int> <dbl>
1 a 1 1 4
2 b 2 2 2
3 c 2 2 2
4 d 4 1 0
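The rank()-based answer above handles ties of any size, because rank(ties.method = "first") assigns each tied competitor its own consecutive slot. A base R sketch of the same idea, using ave() in place of the group_by/summarise/join steps (a swapped-in shortcut, not the answerer's code); slot is an illustrative name:

```r
df <- data.frame(name  = letters[1:4],
                 place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)

# Give tied competitors consecutive slots: b and c get slots 2 and 3
slot <- rank(df$place, ties.method = "first")

# Average the points of the occupied slots within each tied place
df$score <- ave(points[slot], df$place, FUN = mean)
```

With a three-way tie for first, the same code would average the points for slots 1 to 3.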

How to find duplicates based on values in 2 columns but also the groupings by another column in R?

I have a dataset with 3 columns: ID, value a, and value b. I want to group the dataset based on the values in the ID column and then identify duplicates that have identical data in the value a and b columns between the different groupings.
I know that I can use the dplyr package and data %>% group_by(ID) to group my dataset based on the ID column. I also know that I can use data[duplicated(data[,2:3]),] to return all rows with duplicate data in columns 2 (value a) and 3 (value b).
However, I would like a function that finds duplicates only between different ID groups, rather than duplicates anywhere in the whole dataset. I've tried combining group_by and duplicated, but it doesn't return the correct results. Which function would do this?
It was a little unclear if you wanted to return:
only the distinct rows
single examples of duplicated rows
all duplicated rows
So here are some options:
library(dplyr)
library(readr)
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat
# return only distinct rows
exp_dat %>%
distinct(ID, a, b)
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 2
# 3 2 1 1
# 4 2 1 2
# return single examples of duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 1 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# return all duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
add_count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 2 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 1
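The options above group by ID together with a and b, so they find repeats within an ID. If the goal is (a, b) combinations shared between different IDs, as the question describes, one possible sketch is to group by a and b only and keep combinations seen under more than one ID. The extra row (2, 3, 3) is a hypothetical addition so that one combination is unique to a single ID:

```r
library(dplyr)

exp_dat <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                      a  = c(1, 1, 1, 1, 1, 3),
                      b  = c(1, 1, 2, 1, 2, 3))

cross_dups <- exp_dat %>%
  distinct(ID, a, b) %>%          # drop repeats within the same ID
  group_by(a, b) %>%
  filter(n_distinct(ID) > 1) %>%  # keep combinations present in 2+ IDs
  ungroup()
```

Here the (3, 3) combination is dropped because it occurs only under ID 2.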

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number())) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3), fill = "extend") %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package:
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Because of the argument partial = TRUE, windows with fewer than three observations (the first rows of each group) are summed as well.
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))
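For groups with fewer than three rows, a dependency-free alternative is to compute the partial rolling sum from cumulative sums. The sketch below is base R only and does not use zoo; partial_rollsum is a hypothetical helper name, and the second group (ID 2) is made-up data added to show a short group:

```r
# Right-aligned rolling sum over up to k values, shrinking at the start
partial_rollsum <- function(x, k = 3) {
  cs <- cumsum(x)
  # Cumulative sum k positions earlier, with zeros before the series starts
  lagged <- c(rep(0, k), cs)[seq_along(x)]
  cs - lagged
}

df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2),
                 VAL = c(2, 1, 3, 4, 5, 6))

# ave() applies the helper within each ID group and keeps row order
df$SUM <- ave(df$VAL, df$ID, FUN = partial_rollsum)
```

Because the helper never needs a full window, it works for groups of any size, including a single row.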

How can we apply tidyr:: spread() to all categorical variables at once creating new columns for each level of each categorical variable? [duplicate]

I have a dataframe with 3 categorical variables (x,y,z) along with an ID column :
df <- frame_data(
~id, ~x, ~y, ~z,
1, "a", "c" ,"v",
1, "b", "d", "f",
2, "a", "d", "v",
2, "b", "d", "v")
I want to apply spread() to each of the categorical variables, grouped by ID.
Output should be like this:
id a b c d v f
1 1 1 1 1 1 1
2 1 1 0 2 2 0
I tried doing it, but I was only able to do it for one variable at a time, not for all of them together.
For example, applying spread only to the y column (it can be done similarly for x and z separately, but not together in a single line):
df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Explaining my code in three steps:
Step 1: count frequency
df %>% count(id,y)
id y n
<dbl> <chr> <int>
1.00 c 1
1.00 d 1
2.00 d 2
Step 2 : applying spread()
df %>% count(id,y) %>% spread(y,n)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1 1.00 1 1
2 2.00 NA 2
Step 3: Adding fill = 0 replaces the NA, which means there was zero occurrence of c in the y column for id 2 (as you can see in df)
df %>% count(id,y) %>% spread(y,n,fill=0)
# A tibble: 2 x 3
id c d
<dbl> <int> <int>
1.00 1 1
2.00 0 2
Problem: In my actual data set, I have 20 such categorical variables, so I can't do it one by one. I am looking to do it all at once.
Is it possible to apply spread() in tidyr to all of the categorical variables together? If not, can you please suggest an alternative?
Note: I also tried these answers, but they were not helpful for this particular case:
R spreading multiple columns with tidyr
Is it possible to use spread on multiple columns in tidyr similar to dcast?
Can spread() in tidyr spread across multiple value?
Expanding columns associated with a categorical variable into multiple columns with dplyr/tidyr while retaining id variable
Additional related question:
It is possible that two categorical columns (e.g. in a survey dataset) have the same values, like below.
df <- frame_data(
~id, ~Do_you_Watch_TV, ~Do_you_Drive,
1, "yes", "yes",
1, "yes", "no",
2, "yes", "no",
2, "no", "yes")
# A tibble: 4 x 3
id Do_you_Watch_TV Do_you_Drive
<dbl> <chr> <chr>
1 1.00 yes yes
2 1.00 yes no
3 2.00 yes no
4 2.00 no yes
Running the code below does not differentiate the counts of yes and no between 'Do_you_Watch_TV' and 'Do_you_Drive':
df %>% gather(Key, value, -id) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
id no yes
1 1 3
2 2 2
Whereas the expected output should be:
id Do_you_Watch_TV_no Do_you_Watch_TV_yes Do_you_Drive_no Do_you_Drive_yes
1 0 2 1 1
2 1 1 1 1
So we need to treat no and yes from Do_you_Watch_TV and Do_you_Drive separately by adding a prefix: Do_you_Drive_yes, Do_you_Drive_no, Do_you_Watch_TV_yes, Do_you_Watch_TV_no.
How can we achieve this?
Thanks
You first need to convert your data frame to long format before you can transform it to wide format. So start by using tidyr::gather to convert the data frame to long format. Afterwards, you have a couple of options:
Option#1: Using tidyr::spread:
#data
df <- frame_data(
~id, ~x, ~y, ~z,
1, "a", "c" ,"v",
1, "b", "d", "f",
2, "a", "d", "v",
2, "b", "d", "v")
library(tidyverse)
df %>% gather(Key, value, -id) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Option#2: Another option is to use reshape2::dcast:
library(tidyverse)
library(reshape2)
df %>% gather(Key, value, -id) %>%
dcast(id~value, fun.aggregate = length)
# id a b c d f v
# 1 1 1 1 1 1 1 1
# 2 2 1 1 0 2 0 2
Edited: To include solution for 2nd data frame.
#Data
df1 <- frame_data(
~id, ~Do_you_Watch_TV, ~Do_you_Drive,
1, "yes", "yes",
1, "yes", "no",
2, "yes", "no",
2, "no", "yes")
library(tidyverse)
df1 %>% gather(Key, value, -id) %>% unite("value", c(Key, value)) %>%
group_by(id, value) %>%
summarise(count = n()) %>%
spread(value, count, fill = 0) %>%
as.data.frame()
# id Do_you_Drive_no Do_you_Drive_yes Do_you_Watch_TV_no Do_you_Watch_TV_yes
# 1 1 1 1 0 2
# 2 2 1 1 1 1
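In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same prefixed count table with the newer functions, assuming tidyr 1.1.0 or later for the scalar values_fill; the names key, value, and wide are illustrative:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id              = c(1, 1, 2, 2),
                  Do_you_Watch_TV = c("yes", "yes", "yes", "no"),
                  Do_you_Drive    = c("yes", "no", "no", "yes"))

wide <- df1 %>%
  # Long format: one row per id / question / answer
  pivot_longer(-id, names_to = "key", values_to = "value") %>%
  count(id, key, value) %>%
  # names_from with two columns joins them with "_" by default
  pivot_wider(names_from = c(key, value), values_from = n, values_fill = 0)
```

Passing both key and value to names_from replaces the separate unite() step.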
