Select second largest row by group in R

I have this problem
library(dplyr)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
For each id, I want to keep only the row with the second largest value of var1. I tried different approaches (dplyr and base R):
problem %>%
group_by(id) %>%
slice_tail(2, -var1)
problem[with(problem, ave(var1, id, FUN = function(x) x == tail(sort(x), 2)[1])), ]
The first snippet doesn't work, and the second gives the wrong answer.
What am I doing wrong?

problem |> group_by(id) %>% arrange(var1) %>% slice(n()-1)
n() counts the number of rows in each group, so slice(n()-1) takes the second-to-last row, which after sorting by var1 is the second largest. Note that this will cause issues for groups with fewer than 2 rows, so you may wish to allow for that.
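One way to allow for small groups is sketched below, assuming that groups with fewer than two rows should simply be dropped. The data frame problem2 and its extra id 3 are made up for illustration.

```r
library(dplyr)

# Hypothetical data: same as `problem`, plus an id (3) with a single row
problem2 <- data.frame(id   = c(1, 1, 1, 2, 2, 2, 3),
                       var1 = c(5, 4, 3, 6, 5, 4, 7),
                       var2 = c(99, 12, 32, 88, 9, 8, 1))

second_largest <- problem2 %>%
  group_by(id) %>%
  filter(n() >= 2) %>%                         # drop groups with no second row
  arrange(desc(var1), .by_group = TRUE) %>%    # largest var1 first within each id
  slice(2) %>%                                 # second row = second largest
  ungroup()

second_largest
```

If single-row groups should be kept instead (for example as a row of NA), the filter step would need to be replaced by something else.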

If you wish to use slice, you can first use slice_max() to keep the two largest rows, then slice_tail() to drop the largest one.
library(dplyr)
problem %>%
group_by(id) %>%
slice_max(var1, n = 2) %>%
slice_tail(n = 1)
Or you can use a single filter:
problem %>% group_by(id) %>% filter(var1 == max(var1[var1 != max(var1)]))
Output
# A tibble: 2 × 3
# Groups:   id [2]
     id  var1  var2
  <dbl> <dbl> <dbl>
1     1     4    12
2     2     5     9
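The position-based and value-based approaches agree on the example data, but they can differ when var1 has ties. A small sketch, using a made-up data frame called ties:

```r
library(dplyr)

# Hypothetical data with a tie for the largest var1
ties <- data.frame(id = c(1, 1, 1), var1 = c(5, 5, 3))

# Position-based: second row after sorting in descending order
by_position <- ties %>%
  group_by(id) %>%
  arrange(desc(var1), .by_group = TRUE) %>%
  slice(2) %>%
  ungroup()

# Value-based: largest value that is strictly below the maximum
by_value <- ties %>%
  group_by(id) %>%
  filter(var1 == max(var1[var1 != max(var1)])) %>%
  ungroup()

by_position$var1  # 5 (the tied maximum counts as the "second" row)
by_value$var1     # 3 (the second largest distinct value)
```

Which of the two is right depends on what "second largest" should mean for your data.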

If you have a large volume of data, here is a data.table approach.
library(data.table)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
setDT(problem)
setorder(problem, id, -var1)
problem[, .SD[2], by=id]
As @Paul Stafford Allen's comment notes, you will have an issue with groups of only one row.
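A possible guard for single-row groups in data.table is sketched here, assuming such groups should be dropped. The table dt and its extra id 3 are hypothetical.

```r
library(data.table)

# Hypothetical data: id 3 has only one row
dt <- data.table(id   = c(1, 1, 1, 2, 2, 2, 3),
                 var1 = c(5, 4, 3, 6, 5, 4, 7),
                 var2 = c(99, 12, 32, 88, 9, 8, 1))

setorder(dt, id, -var1)                    # largest var1 first within each id

# Return the second row only when the group has at least two rows;
# a group whose expression evaluates to NULL contributes no rows
res <- dt[, if (.N >= 2) .SD[2], by = id]
res
```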

After arranging by 'var1' in descending order, use slice with 2
library(dplyr)
problem %>%
arrange(id, desc(var1)) %>%
group_by(id) %>%
slice(2) %>%
ungroup()
-output
# A tibble: 2 × 3
     id  var1  var2
  <dbl> <dbl> <dbl>
1     1     4    12
2     2     5     9

Related

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
  Element Class
1       A     0
2       A     0
3       C     0
4       A     0
5       B     1
6       B     1
7       A     2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count, which replaces the group_by + summarise step:
library(dplyr)
myData %>%
filter(Element == 'A') %>%
count(Element, name = 'counted')
Or with just summarise and sum
myData %>%
summarise(counted = sum(Element == 'A'), Element = 'A') %>%
relocate(Element, .before = 1)
Element counted
1 A 4
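If only the bare number is needed for further processing, one option is to pull the column out of the one-row summary. This is a minimal sketch, and the name n_A is made up for illustration:

```r
library(dplyr)

myData <- data.frame(Element = c("A", "A", "C", "A", "B", "B", "A"),
                     Class   = c(0, 0, 0, 0, 1, 1, 2))

# summarise() gives a one-row data frame; pull() extracts the column as a vector
n_A <- myData %>%
  summarise(counted = sum(Element == "A")) %>%
  pull(counted)

n_A
```

Here n_A is a plain integer vector of length one rather than a data frame.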
Another option using tally like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
filter(Element == "A") %>%
group_by(Element) %>%
tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)

Creating counts of subset with dplyr

I'm trying to summarize a data set with not only total counts per group, but also counts of subsets. So starting with something like this:
df <- data.frame(
Group=c('A','A','B','B','B'),
Size=c('Large','Large','Large','Small','Small')
)
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n())
I can get a summary of the number of observations for each group:
> df_summary
# A tibble: 2 x 2
  Group group_n
  <chr>   <int>
1 A           2
2 B           3
Is there any way I can add some sort of subsetting information to n() to get, say, a count of how many observations per group were Large in this example? In other words, ending up with something like:
Group group_n Large_n
1 A 2 2
2 B 3 1
Thank you!
We could use count:
count(xyz) is the same as group_by(xyz) %>% summarise(n = n())
library(dplyr)
df %>%
count(Group, Size)
Group Size n
1 A Large 2
2 B Large 1
3 B Small 2
OR
library(dplyr)
library(tidyr)
df %>%
count(Group, Size) %>%
pivot_wider(names_from = Size, values_from = n)
  Group Large Small
  <chr> <int> <int>
1 A         2    NA
2 B         1     2
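The NA for group A means there were no Small rows in that group. If a zero count is preferred, pivot_wider() has a values_fill argument. A sketch, with the result stored in a made-up name wide:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group = c('A', 'A', 'B', 'B', 'B'),
  Size  = c('Large', 'Large', 'Large', 'Small', 'Small')
)

# values_fill supplies the value for Group/Size combinations that never occur
wide <- df %>%
  count(Group, Size) %>%
  pivot_wider(names_from = Size, values_from = n, values_fill = 0)

wide
```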
I approach this problem using an ifelse and a sum:
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n(),
Large_n = sum(ifelse(Size == "Large", 1, 0)))
The last line turns Size into a binary indicator taking the value 1 if Size == "Large" and 0 otherwise. Summing this indicator is equivalent to counting the number of rows with "Large".
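Because R treats TRUE as 1 and FALSE as 0 when summing, the ifelse step can probably be skipped. A minimal sketch of the same summary:

```r
library(dplyr)

df <- data.frame(
  Group = c('A', 'A', 'B', 'B', 'B'),
  Size  = c('Large', 'Large', 'Large', 'Small', 'Small')
)

df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n(),
            Large_n = sum(Size == "Large"))   # sum of a logical vector counts the TRUEs

df_summary
```

If Size could contain NA, adding na.rm = TRUE to sum() would keep the count from becoming NA.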
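# add the per-Group total, then the per-(Group, Size) count, to every row
df_summary <- df %>%
group_by(Group) %>%
mutate(group_n=n())%>%
ungroup() %>%
group_by(Group,Size) %>%
mutate(Large_n=n()) %>%
ungroup() %>%
# keeps the first row per Group, so this relies on a "Large" row coming first
distinct(Group, .keep_all = TRUE)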
# A tibble: 2 x 4
Group Size group_n Large_n
<chr> <chr> <int> <int>
1 A Large 2 2
2 B Large 3 1

Performing operations on dplyr summaries

Assume we have some random data:
data <- data.frame(ID = rep(seq(1:3),3),
Var = sample(1:9, 9))
we can compute summarizing operations using dplyr, like this:
library(dplyr)
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var))
which gives output like this when rendered below an R Markdown chunk:
ID count
1 3
2 3
3 3
I would like to know how we can perform operations on individual data points in this dplyr output without saving the output in a separate object.
For example, in the output of summarise, let's say we wanted to subtract the output value for ID == 3 from the sum of the output values for ID == 1 and ID == 2, and leave the output values for ID == 1 and ID == 2 as they are. The only way I know to do this is to save the summary output in another object and perform the operation on that object, like this:
a<-
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var))
a
#now perform the operation on a
a[3,2] <- a[2,1]+a[2,2]-1
a
a now looks like this:
ID count
1 3
2 3
3 4
Is there a way to do this in dplyr output without making new objects? Can we somehow use mutate directly on output like this?
We can add a mutate after the summarise and use replace to modify the position given as its second argument (list):
library(dplyr)
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var)) %>%
mutate(count = replace(count, n(), count[2] + ID[2] - 1))
-output
# A tibble: 3 x 2
ID count
<int> <dbl>
1 1 3
2 2 3
3 3 4
Or if there are more than two columns, use sum on the sliced row
# note: cur_data() is deprecated as of dplyr 1.1.0 in favor of pick()
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var)) %>%
mutate(count = replace(count, n(), sum(cur_data() %>%
slice(2)) - 1))
Alternative that does what you say you want ("sum others") but not what you demonstrate.
data %>%
group_by(ID) %>%
summarize(count = n_distinct(Var)) %>%
mutate(count = if_else(ID == 3L, sum(count) - count, count))
# # A tibble: 3 x 2
# ID count
# <int> <int>
# 1 1 3
# 2 2 3
# 3 3 6
or, if there are other IDs that should not be included in the sum, then
data %>%
group_by(ID) %>%
summarize(count = n_distinct(Var)) %>%
mutate(count = if_else(ID == 3L, sum(count[ID %in% 1:2]), count))

How to use slice in dplyr to keep the rows with NA values in R

I have the following dataset, and I want the minimum word for each group. If a group has no minimum (its word is NA), I still want that group displayed:
df=data.frame(
key=c("A","A","B","B","C"),
word=c(1,2,3,5,NA))
df%>%group_by(key)%>%slice(which.min(word))
This drops the row with key = C, word = NA, which I want to keep. The desired output is:
df_out=data.frame(
key=c("A","B","C"),
word=c(1,3,NA))
We can create a logical condition with is.na in filter and return the NA rows as well after doing the grouping by 'key'
library(dplyr)
df %>%
group_by(key) %>%
filter(word == min(word)|is.na(word))
Or using slice. We don't need any if/else condition
df %>%
group_by(key) %>%
slice(which(word ==min(word)|is.na(word)))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Or more compactly
df %>%
group_by(key) %>%
slice(match(min(word), word))
# A tibble: 3 x 2
# Groups: key [3]
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
NOTE: Using match returns the index of the first match.
which.min removes the NA
which.min(c(NA, 1, 3))
#[1] 2
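When every value is NA, which.min has nothing left to return, which is why the original slice(which.min(word)) drops group C. A short sketch that reproduces this (idx and res are illustrative names):

```r
library(dplyr)

# With only NA values, which.min() returns a zero-length index
idx <- which.min(c(NA_real_, NA_real_))
length(idx)

df <- data.frame(key  = c("A", "A", "B", "B", "C"),
                 word = c(1, 2, 3, 5, NA))

# slice() with a zero-length index keeps no rows, so key C disappears
res <- df %>%
  group_by(key) %>%
  slice(which.min(word)) %>%
  ungroup()

res
```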
We can check the condition with if: if every word in a group is NA, we return the first row; otherwise we return the row with the minimum.
library(dplyr)
df %>%
group_by(key)%>%
slice(if(all(is.na(word))) 1L else which.min(word))
# key word
# <chr> <dbl>
#1 A 1
#2 B 3
#3 C NA
Another option is to arrange the data by word and select the 1st row in each group.
df %>% arrange(key, word) %>% group_by(key) %>% slice(1L)
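This works because arrange() sorts NA values to the end, so an NA row comes first only when the group has no non-missing word. A sketch with made-up data (df2) where an NA appears before a real value:

```r
library(dplyr)

# Hypothetical data: key A has an NA listed before its only real value
df2 <- data.frame(key  = c("A", "A", "C"),
                  word = c(NA, 2, NA))

res <- df2 %>%
  arrange(key, word) %>%   # NA is placed after the non-missing values
  group_by(key) %>%
  slice(1L) %>%
  ungroup()

res
```

Here key A should yield 2 and key C should yield NA.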
You can create a modified slice function using the tidyverse package, which returns NAs:
library(tidyverse)
slice_uneven = function(.data, .idx) {
.data_ = .data %>% add_row() # Add an extra row
.idx_ = .idx %>% c(NA) %>% replace_na(nrow(.data_)) # Replace NA with index of the extra row
.data_[.idx_,] %>% head(-1) %>% remove_rownames() %>% return() # Subset, remove extra row, and reset rownames before returning data
}
slice_uneven(cars, c(1, 2, 3, NA, NA, 3, 2))
You can also arrange by word and use distinct from dplyr to get the desired output.
library(dplyr)
df %>%
arrange(word) %>%
distinct(key, .keep_all = TRUE)
# key word
#1 A 1
#2 B 3
#3 C NA

using mutate with row and column indexing and group by

I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1)))
I want to create a variable 'baseline' that takes the content of the variable 'value' where time == 1, within each group. The desired output would be:
data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1),
baseline = c(1,1,3,3)))
I tried the following code with indexing, but I am clearly going wrong somewhere:
x <- data.frame(list(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1)))
x %>% group_by(group) %>%
mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups: group [2]
# group time value baseline
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 2 1
#3 b 1 3 3
#4 b 2 4 3
and if it is already ordered by 'time', then simply use first
df1 %>%
group_by(group) %>%
mutate(baseline = first(value))
data
df1 <- data.frame(group=c('a','a','b','b'),
time=c(1,2,1,2),
value = seq(1,4,1))
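If the baseline should be tied to time == 1 specifically, rather than to the smallest time or the first row, a logical index is another option. This sketch assumes each group has exactly one row with time == 1, since otherwise mutate would receive a value of the wrong length:

```r
library(dplyr)

df1 <- data.frame(group = c('a', 'a', 'b', 'b'),
                  time  = c(1, 2, 1, 2),
                  value = seq(1, 4, 1))

res <- df1 %>%
  group_by(group) %>%
  mutate(baseline = value[time == 1]) %>%   # assumes one time == 1 row per group
  ungroup()

res
```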
