How to keep ties when using the dplyr distinct() function?

I'm using dplyr distinct() with multiple variables and am trying to figure out how to handle "ties". For example, when running the code at the bottom of this post against example data frame label_1, I'd like to get these results in situations like this where there's a tie with eleCnt and grpID variables:
Element Group eleCnt grpID grpRnk Explain grpRnk column...
<chr> <dbl> <int> <int> <int>
1 R 1 1 3 1 Ranked 1st since it has lowest eleCnt & lowest grpID
2 X 3 1 3 1 Also ranked 1st since it ties with above in terms of eleCnt and grpID
3 R 2 3 7 2 Ranked 2nd since its eleCnt is 2nd and its grpID is 2nd
When I run the code against data frame label_2, there are no ties and the code gives me this correct output:
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 B 2 1 3 1
2 R 3 1 6 2
3 X 4 1 10 3
4 R 1 4 9 4
5 R 2 6 13 5
Any recommendations for an efficient way to do this, preferably in dplyr? Maybe distinct() isn't the right function to be using?
Code:
library(dplyr)
label_1 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
Group = c(0,1,1,2,2,0,3,3,0,0,0),
eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
grpID = c(0,3,3,7,7,0,3,3,0,0,0))
label_2 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))
label_2 %>% select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(eleCnt,grpID, .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = 1:n())

Perhaps you can leverage data.table::rleid() function, like this:
f <- function(lab) {
filter(lab,Group!=0) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = data.table::rleid(eleCnt,grpID)) %>%
group_by(grpID) %>%
filter(grpRnk==min(grpRnk))
}
Apply f() to label_1
f(label_1)
# A tibble: 3 x 5
# Groups: grpID [2]
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 3 1 3 1
3 R 2 3 7 3
Apply f() to label_2
f(label_2)
# A tibble: 5 x 5
# Groups: grpID [5]
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 B 2 1 3 1
2 R 3 1 6 2
3 X 4 1 10 3
4 R 1 4 9 9
5 R 2 6 13 12
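If you would rather stay within dplyr, here is a minimal sketch of the same idea without data.table. It assumes dplyr 1.1.0 or later (for consecutive_id()), and it swaps slice(which.min(Group)) for slice_head(n = 1), which picks the same first row per Element/Group block. The helper name rank_groups() is made up for illustration. Dropping distinct() keeps the tied rows, and because arrange() puts tied (eleCnt, grpID) pairs next to each other, they end up sharing one rank:

```r
library(dplyr)

label_1 <- data.frame(
  Element = c("B","R","R","R","R","B","X","X","X","X","X"),
  Group   = c(0,1,1,2,2,0,3,3,0,0,0),
  eleCnt  = c(1,1,2,3,4,2,1,2,3,4,5),
  grpID   = c(0,3,3,7,7,0,3,3,0,0,0)
)
label_2 <- data.frame(
  Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
  Group   = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
  eleCnt  = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
  grpID   = c(6,6,6,10,10,10,10,3,3,9,9,13,13)
)

# rank_groups() is a hypothetical helper name, not a dplyr function.
rank_groups <- function(lab) {
  lab %>%
    filter(Group > 0) %>%
    group_by(Element, Group) %>%
    slice_head(n = 1) %>%        # first row of each Element/Group block
    ungroup() %>%
    arrange(eleCnt, grpID) %>%   # tied pairs become adjacent rows
    # consecutive_id() starts a new id whenever eleCnt or grpID changes,
    # so adjacent tied rows share the same rank.
    mutate(grpRnk = consecutive_id(eleCnt, grpID))
}

res_1 <- rank_groups(label_1)
res_2 <- rank_groups(label_2)
res_1$grpRnk  # 1 1 2
res_2$grpRnk  # 1 2 3 4 5
```

Since no distinct() step is involved, nothing is dropped for being a duplicate; the ranking step alone decides which rows share a rank.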

Related

How to use dplyr distinct function with multiple data frame variables and when there are ties?

I'm using dplyr distinct() for the first time and I'm trying to figure out how to use it with multiple variables and how to handle "ties". For example, when I run the code shown at the bottom of this post against example data frame label_18, I get the correct results shown and explained below (note that there are no ties in the eleCnt and grpID columns in this example):
Element Group eleCnt grpID grpRnk Explain grpRnk column...
<chr> <dbl> <int> <int> <int>
1 B 2 1 3 1 Ranked 1st since it has lowest eleCnt & lowest grpID
2 R 3 1 6 2 Ranked 2nd since it has lowest eleCnt & 2nd lowest grpID
3 X 4 1 10 3 Same pattern as above
4 R 1 4 9 4 Same pattern as above
5 R 2 6 13 5 Same pattern as above
Now when I run the code against label_7, there is a tie between eleCnt and grpID, and I get these results:
Element Group eleCnt grpID grpRnk
<chr> <dbl> <int> <int> <int>
1 R 1 1 3 1
2 R 2 3 7 2
Expected output: I would like the results for label_7 to be (while retaining the output for label_18 shown above):
Element Group eleCnt grpID grpRnk Explain grpRnk column...
<chr> <dbl> <int> <int> <int>
1 R 1 1 3 1 Ranked 1st since it has lowest eleCnt & lowest grpID
2 X 3 1 3 1 Also ranked 1st since it ties with above
3 R 2 3 7 2 Ranked 2nd since its eleCnt is 2nd and its grpID is 2nd
How do I modify distinct() for handling ties, so I can get the desired results for label_7 while keeping the same results for label_18? Maybe there's a better way to do this completely, some function other than distinct() for this sort of thing.
Code:
library(dplyr)
label_7 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
Group = c(0,1,1,2,2,0,3,3,0,0,0),
eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
grpID = c(0,3,3,7,7,0,3,3,0,0,0))
label_18 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))
label_7 %>% select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(eleCnt,grpID, .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = 1:n())
Edit: adding another data frame to test against, label_15 --
> label_15
Element Group eleCnt grpID
1 B 0 1 0
2 R 1 1 3
3 R 1 2 3
4 R 0 3 0
5 X 2 1 3
6 X 2 2 3
7 X 3 3 7
8 X 3 4 7
Expected results would be similar to label_7, because of a tie between Elements R and X in rows 2 and 5 of the above data frame:
Element Group eleCnt grpID grpRank
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 2 1 3 1
3 X 3 3 7 2
Code for label_15 data frame:
label_15 <- data.frame(Element = c("B","R","R","R","X","X","X","X"),
Group = c(0,1,1,0,2,2,3,3),
eleCnt = c(1,1,2,3,1,2,3,4),
grpID = c(0,3,3,0,3,3,7,7))
We could try
library(dplyr)
library(data.table)
label_7 %>%
select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
select(-tmp) %>%
mutate(grpRank= match(grpID, unique(grpID)))
-output
# A tibble: 3 × 5
Element Group eleCnt grpID grpRank
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 3 1 3 1
3 R 2 3 7 2
For the second case
label_18 %>%
select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
select(-tmp) %>%
mutate(grpRank= match(grpID, unique(grpID)))
-output
# A tibble: 5 × 5
Element Group eleCnt grpID grpRank
<chr> <dbl> <dbl> <dbl> <int>
1 B 2 1 3 1
2 R 3 1 6 2
3 X 4 1 10 3
4 R 1 4 9 4
5 R 2 6 13 5
Here's another possibility that correctly processes all 3 scenarios in the post:
filter(label_15,Group!=0) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = data.table::rleid(eleCnt,grpID)) %>%
group_by(grpID) %>%
filter(grpRnk==min(grpRnk)) %>%
ungroup %>%
mutate(grpRnk=data.table::rleid(grpID))
Output:
# A tibble: 3 x 5
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 2 1 3 1
3 X 3 3 7 2
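A note on why arrange() comes before rleid() in the pipeline above: data.table::rleid() numbers runs of adjacent equal values, not distinct values, so equal values only share an id when they sit next to each other. A small sketch with made-up vectors (not taken from the data frames in the post):

```r
library(data.table)

unsorted <- c(3, 7, 3)

# The two 3s are not adjacent, so each value starts its own run.
ids_unsorted <- rleid(unsorted)        # 1 2 3

# After sorting, the equal values form a single run and share an id.
ids_sorted <- rleid(sort(unsorted))    # 1 1 2

# With several vectors, a new id starts whenever any of them changes.
ids_pair <- rleid(c(1, 1, 3), c(3, 3, 7))  # 1 1 2
```

So if tied rows are meant to get the same rank, the data should be sorted on the ranking columns before rleid() is called.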

Extract the data if the first row of each id is 1 using R

Here, I made a simple data to demonstrate what I want to do.
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
id is a personal identifier. disease=1 means that the person has the disease; disease=0 means that the person doesn't have it. There are 3 people in df. For id 1, the value of disease in the first row is 0. For ids 2 and 3, on the other hand, the value of disease in the first two rows is 1. I want to extract all rows for each id whose first row has disease equal to 1.
So, I should extract the data with id 2 and 3. My expected output is
df<-data.frame(id=c(2,2,2,2,3,3),
date=c(20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(1,1,1,0,1,1))
You can use filter() on the data grouped by id: test the condition on the first row with row_number() == 1, and wrap it in any() so that every row of a matching group is kept, like this:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
library(dplyr)
df %>%
group_by(id) %>%
filter(any(row_number() == 1 & disease == 1))
#> # A tibble: 6 × 3
#> # Groups: id [2]
#> id date disease
#> <dbl> <dbl> <dbl>
#> 1 2 20220514 1
#> 2 2 20220517 1
#> 3 2 20220518 1
#> 4 2 20220519 0
#> 5 3 20220613 1
#> 6 3 20220618 1
Created on 2022-07-25 by the reprex package (v2.0.1)
If you only want to select the rows that meet your condition you can use this:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
library(dplyr)
df %>%
group_by(id) %>%
filter(row_number() == 1 & disease == 1)
#> # A tibble: 2 × 3
#> # Groups: id [2]
#> id date disease
#> <dbl> <dbl> <dbl>
#> 1 2 20220514 1
#> 2 3 20220613 1
We could also do like this:
library(dplyr)
df %>%
group_by(id) %>%
filter(first(disease)==1)
id date disease
<dbl> <dbl> <dbl>
1 2 20220514 1
2 2 20220517 1
3 2 20220518 1
4 2 20220519 0
5 3 20220613 1
6 3 20220618 1
In base R you can do:
ids_disease <- df$id[!duplicated(df$id) & df$disease == 1]
df[df$id %in% ids_disease, ]
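All of the answers above take "first row" to mean the first row in the current row order. If the rows might not be sorted by date, one possible variant (a sketch, assuming "first" should mean the earliest date within each id) looks up disease at the minimum date instead. The shuffled data frame below is made up purely to show that the row order no longer matters:

```r
library(dplyr)

df <- data.frame(
  id      = c(1,1,1,2,2,2,2,3,3),
  date    = c(20220311,20220315,20220317,20220514,20220517,
              20220518,20220519,20220613,20220618),
  disease = c(0,1,0,1,1,1,0,1,1)
)

# Shuffle the rows to mimic unsorted input (for illustration only).
shuffled <- df[c(9, 4, 2, 7, 1, 5, 8, 3, 6), ]

res <- shuffled %>%
  group_by(id) %>%
  # Keep the whole group when disease is 1 at that id's earliest date.
  filter(disease[which.min(date)] == 1) %>%
  ungroup() %>%
  arrange(id, date)

res$id       # 2 2 2 2 3 3
res$disease  # 1 1 1 0 1 1
```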

group by two sets of vars in a function

I'm using the sample dataset below:
mytable <- read.table(text=
"group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8",
header = TRUE, stringsAsFactors = FALSE)
I want to create separate data frames for each set of variables that I want to group by, I also want to group by two variables as well... I'm not sure how to do that. For example, I want a separate dataframe that groups the data by both team and ID as well... how do I do that?
library(dplyr)
lapply(c("group","team","ID",c("team","ID")), function(x){
group_by(mytable,across(c(x,num)))%>%summarise(Count = n()) %>% mutate(new=x)%>% as.data.frame()
})
See if this is what you want.
library(dplyr)
cols <- list("group","team","ID", c("team","ID"))
lapply(cols, function(x, dat = mytable){
dat2 <- dat %>%
group_by(across(all_of(x))) %>%
summarise(Count = n()) %>%
mutate(new = toString(x)) %>%
as.data.frame()
return(dat2)
})
# `summarise()` has grouped output by 'team'. You can override using the `.groups` argument.
# [[1]]
# group Count new
# 1 a 4 group
# 2 b 4 group
#
# [[2]]
# team Count new
# 1 x 4 team
# 2 y 4 team
#
# [[3]]
# ID Count new
# 1 4 2 ID
# 2 5 1 ID
# 3 7 1 ID
# 4 8 1 ID
# 5 9 3 ID
#
# [[4]]
# team ID Count new
# 1 x 4 1 team, ID
# 2 x 7 1 team, ID
# 3 x 9 2 team, ID
# 4 y 4 1 team, ID
# 5 y 5 1 team, ID
# 6 y 8 1 team, ID
# 7 y 9 1 team, ID
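A slightly different sketch of the same approach, in case it is useful: naming the list elements makes each result easy to retrieve, and .groups = "drop" in summarise() avoids the grouping message. The names assigned to cols and the object name counts are my own choices, not anything required by dplyr:

```r
library(dplyr)

mytable <- data.frame(
  group = c("a","a","a","a","b","b","b","b"),
  team  = c("x","x","y","y","x","y","x","y"),
  num   = c(1, 2, 3, 4, 1, 4, 3, 2),
  ID    = c(9, 4, 5, 9, 7, 4, 9, 8),
  stringsAsFactors = FALSE
)

# A list (not c()) keeps c("team", "ID") together as one grouping set.
cols <- list("group", "team", "ID", c("team", "ID"))
names(cols) <- vapply(cols, toString, character(1))

counts <- lapply(cols, function(x) {
  mytable %>%
    group_by(across(all_of(x))) %>%            # x is a character vector of column names
    summarise(Count = n(), .groups = "drop") %>%
    as.data.frame()
})

counts[["team, ID"]]  # one data frame per grouping set, looked up by name
```

Note that c("group", "team", "ID", c("team", "ID")) in the question flattens into a single character vector of five names, which is why the two-variable grouping was never applied there.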
Does this, based on tidyverse, give you what you want?
library(tidyverse)
mytable %>%
group_by(team, ID) %>%
group_split()
<list_of<
tbl_df<
group: character
team : character
num : integer
ID : integer
>
>[7]>
[[1]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a x 2 4
[[2]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b x 1 7
[[3]]
# A tibble: 2 × 4
group team num ID
<chr> <chr> <int> <int>
1 a x 1 9
2 b x 3 9
[[4]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b y 4 4
[[5]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a y 3 5
[[6]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b y 2 8
[[7]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a y 4 9

How can a table be rearranged one step at a time so that two or more observations are listed in a row in successive columns?

So far I have done this to achieve the desired result:
# A tibble: 4 x 2
frag treat
<dbl> <dbl>
1 1 1
2 2 1
3 1 2
4 2 2
treat_1 <- tab_example %>% filter(treat == "1")
treat_2 <- tab_example %>% filter(treat == "2")
new_tab_example <- full_join(treat_1, treat_2, by = "frag")
> new_tab_example
# A tibble: 2 x 3
frag treat.x treat.y
<dbl> <dbl> <dbl>
1 1 1 2
2 2 1 2
Is there a way to do it in one step?
You can use pivot_wider :
tidyr::pivot_wider(tab_example, names_from = treat,
names_prefix = 'treat', values_from = treat)
# frag treat1 treat2
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 2 1 2
There is also a way using the spread() function:
library(dplyr)
library(tidyr)
# Your data
df = tibble(frag = c(1, 2, 1, 2), treat = c(1,1,2,2) )
dfnew = df %>%
mutate(treat_name = case_when(treat==1 ~ 'treat.x', # Build names of columns
treat==2 ~ 'treat.y')
) %>%
spread(treat_name, treat) # Use spread function
If you print the result:
print(dfnew)
# A tibble: 2 x 3
frag treat.x treat.y
<dbl> <dbl> <dbl>
1 1 1 2
2 2 1 2
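For completeness, here is a sketch of the same reshaping with pivot_wider() that first builds the new column names in a separate helper column, since spread() is superseded in current tidyr. The column name col and the "treat_" prefix are arbitrary choices for this example:

```r
library(dplyr)
library(tidyr)

tab_example <- tibble(frag = c(1, 2, 1, 2), treat = c(1, 1, 2, 2))

wide <- tab_example %>%
  mutate(col = paste0("treat_", treat)) %>%           # name of the target column
  pivot_wider(names_from = col, values_from = treat)  # one row per frag

names(wide)   # "frag" "treat_1" "treat_2"
wide$treat_2  # 2 2
```

Keeping the names in their own column means the values column is not used for two purposes at once, which can make the call easier to read.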

Filter a tibble with two conditions

I have this tibble:
tibble(id=c(4,4), client=c(5,10), stock=c(NA,10))
# A tibble: 2 x 3
id client stock
<dbl> <dbl> <dbl>
1 4 5 NA
2 4 10 10
from which I want to keep the row where client == 5 and stock == 10. How would I filter that? So my desired outcome would be:
# A tibble: 1 x 3
id client stock
<dbl> <dbl> <dbl>
1 4 5 10
Not sure about the context of filtering using values from different rows, but see if below operation works for you.
> library(dplyr)
> library(tidyr)  # fill() comes from tidyr
> df %>% fill(stock, .direction = 'up') %>% filter(client == 5 & stock == 10)
# A tibble: 1 x 3
id client stock
<dbl> <dbl> <dbl>
1 4 5 10
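To see why the fill step is needed, here is a small sketch comparing a direct filter with a filled one. It assumes the stock value should be shared within each id, which is my reading of the question rather than something it states; under that assumption the fill is done per id with .direction = "updown":

```r
library(dplyr)
library(tidyr)

df <- tibble(id = c(4, 4), client = c(5, 10), stock = c(NA, 10))

# Direct filter: stock is NA in the client == 5 row, the comparison
# NA == 10 yields NA, and filter() drops rows whose condition is NA.
direct <- df %>%
  filter(client == 5 & stock == 10)

# Fill the missing stock within each id first, then filter.
filled <- df %>%
  group_by(id) %>%
  fill(stock, .direction = "updown") %>%
  ungroup() %>%
  filter(client == 5, stock == 10)

nrow(direct)  # 0
nrow(filled)  # 1
```

Grouping by id before fill() keeps a stock value from one id from leaking into another when the data has more than one id.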
