I'm using dplyr distinct() with multiple variables and am trying to figure out how to handle "ties". For example, when running the code at the bottom of this post against example data frame label_1, I'd like to get these results in situations like this where there's a tie with eleCnt and grpID variables:
Element Group eleCnt grpID grpRnk Explain grpRnk column...
<chr> <dbl> <int> <int> <int>
1 R 1 1 3 1 Ranked 1st since it has lowest eleCnt & lowest grpID
2 X 3 1 3 1 Also ranked 1st since it ties with above in terms of eleCnt and grpID
3 R 2 3 7 2 Ranked 2nd since its eleCnt and its grpID are both 2nd lowest
When I run the code against data frame label_2, there are no ties and the code gives me this correct output:
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 B 2 1 3 1
2 R 3 1 6 2
3 X 4 1 10 3
4 R 1 4 9 4
5 R 2 6 13 5
Any recommendations for an efficient way to do this, preferably in dplyr? Maybe distinct() isn't the right function to be using?
Code:
library(dplyr)
label_1 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
Group = c(0,1,1,2,2,0,3,3,0,0,0),
eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
grpID = c(0,3,3,7,7,0,3,3,0,0,0))
label_2 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))
label_2 %>% select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(eleCnt,grpID, .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = 1:n())
Perhaps you can use the data.table::rleid() function, like this:
f <- function(lab) {
filter(lab,Group!=0) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = data.table::rleid(eleCnt,grpID)) %>%
group_by(grpID) %>%
filter(grpRnk==min(grpRnk))
}
Apply f() to label_1
f(label_1)
# A tibble: 3 x 5
# Groups: grpID [2]
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 3 1 3 1
3 R 2 3 7 3
Apply f() to label_2
f(label_2)
# A tibble: 5 x 5
# Groups: grpID [5]
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 B 2 1 3 1
2 R 3 1 6 2
3 X 4 1 10 3
4 R 1 4 9 9
5 R 2 6 13 12
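Note that f() leaves gaps in grpRnk (1, 1, 3 and 1, 2, 3, 9, 12), whereas the question asks for consecutive ranks. Below is a minimal dplyr-only sketch that produces consecutive ranks with ties sharing a rank. It assumes the intended row per Element/Group is the one with the lowest eleCnt; rank_groups() is a hypothetical helper name, not a dplyr function.

```r
library(dplyr)

label_1 <- data.frame(
  Element = c("B","R","R","R","R","B","X","X","X","X","X"),
  Group   = c(0,1,1,2,2,0,3,3,0,0,0),
  eleCnt  = c(1,1,2,3,4,2,1,2,3,4,5),
  grpID   = c(0,3,3,7,7,0,3,3,0,0,0)
)
label_2 <- data.frame(
  Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
  Group   = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
  eleCnt  = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
  grpID   = c(6,6,6,10,10,10,10,3,3,9,9,13,13)
)

# rank_groups() is a hypothetical helper, introduced only for this sketch
rank_groups <- function(lab) {
  lab %>%
    filter(Group > 0) %>%
    group_by(Element, Group) %>%
    # assumption: keep the row with the lowest eleCnt in each Element/Group
    slice_min(eleCnt, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(eleCnt, grpID) %>%
    # after sorting, the rank increases only when a new (eleCnt, grpID) pair appears
    mutate(grpRnk = cumsum(!duplicated(data.frame(eleCnt, grpID))))
}

res_1 <- rank_groups(label_1)
res_2 <- rank_groups(label_2)

res_1$grpRnk  # 1 1 2
res_2$grpRnk  # 1 2 3 4 5
```

Because no distinct() step is involved, tied rows are kept rather than dropped, and they receive the same rank.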
I'm using dplyr distinct() for the first time and I'm trying to figure out how to use it with multiple variables and how to handle "ties". For example, when I run the code shown at the bottom of this post against example data frame label_18, I get the correct results shown and explained below (note that there are no ties in the eleCnt and grpID columns in this example):
Element Group eleCnt grpID grpRnk Explain grpRnk column...
<chr> <dbl> <int> <int> <int>
1 B 2 1 3 1 Ranked 1st since it has lowest eleCnt & lowest grpID
2 R 3 1 6 2 Ranked 2nd since it has lowest eleCnt & 2nd lowest grpID
3 X 4 1 10 3 Same pattern as above
4 R 1 4 9 4 Same pattern as above
5 R 2 6 13 5 Same pattern as above
Now when I run the code against label_7, two rows tie on both eleCnt and grpID, and I get these results:
Element Group eleCnt grpID grpRnk
<chr> <dbl> <int> <int> <int>
1 R 1 1 3 1
2 R 2 3 7 2
Expected output: I would like the results for label_7 to be (while retaining the output for label_18 shown above):
Element Group eleCnt grpID grpRnk Explain grpRnk column...
<chr> <dbl> <int> <int> <int>
1 R 1 1 3 1 Ranked 1st since it has lowest eleCnt & lowest grpID
2 X 3 1 3 1 Also ranked 1st since it ties with above
3 R 2 3 7 2 Ranked 2nd since its eleCnt and its grpID are both 2nd lowest
How do I modify distinct() for handling ties, so I can get the desired results for label_7 while keeping the same results for label_18? Maybe there's a better way to do this completely, some function other than distinct() for this sort of thing.
Code:
library(dplyr)
label_7 <- data.frame(Element=c("B","R","R","R","R","B","X","X","X","X","X"),
Group = c(0,1,1,2,2,0,3,3,0,0,0),
eleCnt = c(1,1,2,3,4,2,1,2,3,4,5),
grpID = c(0,3,3,7,7,0,3,3,0,0,0))
label_18 <- data.frame(Element = c("R","R","R","X","X","X","X","B","B","R","R","R","R"),
Group = c(3,3,3,4,4,4,4,2,2,1,1,2,2),
eleCnt = c(1,2,3,1,2,3,4,1,2,4,5,6,7),
grpID = c(6,6,6,10,10,10,10,3,3,9,9,13,13))
label_7 %>% select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(eleCnt,grpID, .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = 1:n())
Edit: adding another data frame to test against, label_15 --
> label_15
Element Group eleCnt grpID
1 B 0 1 0
2 R 1 1 3
3 R 1 2 3
4 R 0 3 0
5 X 2 1 3
6 X 2 2 3
7 X 3 3 7
8 X 3 4 7
Expected results would be similar to label_7, because of a tie between Elements R and X in rows 2 and 5 of the above data frame:
Element Group eleCnt grpID grpRank
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 2 1 3 1
3 X 3 3 7 2
Code for label_15 data frame:
label_15 <- data.frame(Element = c("B","R","R","R","X","X","X","X"),
Group = c(0,1,1,0,2,2,3,3),
eleCnt = c(1,1,2,3,1,2,3,4),
grpID = c(0,3,3,0,3,3,7,7))
We could try
library(dplyr)
library(data.table)
label_7 %>%
select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
select(-tmp) %>%
mutate(grpRank= match(grpID, unique(grpID)))
-output
# A tibble: 3 × 5
Element Group eleCnt grpID grpRank
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 3 1 3 1
3 R 2 3 7 2
For the second case
label_18 %>%
select(Element,Group,eleCnt,grpID) %>%
filter(Group > 0) %>%
group_by(Element,Group) %>%
slice(which.min(Group)) %>%
ungroup() %>%
distinct(tmp = rleid(eleCnt, grpID), .keep_all = TRUE) %>%
arrange(eleCnt,grpID) %>%
select(-tmp) %>%
mutate(grpRank= match(grpID, unique(grpID)))
-output
# A tibble: 5 × 5
Element Group eleCnt grpID grpRank
<chr> <dbl> <dbl> <dbl> <int>
1 B 2 1 3 1
2 R 3 1 6 2
3 X 4 1 10 3
4 R 1 4 9 4
5 R 2 6 13 5
Here's another possibility that correctly processes all 3 scenarios in the post:
filter(label_15,Group!=0) %>%
arrange(eleCnt,grpID) %>%
mutate(grpRnk = data.table::rleid(eleCnt,grpID)) %>%
group_by(grpID) %>%
filter(grpRnk==min(grpRnk)) %>%
ungroup %>%
mutate(grpRnk=data.table::rleid(grpID))
Output:
# A tibble: 3 x 5
Element Group eleCnt grpID grpRnk
<chr> <dbl> <dbl> <dbl> <int>
1 R 1 1 3 1
2 X 2 1 3 1
3 X 3 3 7 2
Here, I made a simple data to demonstrate what I want to do.
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
id is a personal ID. disease = 1 means that the person has the disease; disease = 0 means that the person doesn't. There are 3 people in df. For id 1, the value of disease in the first row is 0. For ids 2 and 3, on the other hand, the value of disease in the first two rows is 1. I want to keep all rows of every id whose first row has disease equal to 1.
So, I should extract the data with id 2 and 3. My expected output is
df<-data.frame(id=c(2,2,2,2,3,3),
date=c(20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(1,1,1,0,1,1))
You can group_by(id) and then filter with any(), which keeps a whole group whenever its first row (row_number() == 1) meets the condition:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
library(dplyr)
df %>%
group_by(id) %>%
filter(any(row_number() == 1 & disease == 1))
#> # A tibble: 6 × 3
#> # Groups: id [2]
#> id date disease
#> <dbl> <dbl> <dbl>
#> 1 2 20220514 1
#> 2 2 20220517 1
#> 3 2 20220518 1
#> 4 2 20220519 0
#> 5 3 20220613 1
#> 6 3 20220618 1
Created on 2022-07-25 by the reprex package (v2.0.1)
If you only want to select the rows that meet your condition you can use this:
df<-data.frame(id=c(1,1,1,2,2,2,2,3,3),
date=c(20220311,20220315,20220317,20220514,20220517,20220518,20220519,20220613,20220618),
disease=c(0,1,0,1,1,1,0,1,1))
library(dplyr)
df %>%
group_by(id) %>%
filter(row_number() == 1 & disease == 1)
#> # A tibble: 2 × 3
#> # Groups: id [2]
#> id date disease
#> <dbl> <dbl> <dbl>
#> 1 2 20220514 1
#> 2 3 20220613 1
We could also do it like this:
library(dplyr)
df %>%
group_by(id) %>%
filter(first(disease)==1)
id date disease
<dbl> <dbl> <dbl>
1 2 20220514 1
2 2 20220517 1
3 2 20220518 1
4 2 20220519 0
5 3 20220613 1
6 3 20220618 1
In base R you can do:
ids_disease <- df$id[!duplicated(df$id) & df$disease == 1]
df[df$id %in% ids_disease, ]
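All of the approaches above treat "first row" as the first row in the current row order. If the data might not be sorted, one option is to sort before grouping. The sketch below assumes that "first" means the earliest date per id, which the question does not state explicitly; the reversed copy of df is only there to simulate unsorted input.

```r
library(dplyr)

df <- data.frame(
  id      = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  date    = c(20220311, 20220315, 20220317, 20220514, 20220517,
              20220518, 20220519, 20220613, 20220618),
  disease = c(0, 1, 0, 1, 1, 1, 0, 1, 1)
)

# Reverse the rows to simulate input that is not sorted by date
unsorted <- df[rev(seq_len(nrow(df))), ]

res <- unsorted %>%
  arrange(id, date) %>%          # assumption: "first" = earliest date per id
  group_by(id) %>%
  filter(first(disease) == 1) %>%
  ungroup()

unique(res$id)  # 2 3
```

Without the arrange() step, id 2 would be dropped here, because its first row in the reversed data has disease equal to 0.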
I'm using the sample dataset below:
mytable <- read.table(text=
"group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8",
header = TRUE, stringsAsFactors = FALSE)
I want to create a separate data frame for each variable (or set of variables) that I group by, including grouping by two variables at once. For example, I want a separate data frame that groups the data by both team and ID. How do I do that? My attempt so far:
library(dplyr)
lapply(c("group","team","ID",c("team","ID")), function(x){
group_by(mytable,across(c(x,num)))%>%summarise(Count = n()) %>% mutate(new=x)%>% as.data.frame()
})
See if this is what you want.
library(dplyr)
cols <- list("group","team","ID", c("team","ID"))
lapply(cols, function(x, dat = mytable){
dat2 <- dat %>%
group_by(across({{x}})) %>%
summarise(Count = n()) %>%
mutate(new = toString(x)) %>%
as.data.frame()
return(dat2)
})
# `summarise()` has grouped output by 'team'. You can override using the `.groups` argument.
# [[1]]
# group Count new
# 1 a 4 group
# 2 b 4 group
#
# [[2]]
# team Count new
# 1 x 4 team
# 2 y 4 team
#
# [[3]]
# ID Count new
# 1 4 2 ID
# 2 5 1 ID
# 3 7 1 ID
# 4 8 1 ID
# 5 9 3 ID
#
# [[4]]
# team ID Count new
# 1 x 4 1 team, ID
# 2 x 7 1 team, ID
# 3 x 9 2 team, ID
# 4 y 4 1 team, ID
# 5 y 5 1 team, ID
# 6 y 8 1 team, ID
# 7 y 9 1 team, ID
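The key change from the question's code is list() instead of c(): c("group","team","ID",c("team","ID")) flattens into a five-element character vector, so the team/ID pair is lost. As a variation, here is a minimal sketch that uses count() with across(all_of(x)) in place of group_by() plus summarise(), which also avoids the grouping message. Naming the list elements is an addition for convenience, not part of the answer above.

```r
library(dplyr)

mytable <- read.table(text =
"group team num ID
1 a x 1 9
2 a x 2 4
3 a y 3 5
4 a y 4 9
5 b x 1 7
6 b y 4 4
7 b x 3 9
8 b y 2 8",
header = TRUE, stringsAsFactors = FALSE)

# A list keeps c("team", "ID") together as a single grouping set
cols <- list("group", "team", "ID", c("team", "ID"))

counts <- lapply(cols, function(x) {
  mytable %>%
    count(across(all_of(x)), name = "Count") %>%  # one row per combination
    mutate(new = toString(x))
})

# Name each result after its grouping set, e.g. "team, ID"
names(counts) <- vapply(cols, toString, character(1))

counts[["team, ID"]]
```

Each list element is an ordinary data frame, so counts[["team"]] or counts[["team, ID"]] can be pulled out and used separately.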
Does this, based on tidyverse, give you what you want?
library(tidyverse)
mytable %>%
group_by(team, ID) %>%
group_split()
<list_of<
tbl_df<
group: character
team : character
num : integer
ID : integer
>
>[7]>
[[1]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a x 2 4
[[2]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b x 1 7
[[3]]
# A tibble: 2 × 4
group team num ID
<chr> <chr> <int> <int>
1 a x 1 9
2 b x 3 9
[[4]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b y 4 4
[[5]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a y 3 5
[[6]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 b y 2 8
[[7]]
# A tibble: 1 × 4
group team num ID
<chr> <chr> <int> <int>
1 a y 4 9
I have a table - read from an excel file, with column names in English and some variables in Hebrew.
As I read the excel file and receive a tibble, the column names don't fit the data.
I use the following code to read the table:
library(readxl)
excel_file <- file.path(the file path, the file)
tab_1 <- read_xlsx(excel_file)
tab_1
The result that I'm getting:
# A tibble: 2 x 5
case a b c d
<chr> <dbl> <dbl> <dbl> <dbl>
1 שחור 3 2 1 4
2 אדום 2 5 2 3
>
How can I change the order of the column names? I have looked all over and found no solution.
You can do it by specifying the column indexes
Using the iris dataset as an example
First, change to a tibble
library(dplyr)
iris2 <- iris %>% as_tibble()
Reverse columns by manually specifying by column index
iris2[,c(5,4,3,2,1)]
Or do the same programmatically
iris2[,ncol(iris2):1]
When the tibble becomes wider (more columns) the answer looks something like this
> tab_1 <- tab_1[,ncol(tab_1):1]
> print(tab_1)
# A tibble: 2 x 19
result q p o n m l k j i h g f e d c b a
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 שחור 2 4 5 6 4 6 2 1 2 5 2 3 4 4 1 2 3
2 אדום 4 3 5 5 6 3 0 3 3 4 5 3 5 3 2 5 2
case
<chr>
1 שחור
2 אדום
changing back the column names of the first part
I generally use the tidyverse for this kind of task. The select() function reverses a data frame's columns in a single clean line, e.g.:
library(tidyverse)
n <- ncol(mtcars)
mtcars2 <- select(mtcars, c(n:1))
Another option is to use rev():
library(dplyr)
mtcars %>%
select(rev(names(.)))
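Applied to the question's own table, the same idea might look like the sketch below. The tibble is a hypothetical stand-in for tab_1, rebuilt from the values printed in the question, since the original Excel file isn't available. The second variant, which keeps case in front, is an assumption about what may be wanted rather than something the question asks for.

```r
library(dplyr)

# Hypothetical stand-in for tab_1, using the values shown in the question
tab_1 <- tibble(
  case = c("שחור", "אדום"),
  a = c(3, 2),
  b = c(2, 5),
  c = c(1, 2),
  d = c(4, 3)
)

# Reverse the order of all columns
tab_rev <- tab_1 %>%
  select(all_of(rev(names(tab_1))))
names(tab_rev)
# "d" "c" "b" "a" "case"

# Variant: keep `case` first and reverse only the remaining columns
tab_rev_keep <- tab_1 %>%
  select(case, all_of(rev(setdiff(names(tab_1), "case"))))
names(tab_rev_keep)
# "case" "d" "c" "b" "a"
```

all_of() with a character vector makes it explicit that the column names come from outside the data frame, which avoids ambiguity if a column happens to share a name with a variable.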