Replacing missing values by group and identifying mutual exclusiveness

I am working with grouped data in R.
In the example below, I would like to fill the missing values of the "sex" variable within each id, and leave them as NA when the id has no recorded value at all (i.e. id = 6).
In the "diagnosis" variable, some ids have a single diagnosis and some have several. I would also like to summarise "diagnosis" into a new variable, "wanted", that identifies which diagnoses each id has (mutual exclusiveness).
The example data is:
d.f <- tribble (
~id, ~sex, ~diagnosis,
1, "M", "A",
1, NA, "B",
1, NA, "C",
2, NA, "A",
2, "F", NA,
2, NA, "A",
3, NA, NA,
3, "M", "A",
3, "M", "B",
4, "F", "C",
5, "F", "B",
6, NA, "A",
7, "M", NA
)
The desired data is:
wanted <- tribble (
~id, ~sex, ~diagnosis,~wanted,
1, "M", "A", "ABC group",
1, "M", "B", "ABC group",
1, "M", "C", "ABC group",
2, "F", "A", "Only A",
2, "F", NA, "Only A",
2, "F", "A", "Only A",
3, "M", NA, "AB group",
3, "M", "A", "AB group",
3, "M", "B", "AB group",
4, "F", "C", "Only C",
5, "F", "B", "Only B",
6, NA, "A", "Only A",
7, "M", NA, "Missing"
)

Fill the sex column with first(na.omit(sex)). first() is just an aggregating function that picks the first non-missing value in each group, and it returns NA when the group has none (as for id 6), so it is safe to use here.
The wanted column can be built in two steps.
First, paste the distinct non-missing diagnoses of each group together using paste(unique(na.omit(diagnosis)), collapse = '').
Then, use case_when() to turn that string into the label of your choice.
library(tidyverse)
d.f %>%
group_by(id) %>%
mutate(sex = first(na.omit(sex)),
wanted = { x <- paste(unique(na.omit(diagnosis)), collapse = '');
case_when(nchar(x) == 1 ~ paste0('Only ', x),
nchar(x) == 0 ~ 'Missing',
TRUE ~ paste0(x, ' Group'))})
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC Group
#> 2 1 M B ABC Group
#> 3 1 M C ABC Group
#> 4 2 F A Only A
#> 5 2 F <NA> Only A
#> 6 2 F A Only A
#> 7 3 M <NA> AB Group
#> 8 3 M A AB Group
#> 9 3 M B AB Group
#> 10 4 F C Only C
#> 11 5 F B Only B
#> 12 6 <NA> A Only A
#> 13 7 M <NA> Missing
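
The labelling step can also be written as a small standalone function, which makes the three cases easier to check in isolation. The following is a minimal base R sketch; label_diagnoses() is a hypothetical helper name, not part of any package. It counts the distinct diagnoses with length() rather than nchar(), so it would also work if a diagnosis code had more than one character.

```r
# Hypothetical helper: build the label for the diagnoses of one id
label_diagnoses <- function(diagnosis) {
  x <- unique(diagnosis[!is.na(diagnosis)])   # distinct, non-missing values
  if (length(x) == 0) {
    "Missing"
  } else if (length(x) == 1) {
    paste("Only", x)
  } else {
    paste(paste(x, collapse = ""), "group")
  }
}

label_diagnoses(c("A", "B", "C"))   # "ABC group"
label_diagnoses(c("A", NA, "A"))    # "Only A"
label_diagnoses(NA_character_)      # "Missing"
```

In the pipeline above, it could be called as mutate(wanted = label_diagnoses(diagnosis)) after group_by(id).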

library(dplyr)
library(tidyr)
library(stringr)
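d.f %>%
group_by(id) %>%
drop_na(diagnosis) %>%
# one row per id holding its distinct diagnoses pasted together
summarise(wanted = str_c(unique(diagnosis), collapse = "")) %>%
# join back to the full data; ids without any diagnosis get NA
full_join(d.f, . , by = "id") %>%
group_by(id) %>%
fill(sex, .direction = "updown")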
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC
#> 2 1 M B ABC
#> 3 1 M C ABC
#> 4 2 F A A
#> 5 2 F <NA> A
#> 6 2 F A A
#> 7 3 M <NA> AB
#> 8 3 M A AB
#> 9 3 M B AB
#> 10 4 F C C
#> 11 5 F B B
#> 12 6 <NA> A A
#> 13 7 M <NA> <NA>
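
For readers without the tidyverse, the group-wise fill of sex can be sketched in base R with ave(). This is only an illustration on a subset of the example data (ids 1, 2 and 6), and the data frame name d is an assumption made for the example.

```r
# Small subset of the example data (ids 1, 2 and 6)
d <- data.frame(
  id  = c(1, 1, 1, 2, 2, 2, 6),
  sex = c("M", NA, NA, NA, "F", NA, NA),
  stringsAsFactors = FALSE
)

# Within each id, replace every value by the first non-missing one;
# groups with no recorded value are returned unchanged (all NA)
d$sex <- ave(d$sex, d$id, FUN = function(s) {
  known <- s[!is.na(s)]
  if (length(known) == 0) s else rep(known[1], length(s))
})

d$sex   # "M" "M" "M" "F" "F" "F" NA
```

Like first(na.omit(sex)), this assumes that each id has at most one distinct recorded sex.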

This can also be used. Note that because across() returns a data frame, the new column is printed as wanted$diagnosis:
library(dplyr)
d.f %>%
group_by(id) %>%
mutate(sex = coalesce(sex, sex[!is.na(sex)][1]),
wanted = across(diagnosis, ~ {x <- unique(diagnosis[!is.na(diagnosis)])
if_else(length(x) > 1, paste(paste(x, collapse = ""), "Group"),
if_else(length(x) == 1, paste("Only", x[1]), "Missing")
)}))
# A tibble: 13 x 4
# Groups: id [7]
id sex diagnosis wanted$diagnosis
<dbl> <chr> <chr> <chr>
1 1 M A ABC Group
2 1 M B ABC Group
3 1 M C ABC Group
4 2 F A Only A
5 2 F NA Only A
6 2 F A Only A
7 3 M NA AB Group
8 3 M A AB Group
9 3 M B AB Group
10 4 F C Only C
11 5 F B Only B
12 6 NA A Only A
13 7 M NA Missing

Related

R create serial number based on two different columns [duplicate]

I have a data frame, which looks like this:
DF_A <- data.frame(
Group_1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C"),
Group_2 = c("A", "B", "C", "A", "B", "A", "B", "A", "C", "A")
)
I would like to assign a consecutive number for Group_1 IDs which should be unique for the case of identical Group_2 IDs. For example, A+A starts with 1, A+B proceeds with 2 (same Group_1 ID, but new Group_2 ID), ..., A+A is again 1 (obviously a repetition). B+A is 1 (new Group_1 ID), ..., B+A (same Group_1 ID, but new Group_2 ID)...and so forth.
The result should look like this.
DF_B <- data.frame(
Group_1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C"),
Group_2 = c("A", "B", "C", "A", "B", "A", "B", "A", "C", "A"),
ID = c(1, 2, 3, 1, 2, 1, 2, 1, 3, 1)
)
I investigated various posts on corresponding approaches such as single groups within groups, or a combination - without any success - this case is not covered by previous posts.
Thank you in advance.

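One way to do it with ave() is shown below. ave() returns a vector of the same type as its first argument, so Group_2 is passed as character and the result is converted to integer:
DF_A$ID <- as.integer(ave(as.character(DF_A$Group_2), DF_A$Group_1, FUN = function(x) match(x, unique(x))))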
DF_A
# Group_1 Group_2 ID
#1 A A 1
#2 A B 2
#3 A C 3
#4 A A 1
#5 A B 2
#6 B A 1
#7 B B 2
#8 B A 1
#9 B C 3
#10 C A 1
The equivalent dplyr way is:
library(dplyr)
DF_A %>%
group_by(Group_1) %>%
mutate(ID = match(Group_2, unique(Group_2)))

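You can split the data by Group_1, turn Group_2 into a factor within each group, and convert that factor to its integer codes. Note that unlist(by(...)) returns the values in group order, so this assumes the rows are already sorted by Group_1, as they are here.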
DF_A$ID <- unlist(by(DF_A, DF_A$Group_1, function(x) as.integer(factor(x$Group_2))))

We can use the dense_rank from dplyr.
library(dplyr)
DF_A2 <- DF_A %>%
group_by(Group_1) %>%
mutate(ID = dense_rank(Group_2)) %>%
ungroup()
DF_A2
# # A tibble: 10 x 3
# Group_1 Group_2 ID
# <fct> <fct> <int>
# 1 A A 1
# 2 A B 2
# 3 A C 3
# 4 A A 1
# 5 A B 2
# 6 B A 1
# 7 B B 2
# 8 B A 1
# 9 B C 3
# 10 C A 1
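
The answers above fall into two families that agree on this data but not in general: match(x, unique(x)) numbers values by order of first appearance, while dense_rank() and as.integer(factor(x)) number them by sorted order. A minimal sketch on a made-up vector (not from the question) shows the difference:

```r
x <- c("B", "A", "B", "C")

by_appearance <- match(x, unique(x))     # 1 2 1 3
by_sort_order <- as.integer(factor(x))   # 2 1 2 3
```

Which one is appropriate depends on whether the first value seen in a group should always be numbered 1.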

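You could use the integer codes of the factor levels. Note that these codes come from the levels of the whole column, not of each group, so they match a per-group numbering only when no level is skipped within a group, as is the case here. Converting explicitly with as.integer(factor()) also works when Group_2 is a character column, which is the data.frame() default since R 4.0.0.
within(DF_A, { ID = ave(as.integer(factor(Group_2)), Group_1, FUN = c) })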
# Group_1 Group_2 ID
# 1 A A 1
# 2 A B 2
# 3 A C 3
# 4 A A 1
# 5 A B 2
# 6 B A 1
# 7 B B 2
# 8 B A 1
# 9 B C 3
# 10 C A 1

R: Create numbering within each group

The data that I have:
x = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012)
)
The goal is to create the 'number' variable which shows the same number for each unique ID in sequence starting from 1.
goal = tibble(
study = c("A", "B", "C", "A", "B", "A", "B", "C", "A", "B"),
ID = c(001, 001, 001, 005, 005, 007, 007, 007, 012, 012),
number = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
)
And then, if the set of studies within an ID group is incomplete (e.g., for number = 2 the studies are only A and B, instead of A, B and C), how can I remove the observations associated with that ID (e.g., remove the observations that have a number of 2)?
Thanks!
Updated follow-up question on part B:
Once we have the goal dataset, I would like to remove the observations, grouped by ID, that do not meet the following requirements for the study variable:
A and D are required, and at least one of B and C is required (so either B or C); each letter may appear more than once.
x = tibble(
study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A", "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
ID = c(001, 001, 001, 001, 005, 005, 007, 007, 007, 012, 012, 012, 012, 012, 013, 013, 013, 018, 018, 018),
number = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6)
)
So in the goal dataset above, I would like to remove:
(1) Obs #5 and 6 which share a group number of 2, because they don't have A, B or C, and D in the study variable.
(2) Obs #18, 19, 20 which share a group number of 6, for the same reason as (1).
I would like to keep the rest of the obs because within each number group, they have A, B or C, and D. I cannot use filter(n() > 3) here, because that would delete obs with the number 5.

We could use cur_group_id()
library(dplyr)
x %>%
group_by(ID) %>%
mutate(number = cur_group_id())
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
OR
library(dplyr)
x %>%
mutate(number = cumsum(ID != lag(ID, default = first(ID)))+1)
study ID number
<chr> <dbl> <dbl>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
6 A 7 3
7 B 7 3
8 C 7 3
9 A 12 4
10 B 12 4
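
The same numbering can be sketched in base R without any package. The two variants below are illustrations, not part of the original answers: one starts a new number at every change of value, the other numbers each distinct value by first appearance.

```r
ID <- c(1, 1, 1, 5, 5, 7, 7, 7, 12, 12)

# Start a new number whenever the value differs from the previous row
by_run <- cumsum(c(TRUE, ID[-1] != ID[-length(ID)]))

# Number each distinct value by order of first appearance
by_value <- match(ID, unique(ID))

by_run     # 1 1 1 2 2 3 3 3 4 4
by_value   # 1 1 1 2 2 3 3 3 4 4
```

The two agree here because equal IDs are stored in contiguous rows. If an ID reappeared further down, by_run would give it a new number while by_value would reuse the old one.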

A) The dplyr package offers group_indices() for adding unique group identifiers (df here is the question's x):
library(dplyr)
df$number <- df %>%
group_indices(ID)
df
# A tibble: 10 × 3
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 5 2
5 B 5 2
...
B) You can keep only the groups that contain exactly 3 observations (i.e., "A", "B" and "C") with filter():
df %>%
group_by(ID) %>%
filter(n() == 3)
# A tibble: 6 × 3
# Groups: ID [2]
study ID number
<chr> <dbl> <int>
1 A 1 1
2 B 1 1
3 C 1 1
4 A 7 3
5 B 7 3
6 C 7 3

For the follow-up, where A and D are required and at least one of B or C is required:
df %>%
group_by(ID) %>%
mutate(
flag =
(
any(study %in% c("A")) &
any(study %in% c("D"))
) &
(
any(study %in% c("B")) |
any(study %in% c("C"))
)
) %>%
filter(flag)
# A tibble: 12 × 4
# Groups: ID [3]
study ID number flag
<chr> <dbl> <dbl> <lgl>
1 A 1 1 TRUE
2 B 1 1 TRUE
3 C 1 1 TRUE
4 D 1 1 TRUE
5 A 12 4 TRUE
6 B 12 4 TRUE
7 C 12 4 TRUE
8 D 12 4 TRUE
9 D 12 4 TRUE
10 A 13 5 TRUE
11 B 13 5 TRUE
12 D 13 5 TRUE
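
The rule can also be isolated in a small predicate and applied per ID in base R. This is a minimal sketch on the follow-up data; meets_rule() is a hypothetical helper name, and tapply() with a name lookup stands in for the group_by() and filter() steps.

```r
x <- data.frame(
  study = c("A", "B", "C", "D", "A", "B", "A", "B", "C", "A",
            "B", "C", "D", "D", "A", "B", "D", "B", "C", "D"),
  ID = c(1, 1, 1, 1, 5, 5, 7, 7, 7, 12, 12, 12, 12, 12, 13, 13, 13, 18, 18, 18),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a group has A and D plus at least one of B, C
meets_rule <- function(study) {
  all(c("A", "D") %in% study) && any(c("B", "C") %in% study)
}

ok <- tapply(x$study, x$ID, meets_rule)          # one TRUE/FALSE per ID
kept <- x[as.vector(ok[as.character(x$ID)]), ]   # look the flag up for each row

unique(kept$ID)   # 1 12 13
```

Keeping the rule in one function makes it easy to change the required letters later without touching the filtering code.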

how to add a column to identify specific combination of values in R?

I have a data set with many columns (> 20), two of which hold subject names. I would like to add another column containing a number that identifies each combination of the two subjects, regardless of their order.
Here is an example with only the 2 columns of names (I don't include the others for convenience):
ID1 ID2
A B
A C
A B
B C
A B
B A
C B
And here is what I would like to create:
ID1 ID2 CODE
A B 1
A C 2
A B 1
B C 3
A B 1
B A 1
C B 3
I am kind of new in R and I think it can be done with stringr but I am not sure how
Thanks for the help!
Simo

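df$CODE <- as.integer(
factor(
# sort each pair so that A-B and B-A produce the same key;
# only the two name columns are used, whatever else df contains
apply(df[c("ID1", "ID2")], 1, function(x) paste0(sort(x), collapse = ""))
)
)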
# ID1 ID2 CODE
# 1 A B 1
# 2 A C 2
# 3 A B 1
# 4 B C 3
# 5 A B 1
# 6 B A 1
# 7 C B 3
Data
df <- data.frame(
ID1 = c("A", "A", "A", "B", "A", "B", "C"),
ID2 = c("B", "C", "B", "C", "B", "A", "B")
)
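
When there are exactly two name columns, a vectorised alternative avoids the row-wise apply(). The following is a sketch, not part of the original answer: pmin() and pmax() put the alphabetically smaller name first, and match() numbers the resulting keys by first appearance.

```r
df <- data.frame(
  ID1 = c("A", "A", "A", "B", "A", "B", "C"),
  ID2 = c("B", "C", "B", "C", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Order-independent key: smaller name first, larger name second
key <- paste(pmin(df$ID1, df$ID2), pmax(df$ID1, df$ID2))

# Number the keys by order of first appearance
df$CODE <- match(key, unique(key))

df$CODE   # 1 2 1 3 1 1 3
```

Unlike as.integer(factor(...)), which numbers the keys in sorted order, this numbers them in the order they first occur; the two coincide on this data.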

Try this:
library(dplyr)
#Code
new <- df %>% rowwise() %>%
mutate(Var = paste0(sort(c(ID1, ID2)), collapse = '')) %>%
group_by(Var) %>%
mutate(CODE=cur_group_id()) %>%
ungroup() %>%
select(-Var)
Output:
# A tibble: 7 x 3
ID1 ID2 CODE
<chr> <chr> <int>
1 A B 1
2 A C 2
3 A B 1
4 B C 3
5 A B 1
6 B A 1
7 C B 3
Some data used:
#Data
df <- structure(list(ID1 = c("A", "A", "A", "B", "A", "B", "C"), ID2 = c("B",
"C", "B", "C", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-7L))

Logic for filtering dependent on two columns [duplicate]

I am struggling to write the right logic to filter two columns based only on the condition in one column. I have multiple ids and if an id appears in 2020, I want all the data for the other years that id was measured to come along.
As an example, if a group contains the number 3, I want all the values in that group. We should end up with a dataframe with all the b and d rows.
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
threes <- df4 %>%
filter(pop == 3 |&ifelse????

A bit slower than the other answers here (more steps involved), but for me a bit clearer:
df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group) -> groups
df4 %>%
filter(group %in% groups)
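or, if you want to combine the two steps (the inner pipeline must be wrapped in parentheses, because %in% and %>% have the same precedence and are evaluated left to right):
df4 %>%
filter(group %in% (df4 %>%
filter(pop == 3) %>%
distinct(group) %>%
pull(group)))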

You can do:
df4[df4$group %in% df4$group[df4$pop == 3],]
#> group pop value
#> 6 b 1 2.0
#> 7 b 2 3.0
#> 8 b 3 4.0
#> 9 b 4 3.5
#> 10 b 5 3.0
#> 16 d 1 0.5
#> 17 d 2 1.5
#> 18 d 3 6.0
#> 19 d 4 2.0
#> 20 d 5 1.5

You can do this with dplyr by combining group_by(), filter() and any(). group_by() makes the following operation run separately for each value of the grouping variable, and any() returns TRUE if the condition holds for at least one row of the group.
Follow these steps:
First, pipe the data to group_by() to group by your group variable.
Then, pipe to filter() and keep a group if any of its pop values equals 3, using any().
df4 <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b",
"c", "c", "c", "c", "c", "d", "d", "d", "d", "d"),
pop = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5),
value = c(1,2,3,2.5,2,2,3,4,3.5,3,3,2,1,2,2.5,0.5,1.5,6,2,1.5))
# load the library
library(dplyr)
threes <- df4 %>%
group_by(group) %>%
filter(any(pop == 3))
# print the result
threes
Output:
threes
# A tibble: 10 x 3
# Groups: group [2]
group pop value
<chr> <dbl> <dbl>
1 b 1 2
2 b 2 3
3 b 3 4
4 b 4 3.5
5 b 5 3
6 d 1 0.5
7 d 2 1.5
8 d 3 6
9 d 4 2
10 d 5 1.5
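
To see what any() contributes, the per-group result can be computed on its own. This is a minimal base R sketch using only the group and pop columns of the question's data; tapply() here plays the role of group_by().

```r
df4 <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = 5),
  pop   = c(1, 2, 2, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 4, 5, 1, 2, 3, 4, 5)
)

# One TRUE/FALSE per group: does any row of the group have pop == 3?
has_three <- tapply(df4$pop == 3, df4$group, any)

# Names of the groups that qualify
names(has_three)[as.vector(has_three)]   # "b" "d"
```

filter(any(pop == 3)) keeps every row of the groups for which this value is TRUE.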

An easy base R option is using subset + ave
subset(
df4,
ave(pop == 3, group, FUN = any)
)
which gives
group pop value
6 b 1 2.0
7 b 2 3.0
8 b 3 4.0
9 b 4 3.5
10 b 5 3.0
16 d 1 0.5
17 d 2 1.5
18 d 3 6.0
19 d 4 2.0
20 d 5 1.5

Use dplyr:
df4 %>% group_by(group) %>% filter(any(pop == 3))
