Create new variable based on outcome of other variable in group - R

This is a similar/follow-up question to "R: How to code new variable based on grouped variable and conditioned on earlier row", but it is different because within donors there are potentially two match runs.
I have a data file with organ donors. I'm looking at lungs that are donated - there are two lungs.
If the lungs are split (L and R) and put up for donation, they are each attempted to match with recipients ("matchrun"). They go through eligible recipients until one matches ("sequence").
If the lung is matched to a recipient, it goes to them ("organ_placed").
If the lung doesn't match, it continues in the sequence and then just remains NA at the maximum sequence number.
I would like to create a new variable that holds the outcome of the match run, such that if one lung is placed and the other is not, it tells you that the other lung was discarded. For example, see Donor 2 in the data: the left lung is placed, but the right doesn't match.
In donor 3, the first match run doesn't match but the match run for the other lung does.
I figure it will be something like group_by(donorid, matchrun) but then how do you make a condition based on the match run?
library(tibble)  # tribble() is exported by the tibble package
library(dplyr)
data <- tribble(
~donorid, ~matchrun, ~sequence, ~organ_placed,
2, 3, 1, NA,
2, 3, 2, NA,
2, 3, 3, "L",
2, 4, 1, NA,
2, 4, 2, NA,
2, 4, 3, NA,
3, 5, 1, NA,
3, 5, 1, NA,
3, 5, 1, NA,
3, 6, 1, NA,
3, 6, 2, NA,
3, 6, 3, "L"
)
desired_outcome <- tribble(
~donorid, ~matchrun, ~sequence, ~organ_placed, ~organ,
2, 3, 1, NA, NA,
2, 3, 2, NA, NA,
2, 3, 3, "L", "Left Single",
2, 4, 1, NA, NA,
2, 4, 2, NA, NA,
2, 4, 3, NA, "Right Discarded",
3, 5, 1, NA, NA,
3, 5, 1, NA, NA,
3, 5, 1, NA, "Right Discarded",
3, 6, 1, NA, NA,
3, 6, 2, NA, NA,
3, 6, 3, "L", "Left Single")

You can try this:
data %>%
group_by(donorid) %>%
mutate(temp = ifelse(n_distinct(organ_placed, na.rm = TRUE) == 1, unique(na.omit(organ_placed)), "B")) %>%
group_by(matchrun, .add = TRUE) %>%
mutate(organ = case_when(organ_placed == "L" ~ "Left Single",
organ_placed == "R" ~ "Right Single",
all(is.na(organ_placed)) & row_number() == max(sequence) & temp == "L" ~ "Right Discarded",
all(is.na(organ_placed)) & row_number() == max(sequence) & temp == "R" ~ "Left Discarded")) %>%
ungroup()
output
donorid matchrun sequence organ_placed temp organ
1 1 1 1 NA B NA
2 1 1 2 NA B NA
3 1 1 3 L B Left Single
4 1 2 1 NA B NA
5 1 2 2 NA B NA
6 1 2 3 R B Right Single
7 2 3 1 NA L NA
8 2 3 2 NA L NA
9 2 3 3 L L Left Single
10 2 4 1 NA L NA
11 2 4 2 NA L NA
12 2 4 3 NA L Right Discarded

Update: we have to add matchrun to the group. Removed prior solution:
data %>%
group_by(donorid, matchrun) %>%
mutate(outcome = case_when(organ_placed == "L" ~ "Left Single",
organ_placed == "R" ~ "Right Single",
organ_placed == "B" ~ "Bilateral",
(is.na(organ_placed) &
row_number() == max(row_number())) &
"L" %in% organ_placed ~ "Right Discarded",
(is.na(organ_placed) &
row_number() == max(row_number())) &
"R" %in% organ_placed ~ "Left Discarded",
TRUE ~ NA_character_))
Groups: donorid, matchrun [4]
donorid matchrun sequence organ_placed outcome
<dbl> <dbl> <dbl> <chr> <chr>
1 2 3 1 NA NA
2 2 3 2 NA NA
3 2 3 3 L Left Single
4 2 4 1 NA NA
5 2 4 2 NA NA
6 2 4 3 NA NA
7 3 5 1 NA NA
8 3 5 1 NA NA
9 3 5 1 NA NA
10 3 6 1 NA NA
11 3 6 2 NA NA
12 3 6 3 L Left Single
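Note that the output above still shows NA for the unplaced runs (rows 6 and 9): once matchrun is part of the grouping, "L" %in% organ_placed is only checked inside the current match run, where nothing was placed. One possible way around this, offered as a sketch rather than a definitive fix, is to record the placed side per donor first and only then group by match run. Here placed_side is a hypothetical helper column, and the code assumes at most one lung is placed per donor and that rows are already in order within each match run.

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~donorid, ~matchrun, ~sequence, ~organ_placed,
  2, 3, 1, NA,
  2, 3, 2, NA,
  2, 3, 3, "L",
  2, 4, 1, NA,
  2, 4, 2, NA,
  2, 4, 3, NA,
  3, 5, 1, NA,
  3, 5, 1, NA,
  3, 5, 1, NA,
  3, 6, 1, NA,
  3, 6, 2, NA,
  3, 6, 3, "L"
)

result <- data %>%
  group_by(donorid) %>%
  # Side that was placed anywhere for this donor (NA if none was placed)
  mutate(placed_side = organ_placed[!is.na(organ_placed)][1]) %>%
  group_by(matchrun, .add = TRUE) %>%
  mutate(organ = case_when(
    organ_placed == "L" ~ "Left Single",
    organ_placed == "R" ~ "Right Single",
    # Run with no placement: flag its last row with the opposite side
    all(is.na(organ_placed)) & row_number() == n() & placed_side == "L" ~ "Right Discarded",
    all(is.na(organ_placed)) & row_number() == n() & placed_side == "R" ~ "Left Discarded"
  )) %>%
  ungroup() %>%
  select(-placed_side)

result
```

Using row_number() == n() rather than max(sequence) picks the last row of the run even when sequence repeats, as it does for donor 3, match run 5.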

We can use
library(data.table)
library(stringr)
setDT(data)[, seq2 := rowid(donorid, matchrun) ]
data[, organ := str_replace_all(organ_placed,
setNames(c("Left Single", "Right Single"), c("L", "R")))]
data[seq2 == max(seq2),
organ := fcase(!is.na(organ), organ, default =
str_replace_all(setdiff(c("Left Single", "Right Single"), organ),
setNames(c("Left Discarded", "Right Discarded"),
c("Left Single", "Right Single")))), donorid
][, seq2 := NULL][]
-output
> data
donorid matchrun sequence organ_placed organ
1: 2 3 1 <NA> <NA>
2: 2 3 2 <NA> <NA>
3: 2 3 3 L Left Single
4: 2 4 1 <NA> <NA>
5: 2 4 2 <NA> <NA>
6: 2 4 3 <NA> Right Discarded
7: 3 5 1 <NA> <NA>
8: 3 5 1 <NA> <NA>
9: 3 5 1 <NA> Right Discarded
10: 3 6 1 <NA> <NA>
11: 3 6 2 <NA> <NA>
12: 3 6 3 L Left Single

Related

R incrementing a variable in dplyr

I have the following grouped data frame:
library(dplyr)
# Create a sample dataframe
df <- data.frame(
student = c("A", "A", "A","B","B", "B", "C", "C","C"),
grade = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
age= c(NA, 6, 6, 7, 7, 7, NA, NA, 9)
)
I want to update the age of each student so that it is one plus the age in the previous year, with their age in the first year they appear in the dataset remaining unchanged. For example, student A's age should be NA, 6, 7, student B's age should be 7,8,9, and student C's age should be NA, NA, 9.
How about this:
library(dplyr)
df <- data.frame(
student = c("A", "A", "A","B","B", "B", "C", "C","C"),
grade = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
age= c(NA, 6, 6, 7, 7, 7, NA, NA, 9)
)
df %>%
group_by(student) %>%
mutate(age = age + cumsum(!is.na(age))-1)
#> # A tibble: 9 × 3
#> # Groups: student [3]
#> student grade age
#> <chr> <dbl> <dbl>
#> 1 A 1 NA
#> 2 A 2 6
#> 3 A 3 7
#> 4 B 1 7
#> 5 B 2 8
#> 6 B 3 9
#> 7 C 1 NA
#> 8 C 2 NA
#> 9 C 3 9
Created on 2022-12-30 by the reprex package (v2.0.1)
In data.table, assuming the order of the rows is the 'correct' order:
library(data.table)
setDT(df)[, new_age := age + rowid(age) - 1, by = .(student)]
# student grade age new_age
# 1: A 1 NA NA
# 2: A 2 6 6
# 3: A 3 6 7
# 4: B 1 7 7
# 5: B 2 7 8
# 6: B 3 7 9
# 7: C 1 NA NA
# 8: C 2 NA NA
# 9: C 3 9 9

replace negative values with na using na_if{dplyr}

Let's say I have the following dataframe:
dat <- tribble(
~V1, ~V2,
2, -3,
3, 2,
1, 3,
3, -4,
5, 1,
3, 2,
1, -4,
3, 4,
4, 1,
3, -5,
4, 2,
3, 4
)
How can I replace negative values with NA using na_if()? I know how to do this using ifelse(), but I can't come up with a correct condition for na_if():
> dat %>%
+ mutate(V2 = ifelse(V2 < 0, NA, V2))
# A tibble: 12 x 2
V1 V2
<dbl> <dbl>
1 2 NA
2 3 2
3 1 3
4 3 NA
5 5 1
6 3 2
7 1 NA
8 3 4
9 4 1
10 3 NA
11 4 2
12 3 4
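na_if(x, y) replaces values of x that are equal to y, so it cannot express a condition such as V2 < 0 on its own. A minimal sketch of a type-stable alternative that stays within dplyr, using if_else() with an explicit NA_real_ (dat_clean is just an illustrative name):

```r
library(dplyr)
library(tibble)

dat <- tribble(
  ~V1, ~V2,
  2, -3,
  3, 2,
  1, 3,
  3, -4,
  5, 1,
  3, 2,
  1, -4,
  3, 4,
  4, 1,
  3, -5,
  4, 2,
  3, 4
)

# if_else() checks that both branches have the same type,
# so the missing value is spelled NA_real_ for a double column
dat_clean <- dat %>%
  mutate(V2 = if_else(V2 < 0, NA_real_, V2))

dat_clean
```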

R - How to fill in values in NA, but only when ending value is the same as the beginning value?

I have the following example data:
Example <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))
col1
1
NA
NA
4
NA
NA
6
NA
NA
NA
6
8
NA
2
NA
I want to fill the NAs with value from above, but only if the NAs are between 2 identical values. In this example the first NA gap from 1 to 4 should not be filled with 1s. But the gap between the first 6 and the second 6 should be filled, with 6s. All other values should stay NA.
Therefore, afterwards it should look like:
col1
1
NA
NA
4
NA
NA
6
6
6
6
6
8
NA
2
NA
In reality I have not just 15 observations but over 50,000, so I need an efficient solution, which is more difficult than I thought. I tried to use the fill function but was not able to come up with a solution.
One dplyr and zoo option could be:
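library(dplyr)
library(zoo)
# Fill only where the last value carried forward equals the next value carried backward
Example %>%
mutate(cond = na.locf0(col1) == na.locf0(col1, fromLast = TRUE),
col1 = ifelse(cond, na.locf0(col1), col1)) %>%
select(-cond)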
col1
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
Here is a dplyr solution:
First I create the data in tibble format:
df <- tibble(
x = c(1, NA_real_, NA_real_,
4, NA_real_, NA_real_,
6, NA_real_, NA_real_, NA_real_,
6, 8, NA_real_, 2, NA_real_)
)
Next, I create two grouping variables which will be helpful in identifying the first and the last non-NA value.
I then save these reference values to ref_start and ref_end.
In the end I overwrite the values of x:
df %>%
mutate(gr1 = cumsum(!is.na(x))) %>%
group_by(gr1) %>%
mutate(ref_start = first(x)) %>%
ungroup() %>%
mutate(gr2 = lag(gr1, default = 1)) %>%
group_by(gr2) %>%
mutate(ref_end = last(x)) %>%
ungroup() %>%
mutate(x = if_else(is.na(x) & ref_start == ref_end, ref_start, x))
# A tibble: 15 x 1
x
<dbl>
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
df <- data.frame(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, 8, NA, 2, NA))
library(data.table)
library(magrittr)
setDT(df)[!is.na(col1), n := .N, by = col1] %>%
.[, n := nafill(n, type = "locf")] %>%
.[n == 2, col1 := nafill(col1, type = "locf")] %>%
.[, n := NULL] %>%
.[]
#> col1
#> 1: 1
#> 2: NA
#> 3: NA
#> 4: 4
#> 5: NA
#> 6: NA
#> 7: 6
#> 8: 6
#> 9: 6
#> 10: 6
#> 11: 6
#> 12: 8
#> 13: NA
#> 14: 2
#> 15: NA
Created on 2021-10-11 by the reprex package (v2.0.1)
Here is a tidyverse approach using dplyr and tidyr:
Logic:
Create an id column
Remove all na rows
Flag if next value is the same
right_join with first Example df
fill down flag and corresponding col1.y
mutate with an ifelse
library(dplyr)
library(tidyr)
Example <- Example %>%
mutate(id=row_number())
Example %>%
na.omit() %>%
mutate(flag = ifelse(col1==lead(col1), TRUE, FALSE)) %>%
right_join(Example, by="id") %>%
arrange(id) %>%
fill(col1.y, .direction="down") %>%
fill(flag, .direction="down") %>%
mutate(col1.x = ifelse(flag==TRUE, col1.y, col1.x), .keep="unused") %>%
select(col1 = col1.x)
Output:
col1
1 1
2 NA
3 NA
4 4
5 NA
6 NA
7 6
8 6
9 6
10 6
11 6
12 8
13 NA
14 2
15 NA
The solution above with data.table (from Yuriy Saraykin) works only for the example. As Daniel Hendrick comments: it seems the NAs keep getting filled after the ending value, where the fill should really stop. For instance, if the data were (6, NA, NA, 6, NA, 8), that solution would give (6, 6, 6, 6, 6, 8).
Here is another proposal with data.table:
library(data.table)
df <- data.table(col1 =c(1, NA, NA, 4, NA, NA, 6, NA, NA, NA, 6, NA, NA, 8, NA, 2, NA))
cond = nafill(df$col1, type = "locf") == nafill(df$col1, type = "nocb")
df[which(cond==T), col1 := nafill(df$col1, type = "locf")[which(cond==T)]]
df$col1
[1] 1 NA NA 4 NA NA 6 6 6 6 6 NA NA 8 NA 2 NA
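To make the difference concrete, the same forward/backward comparison can be wrapped in a small function and run on the counterexample quoted above. This is only a sketch: fill_between_equal is a hypothetical name, and it assumes a numeric vector, since nafill() works on numeric input.

```r
library(data.table)

fill_between_equal <- function(x) {
  fwd <- nafill(x, type = "locf")  # last observed value carried forward
  bwd <- nafill(x, type = "nocb")  # next observed value carried backward
  # Fill only where both neighbours exist and agree; otherwise keep x as is
  ifelse(!is.na(fwd) & !is.na(bwd) & fwd == bwd, fwd, x)
}

counterexample <- fill_between_equal(c(6, NA, NA, 6, NA, 8))
counterexample
# [1]  6  6  6  6 NA  8
```

The NA between 6 and 8 stays missing because its two neighbours differ.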

gather multiple columns with nested, repeated measures

I have a dataset of people (pid) of different types (type2=c("dad", "mom", "kid"); and for ease, type=c("a", "b", "c")) nested in households (hid) with repeated measurements (time).
Some variables like v1_ are asked to everyone, but the values are spread across three columns. For instance, v1_a contains the values for all of the dads (type==a).
Variables like v2_ are only asked of dads and moms (a's and b's), and the values are spread across two columns.
Variables like v3 are also only asked to dads and moms, but the values are contained in one column.
Variables like v4 are asked to everyone, and the values are contained in one column.
Have:
hid pid type type2 time v1_a v1_b v1_c v2_a v2_b v3 v4
1 1 1 a dad 1 6 NA NA 2 NA 4 3
2 1 2 b mom 1 NA 2 NA NA 5 6 6
3 1 3 c kid 1 NA NA 1 NA NA NA 5
4 2 4 a dad 1 3 NA NA 6 NA 2 6
5 2 5 b mom 1 NA 5 NA NA 2 4 3
6 2 6 c kid 1 NA NA 3 NA NA NA 5
7 1 1 a dad 2 3 NA NA 2 NA 4 3
8 1 2 b mom 2 NA 3 NA NA 5 6 6
9 1 3 c kid 2 NA NA 2 NA NA NA 5
10 2 4 a dad 2 2 NA NA 6 NA 2 6
11 2 5 b mom 2 NA 3 NA NA 2 4 3
12 2 6 c kid 2 NA NA 2 NA NA NA 5
Here is the end result I want:
hid pid type type2 time v1 v2 v3 v4
1 1 1 a dad 1 6 2 4 3
2 1 2 b mom 1 2 5 6 6
3 1 3 c kid 1 1 NA NA 5
4 2 4 a dad 1 3 6 2 6
5 2 5 b mom 1 5 2 4 3
6 2 6 c kid 1 3 NA NA 5
7 1 1 a dad 2 3 2 4 3
8 1 2 b mom 2 3 5 6 6
9 1 3 c kid 2 2 NA NA 5
10 2 4 a dad 2 2 6 2 6
11 2 5 b mom 2 3 2 4 3
12 2 6 c kid 2 2 NA NA 5
I'm looking for a tidyverse approach that will handle a larger actual use case of mixed variables as shown here. The variable naming is consistent. Where do I go after gather()?
library(tidyverse)
df_have <- data.frame(hid=c(1, 1, 1, 2, 2, 2,
1, 1, 1, 2, 2, 2),
pid=c(1, 2, 3, 4, 5, 6,
1, 2, 3, 4, 5, 6),
type=c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c"),
type2=c("dad", "mom", "kid", "dad", "mom", "kid",
"dad", "mom", "kid", "dad", "mom", "kid"),
time=c(1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2),
v1_a=c(6, NA, NA, 3, NA, NA,
3, NA, NA, 2, NA, NA),
v1_b=c(NA, 2, NA, NA, 5, NA,
NA, 3, NA, NA, 3, NA),
v1_c=c(NA, NA, 1, NA, NA, 3,
NA, NA, 2, NA, NA, 2),
v2_a=c(2, NA, NA, 6, NA, NA,
2, NA, NA, 6, NA, NA),
v2_b=c(NA, 5, NA, NA, 2, NA,
NA, 5, NA, NA, 2, NA),
v3=c(4, 6, NA, 2, 4, NA,
4, 6, NA, 2, 4, NA),
v4=c(3, 6, 5, 6, 3, 5,
3, 6, 5, 6, 3, 5)
)
df_want <- data.frame(hid=c(1, 1, 1, 2, 2, 2,
1, 1, 1, 2, 2, 2),
pid=c(1, 2, 3, 4, 5, 6,
1, 2, 3, 4, 5, 6),
type=c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c"),
type2=c("dad", "mom", "kid", "dad", "mom", "kid",
"dad", "mom", "kid", "dad", "mom", "kid"),
time=c(1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2),
v1=c(6, 2, 1, 3, 5, 3,
3, 3, 2, 2, 3, 2),
v2=c(2, 5, NA, 6, 2, NA,
2, 5, NA, 6, 2, NA),
v3=c(4, 6, NA, 2, 4, NA,
4, 6, NA, 2, 4, NA),
v4=c(3, 6, 5, 6, 3, 5,
3, 6, 5, 6, 3, 5)
)
df_have %>%
gather(key, value, -hid, -pid, -type, -type2, -time)
Here is another idea using coalesce from dplyr and map from purrr.
library(tidyverse)
# Set target column names
cols <- paste0("v", 1:4)
# Coalesce the numbers based on column names
nums <- map(cols, ~coalesce(!!!as.list(df_have %>% select(starts_with(.x)))))
# Create a data frame
nums_df <- nums %>%
setNames(cols) %>%
as_tibble()  # as_data_frame() is deprecated in current tibble
# Create the final output by bind_cols
df_test <- df_have %>%
select(-starts_with("v")) %>%
bind_cols(nums_df)
df_test
# hid pid type type2 time v1 v2 v3 v4
# 1 1 1 a dad 1 6 2 4 3
# 2 1 2 b mom 1 2 5 6 6
# 3 1 3 c kid 1 1 NA NA 5
# 4 2 4 a dad 1 3 6 2 6
# 5 2 5 b mom 1 5 2 4 3
# 6 2 6 c kid 1 3 NA NA 5
# 7 1 1 a dad 2 3 2 4 3
# 8 1 2 b mom 2 3 5 6 6
# 9 1 3 c kid 2 2 NA NA 5
# 10 2 4 a dad 2 2 6 2 6
# 11 2 5 b mom 2 3 2 4 3
# 12 2 6 c kid 2 2 NA NA 5
This gets me there, but the filter(!is.na(value)) step seems like a hack. Better ideas?
df_test <-
df_have %>%
gather(key, value, -hid, -pid, -type, -time, -type2) %>%
mutate(key = str_replace(key, "_.*", "")) %>%
filter(!is.na(value)) %>%
spread(key, value) %>%
arrange(time, hid, type, pid)
Update from @www:
df_test <-
df_have %>%
gather(key, value, -hid, -pid, -type, -time, -type2, na.rm=TRUE) %>%
mutate(key = str_replace(key, "_.*", "")) %>%
spread(key, value) %>%
arrange(time, hid, type, pid)
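In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same idea with the newer verbs, shown on a toy subset of the data (one household at one time point; toy and tidy are illustrative names, and the column pattern is assumed to match the naming in the question):

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  pid  = c(1, 2, 3),
  type = c("a", "b", "c"),
  v1_a = c(6, NA, NA),
  v1_b = c(NA, 2, NA),
  v1_c = c(NA, NA, 1),
  v2_a = c(2, NA, NA),
  v2_b = c(NA, 5, NA),
  v3   = c(4, 6, NA),
  v4   = c(3, 6, 5)
)

tidy <- toy %>%
  # Stack every v* column and drop the structural NAs in one step
  pivot_longer(starts_with("v"), names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  # Strip the person-type suffix so v1_a, v1_b and v1_c all become v1
  mutate(key = sub("_.*", "", key)) %>%
  pivot_wider(names_from = key, values_from = value) %>%
  arrange(pid)

tidy
```

Cells with no surviving value, such as v2 and v3 for the kid, come back as NA after pivot_wider().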

How to replace NA values between values based on multiple conditions

My zoo (time series) data set looks like below and goes on for hundreds of rows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
NA NA NA NA 1 1 1 NA NA NA 3 3 3 NA NA 1 1
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
This variable is part of a larger zoo data set. The general pattern of this variable is a series of 1's, then NAs, then 3's, then NAs, and repeat the pattern again starting with a series of 1's. There is no regular pattern of the number of NAs.
I am trying to (i) fill the NAs between the 1's and 3's with 2, (ii) fill the NAs between the 3's and subsequent 1's with 4, and (iii) fill the NAs in the first four observations with 4 following the general pattern. When done, the values will be a series of 1, 2, 3, and 4 without a pattern of the quantity for each of the four values.
I have spent hours trying ifelse and for loops without success. (Relatively newbie with this part of R.)
I previously did this task in Stata but can't figure out the code in R to fill the NAs. The Stata code to fill the NAs is:
replace cycle4I = 2 if missing(cycle4I) & (cycle4I[_n-1] == 1 | cycle4I[_n-1] == 2) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 3)
replace cycle4I = 4 if missing(cycle4I) & (cycle4I[_n-1] == 3 | cycle4I[_n-1] == 4) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 1)
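Here is a straightforward way. is.na(x) adds 1 wherever the original value was missing, so an NA that follows a 1 becomes 2 and an NA that follows a 3 becomes 4; temporarily setting x[1] to 3 gives na.locf() a value to carry over the leading NAs, which therefore become 4: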
library(zoo)
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
x <- cycle4I
x[1] <- 3
x <- is.na(x) + na.locf(x)
x[1] <- 4
Which gives:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
Here is one way
library(dplyr)
library(zoo)
tibble(cycle4I = c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)) %>%
mutate(final =
cycle4I %>%
lag %>%
na.locf(na.rm = FALSE) %>%
`+`(1) %>%
ifelse(is.na(cycle4I),
., cycle4I) )
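As posted, this pipeline leaves the first four observations as NA, because lag() has no earlier value to carry forward there. A sketch of one way to also cover requirement (iii), with prev as a hypothetical helper column and the leading NAs assumed to belong to phase 4:

```r
library(dplyr)
library(zoo)

cycle <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)

filled <- tibble(cycle4I = cycle) %>%
  mutate(
    # Most recent observed value strictly before each row (NA at the start)
    prev = na.locf(dplyr::lag(cycle4I), na.rm = FALSE),
    final = case_when(
      !is.na(cycle4I) ~ cycle4I,   # keep observed 1s and 3s
      is.na(prev)     ~ 4,         # leading NAs: assumed to be phase 4
      TRUE            ~ prev + 1   # NA after 1 -> 2, NA after 3 -> 4
    )
  )

filled$final
```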
