How to replace NA values between values based on multiple conditions - r

My zoo (time series) data set looks like below and goes on for hundreds of rows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
NA NA NA NA 1 1 1 NA NA NA 3 3 3 NA NA 1 1
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
This variable is part of a larger zoo data set. The general pattern of this variable is a series of 1's, then NAs, then 3's, then NAs, and repeat the pattern again starting with a series of 1's. There is no regular pattern of the number of NAs.
I am trying to (i) fill the NAs between the 1's and 3's with 2, (ii) fill the NAs between the 3's and subsequent 1's with 4, and (iii) fill the NAs in the first four observations with 4 following the general pattern. When done, the values will be a series of 1, 2, 3, and 4 without a pattern of the quantity for each of the four values.
I have spent hours trying ifelse and for loops without success. (Relatively newbie with this part of R.)
I previously did this task in Stata but can't figure out the code in R to fill the NAs. The Stata code to fill the NAs is:
replace cycle4I = 2 if missing(cycle4I) & (cycle4I[_n-1] == 1 | cycle4I[_n-1] == 2) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 3)
replace cycle4I = 4 if missing(cycle4I) & (cycle4I[_n-1] == 3 | cycle4I[_n-1] == 4) & (cycle4I[_n+1] == . | cycle4I[_n+1] == 1)

Here is a straightforward way:
library(zoo)
cycle4I <- zoo(c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1))
x <- cycle4I
x[1] <- 3
x <- is.na(x) + na.locf(x)
x[1] <- 4
Which gives:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
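To see why this works, here is a minimal sketch of the same idea on a plain numeric vector, written out step by step. The function name fill_cycle and its lead_value argument are made up for illustration, and the sketch assumes the vector contains at least one non-NA value and only the observed values 1 and 3.

```r
library(zoo)

# Hypothetical helper: fill NA runs with (last observed value + 1),
# and the leading NA run with `lead_value`.
fill_cycle <- function(v, lead_value = 4) {
  first_obs <- which(!is.na(v))[1]         # position of the first observed value
  carried <- na.locf(v, na.rm = FALSE)     # carry the last observed 1 or 3 forward
  out <- ifelse(is.na(v), carried + 1, v)  # NA after 1 becomes 2, NA after 3 becomes 4
  out[seq_len(first_obs - 1)] <- lead_value  # leading NAs have nothing to carry forward
  out
}

v <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)
fill_cycle(v)
# [1] 4 4 4 4 1 1 1 2 2 2 3 3 3 4 4 1 1
```

Adding is.na(x) to na.locf(x), as in the answer above, is the compact form of the ifelse() line: is.na(x) contributes 1 exactly where the value was missing.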

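Here is one way with dplyr. The leading NAs have no earlier value to carry forward, so they stay NA in final and still have to be set to 4 separately:
library(dplyr)
library(zoo)
tibble(cycle4I = c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1)) %>%
  mutate(final =
    cycle4I %>%
      lag %>%
      na.locf(na.rm = FALSE) %>%
      `+`(1) %>%
      ifelse(is.na(cycle4I),
             ., cycle4I) )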

Related

Create new variable based on outcome of other variable in group - R

This is a similar/follow-up question to "R: How to code new variable based on grouped variable and conditioned on earlier row", but it is different because within donors there are potentially two match runs.
I have a data file with organ donors. I'm looking at lungs that are donated - there are two lungs.
If the lungs are split (L and R) and put up for donation, they are each attempted to match with recipients ("matchrun"). They go through eligible recipients until one matches ("sequence").
If the lung is matched to a recipient, it goes to them ("organ_placed").
If the lung doesn't match, it continues in the sequence and then just remains NA at the maximum sequence number.
I would like to create a new variable that has the outcome of the match run such that if one lung is placed and the other is not, it tells you that the lung was discarded. i.e. see case of Donor 2 in the data - the left lung is placed, but the right doesn't match.
In donor 3, the first match run doesn't match but the match run for the other lung does.
I figure it will be something like group_by(donorid, matchrun) but then how do you make a condition based on the match run?
library(tibble)
library(dplyr)
data <- tribble(
~donorid, ~matchrun, ~sequence, ~organ_placed,
2, 3, 1, NA,
2, 3, 2, NA,
2, 3, 3, "L",
2, 4, 1, NA,
2, 4, 2, NA,
2, 4, 3, NA,
3, 5, 1, NA,
3, 5, 1, NA,
3, 5, 1, NA,
3, 6, 1, NA,
3, 6, 2, NA,
3, 6, 3, "L"
)
desired_outcome <- tribble(
~donorid, ~matchrun, ~sequence, ~organ_placed, ~organ,
2, 3, 1, NA, NA,
2, 3, 2, NA, NA,
2, 3, 3, "L", "Left Single",
2, 4, 1, NA, NA,
2, 4, 2, NA, NA,
2, 4, 3, NA, "Right Discarded",
3, 5, 1, NA, NA,
3, 5, 1, NA, NA,
3, 5, 1, NA, "Right Discarded",
3, 6, 1, NA, NA,
3, 6, 2, NA, NA,
3, 6, 3, "L", "Left Single")
You can try this:
data %>%
group_by(donorid) %>%
mutate(temp = ifelse(n_distinct(organ_placed, na.rm = TRUE) == 1, unique(na.omit(organ_placed)), "B")) %>%
group_by(matchrun, .add = TRUE) %>%
mutate(organ = case_when(organ_placed == "L" ~ "Left Single",
organ_placed == "R" ~ "Right Single",
all(is.na(organ_placed)) & row_number() == max(sequence) & temp == "L" ~ "Right Discarded",
all(is.na(organ_placed)) & row_number() == max(sequence) & temp == "R" ~ "Left Discarded")) %>%
ungroup()
Output (from a run on data that also included a donor 1, whose two lungs were both placed):
donorid matchrun sequence organ_placed temp organ
1 1 1 1 NA B NA
2 1 1 2 NA B NA
3 1 1 3 L B Left Single
4 1 2 1 NA B NA
5 1 2 2 NA B NA
6 1 2 3 R B Right Single
7 2 3 1 NA L NA
8 2 3 2 NA L NA
9 2 3 3 L L Left Single
10 2 4 1 NA L NA
11 2 4 2 NA L NA
12 2 4 3 NA L Right Discarded
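As a hedged variant of the approach above, the sketch below marks the last row of each match run with n() instead of max(sequence), which matters for the posted data because donor 3's first match run has sequence 1, 1, 1. The helper column placed_side is a made-up name, and the sketch assumes at most one lung per donor goes unplaced.

```r
library(dplyr)
library(tibble)

data <- tribble(
  ~donorid, ~matchrun, ~sequence, ~organ_placed,
  2, 3, 1, NA,
  2, 3, 2, NA,
  2, 3, 3, "L",
  2, 4, 1, NA,
  2, 4, 2, NA,
  2, 4, 3, NA,
  3, 5, 1, NA,
  3, 5, 1, NA,
  3, 5, 1, NA,
  3, 6, 1, NA,
  3, 6, 2, NA,
  3, 6, 3, "L"
)

result <- data %>%
  group_by(donorid) %>%
  # side that was placed for this donor (NA if no lung was placed)
  mutate(placed_side = organ_placed[!is.na(organ_placed)][1]) %>%
  group_by(matchrun, .add = TRUE) %>%
  mutate(organ = case_when(
    organ_placed == "L" ~ "Left Single",
    organ_placed == "R" ~ "Right Single",
    # last row of a match run in which nothing was placed
    row_number() == n() & all(is.na(organ_placed)) & placed_side == "L" ~ "Right Discarded",
    row_number() == n() & all(is.na(organ_placed)) & placed_side == "R" ~ "Left Discarded"
  )) %>%
  ungroup() %>%
  select(-placed_side)

result$organ
```

On the posted data this should reproduce the organ column of desired_outcome: the outcome sits on the last row of each match run and every other row stays NA.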
Update: we have to add matchrun to the group. Removed prior solution:
data %>%
group_by(donorid, matchrun) %>%
mutate(outcome = case_when(organ_placed == "L" ~ "Left Single",
organ_placed == "R" ~ "Right Single",
organ_placed == "B" ~ "Bilateral",
(is.na(organ_placed) &
row_number() == max(row_number())) &
"L" %in% organ_placed ~ "Right Discarded",
(is.na(organ_placed) &
row_number() == max(row_number())) &
"R" %in% organ_placed ~ "Left Discarded",
TRUE ~ NA_character_))
Groups: donorid, matchrun [4]
donorid matchrun sequence organ_placed outcome
<dbl> <dbl> <dbl> <chr> <chr>
1 2 3 1 NA NA
2 2 3 2 NA NA
3 2 3 3 L Left Single
4 2 4 1 NA NA
5 2 4 2 NA NA
6 2 4 3 NA NA
7 3 5 1 NA NA
8 3 5 1 NA NA
9 3 5 1 NA NA
10 3 6 1 NA NA
11 3 6 2 NA NA
12 3 6 3 L Left Single
We can use
library(data.table)
library(stringr)
setDT(data)[, seq2 := rowid(donorid, matchrun) ]
data[, organ := str_replace_all(organ_placed,
setNames(c("Left Single", "Right Single"), c("L", "R")))]
data[seq2 == max(seq2),
organ := fcase(!is.na(organ), organ, default =
str_replace_all(setdiff(c("Left Single", "Right Single"), organ),
setNames(c("Left Discarded", "Right Discarded"),
c("Left Single", "Right Single")))), donorid
][, seq2 := NULL][]
-output
> data
donorid matchrun sequence organ_placed organ
1: 2 3 1 <NA> <NA>
2: 2 3 2 <NA> <NA>
3: 2 3 3 L Left Single
4: 2 4 1 <NA> <NA>
5: 2 4 2 <NA> <NA>
6: 2 4 3 <NA> Right Discarded
7: 3 5 1 <NA> <NA>
8: 3 5 1 <NA> <NA>
9: 3 5 1 <NA> Right Discarded
10: 3 6 1 <NA> <NA>
11: 3 6 2 <NA> <NA>
12: 3 6 3 L Left Single

Remove data.frame columns if all values in a certain range, pre-determined by a group, are NA

I would like to remove all data.frame columns in which ALL values are NA. Additionally, I want to remove those columns in which all values within a certain set of rows (defined by a grouping variable) are NA.
Example:
library(tidyverse)
df <- tribble(~group, ~id, ~x, ~y, ~z,
"1", 1, 1, 1, NA,
"1", 2, NA, 1, NA,
"2", 1, 3, 1, NA,
"2", 2, NA, 1, NA,
"3", 1, 5, 1, NA,
"3", 2, NA, 1, NA,
"4", 1, 7, NA, 1,
"4", 2, NA, NA, 1,
"5", 1, 9, NA, NA,
"5", 2, NA, NA, NA)
In this example, I would like to keep x and z, but remove y. That is because for the groups 4 and 5, its values are all NA. The other two columns also have a lot of missing values, but crucially have at least some values in groups 4 and 5. Ideally, I would like to do that with {dplyr}.
Code / Problem:
I'm kind of stuck with this code, which, for obvious reasons, doesn't work.
df %>%
mutate(group_new = ifelse(group=="4"|group=="5", 1, 0)) %>%
group_by(group_new) %>%
select_if(function(col) all(!is.na(col)))
There are several problems with this: I would like to avoid creating a new grouping variable, and select_if doesn't take the group_by() condition into account but rather looks for complete columns (of which there are none).
Any help with this is much appreciated!
library(dplyr)
df %>%
select(-where(~ all(is.na(.x[df$group %in% 4:5]))))
output
# A tibble: 10 × 4
group id x z
<chr> <dbl> <dbl> <dbl>
1 1 1 1 NA
2 1 2 NA NA
3 2 1 3 NA
4 2 2 NA NA
5 3 1 5 NA
6 3 2 NA NA
7 4 1 7 1
8 4 2 NA 1
9 5 1 9 NA
10 5 2 NA NA
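For comparison, the same column test can be sketched in base R. This is only an illustration of the logic, not a replacement for the dplyr answer; the names rows, keep and result are made up, and the data frame is rebuilt with data.frame() so the example runs without the tidyverse.

```r
# Same values as the example data, built without tribble()
df <- data.frame(
  group = rep(c("1", "2", "3", "4", "5"), each = 2),
  id    = rep(1:2, times = 5),
  x     = c(1, NA, 3, NA, 5, NA, 7, NA, 9, NA),
  y     = c(1, 1, 1, 1, 1, 1, NA, NA, NA, NA),
  z     = c(NA, NA, NA, NA, NA, NA, 1, 1, NA, NA)
)

# Rows that belong to the groups of interest
rows <- df$group %in% c("4", "5")

# Keep a column if it has at least one non-NA value within those rows
keep <- vapply(df, function(col) !all(is.na(col[rows])), logical(1))

result <- df[, keep, drop = FALSE]
names(result)
# [1] "group" "id"    "x"     "z"
```

A column that is NA everywhere is also NA within groups 4 and 5, so this single test covers both requirements from the question.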

Get last entry of a range with identical numbers in R, vectorized

I’ve got this data:
library(tibble)
tmp <- tribble(
~ranges, ~last,
0, NA,
1, NA,
1, NA,
1, NA,
1, NA,
2, NA,
2, NA,
2, NA,
3, NA,
3, NA
)
and I want to fill the last column only in the row where each value of the ranges column occurs for the last time. That is, it should look like this:
tribble(
~ranges, ~last,
0, 0,
1, NA,
1, NA,
1, NA,
1, 1,
2, NA,
2, NA,
2, 2,
3, NA,
3, 3
)
So far I came up with a row-wise approach:
for (r in seq.int(max(tmp$ranges))) {
print(r)
range <- which(tmp$ranges == r) |> max()
tmp$last[range] <- r
}
The main issue is that it is terribly slow. I am looking for a vectorized approach to this issue. Any creative solution out there?
Here's a dplyr solution:
library(dplyr)
tmp %>%
group_by(ranges) %>%
mutate(
last = case_when(row_number() == n() ~ ranges, TRUE ~ NA_real_)
) %>%
ungroup()
# # A tibble: 10 × 2
# ranges last
# <dbl> <dbl>
# 1 0 0
# 2 1 NA
# 3 1 NA
# 4 1 NA
# 5 1 1
# 6 2 NA
# 7 2 NA
# 8 2 2
# 9 3 NA
# 10 3 3
Or we could do something clever with base R for the same result. Here we calculate the difference of ranges to identify when the next row is different (i.e., the last of a group). We then stick a TRUE on the end so the last row is included. This assumes your data is already sorted by ranges.
tmp$last = ifelse(c(diff(tmp$ranges) != 0, TRUE), tmp$ranges, NA)
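If the data might not be sorted by ranges, one possible alternative is duplicated() with fromLast = TRUE, which flags every occurrence of a value except its last one. This is a sketch on a rebuilt copy of the example data; is_last is a made-up name.

```r
tmp <- data.frame(
  ranges = c(0, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  last   = NA_real_
)

# TRUE for the last occurrence of each value, scanning from the end
is_last <- !duplicated(tmp$ranges, fromLast = TRUE)

tmp$last <- ifelse(is_last, tmp$ranges, NA)
tmp$last
# [1]  0 NA NA NA  1 NA NA  2 NA  3
```

Unlike the diff() version, this marks the last occurrence of each value overall, so a value that reappears later in unsorted data is marked only once.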
Using replace:
library(dplyr)
tmp %>%
  group_by(ranges) %>%
  mutate(last = replace(last, n(), ranges[n()]))
Using ifelse:
library(dplyr)
tmp %>%
  group_by(ranges) %>%
  mutate(last = ifelse(row_number() == n(), ranges, NA))
Using tail:
library(dplyr)
tmp %>%
  group_by(ranges) %>%
  mutate(last = c(last[-n()], tail(ranges, 1)))
output
ranges last
<dbl> <dbl>
1 0 0
2 1 NA
3 1 NA
4 1 NA
5 1 1
6 2 NA
7 2 NA
8 2 2
9 3 NA
10 3 3

Changing NA value if matching a defined length

I have this kind of data:
daynight
[1] NA NA NA NA 2 1 NA NA
I want R to detect any run of at least x consecutive NAs and replace them with another value.
For example, if x = 3 and the replacement value is 3, I want this output:
daynight
[1] 3 3 3 3 2 1 NA NA
Would you have any ideas?
We can use rle
daynight <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
x <- 3
r <- 3
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
daynight
#[1] 3 3 3 3 2 1 NA NA
Taking another example :
daynight <- c(NA, NA, NA, 3,2,1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
daynight[with(rle(is.na(daynight)), rep(lengths >= x & values, lengths))] <- r
#[1] 3 3 3 3 2 1 NA NA 1 3 3 3 1 NA NA
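The one-liner above can be unpacked into a small function to make the steps visible. This is a sketch of the same rle() technique; the function name replace_na_runs and its argument names are made up for illustration.

```r
# Hypothetical helper: replace every run of at least `min_len`
# consecutive NAs in `v` with `value`.
replace_na_runs <- function(v, min_len, value) {
  runs <- rle(is.na(v))                    # runs of NA (TRUE) and non-NA (FALSE)
  long_na <- runs$values & runs$lengths >= min_len  # NA runs that are long enough
  v[rep(long_na, runs$lengths)] <- value   # expand run flags back to element level
  v
}

replace_na_runs(c(NA, NA, NA, NA, 2, 1, NA, NA), min_len = 3, value = 3)
# [1]  3  3  3  3  2  1 NA NA
```

For the first example, rle(is.na(daynight)) yields lengths 4, 2, 2 with values TRUE, FALSE, TRUE, so only the first run qualifies when x = 3.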
And here is another solution using the zoo package
library(zoo)
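replace_consecutive_NAs <- function(x, nrNAs = 3, replaceBy = nrNAs){
  # 1 where the value is NA, 0 otherwise (kept separate so x is not overwritten)
  isNA <- as.numeric(is.na(x))
  # 1 at every position where the next nrNAs values (including this one) are all NA
  starts <- rollapply(isNA, nrNAs, prod, fill = 0, align = "left")
  # a position qualifies if such a window starts there or up to nrNAs - 1 positions earlier
  covered <- rollapply(c(rep(0, nrNAs - 1), starts), nrNAs, max) > 0
  x[covered] <- replaceBy
  x
}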
x <- c(NA, NA, NA, NA ,2 ,1, NA, NA)
replace_consecutive_NAs(x, 3, 999)
[1] 999 999 999 999 2 1 NA NA

R Calculating difference between values in a column

How can I calculate pairwise difference between values in one column?
The calculation should start with the first two values, and should be continued with the next two values as it is done in column "desired_result" here:
data <- data.frame(data = c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA, NA, NA, 6, 1, 4, NA, 2))
Here's a one-liner:
data$desired_result[which(!is.na(data$data))[c(FALSE, TRUE)]] <-
rev(diff(rev(na.omit(data$data))))[c(TRUE, FALSE)]
where which(!is.na(data$data)) finds non-NA entries of data$data and then adding c(FALSE, TRUE) chooses only every second one. Also, na.omit(data$data) discards NA values, rev reverses this vector, diff takes differences, rev reverses the vector back to the correct order, and, lastly, since we don't want all the differences, I again choose every second with c(TRUE, FALSE).
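The pairing logic can also be made explicit by putting the non-NA values into a two-row matrix, so that each column holds one pair. This is a sketch of an alternative, not the method used above; idx and pairs are made-up names, and it assumes the number of non-NA values is even.

```r
data <- data.frame(data = c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA, NA, NA, 6, 1, 4, NA, 2))

idx <- which(!is.na(data$data))            # positions of the observed values
stopifnot(length(idx) %% 2 == 0)           # assumption: values come in complete pairs

pairs <- matrix(data$data[idx], nrow = 2)  # column j = (first, second) value of pair j

data$desired_result <- NA_real_
# write first minus second at the position of the second value of each pair
data$desired_result[idx[c(FALSE, TRUE)]] <- pairs[1, ] - pairs[2, ]

data$desired_result[idx[c(FALSE, TRUE)]]
# [1] 2 1 5 2
```

Here the pairs are (5, 3), (4, 3), (6, 1) and (4, 2), which gives the differences 2, 1, 5 and 2 in rows 5, 10, 15 and 18.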
Same as Julius but even shorter and faster:
data$desired_result[which(!is.na(data$data))[c(FALSE, TRUE)]] <-
diff(na.omit(data$data))[c(TRUE, FALSE)] * -1
Since diff() calculates x1 - x0, both rev() calls can be replaced by multiplying the result of diff() by -1.
Speed comparison using microbenchmark:
Unit: microseconds
expr min lq mean median uq max neval cld
julius 38.096 43.757 51.44687 46.143 50.8655 170511.851 1e+05 b
this 32.828 37.501 43.02233 39.548 43.4390 7405.489 1e+05 a
If you want a result exactly like the one you described, you can use:
library(dplyr)
library(zoo)
data <- data.frame(data = c(5, NA, NA, NA, 3, NA, NA, 4, NA, 3, NA,
                            NA, NA, 6, 1, 4, NA, 2)) %>% mutate(index = 1:n())
# keep only the rows that hold a value
ex <- data %>% filter(!is.na(data))
# for each consecutive pair of values: row index of the second value, and first minus second
df2 <- data.frame(index = rollapply(ex$index, width = 2, by = 2, FUN = last),
                  desired_results = rollapply(ex$data, width = 2, by = 2, FUN = function(x) -1 * diff(x)))
data2 <- left_join(data, df2, by = "index") %>% select(-index)
data2
data desired_results
1 5 NA
2 NA NA
3 NA NA
4 NA NA
5 3 2
6 NA NA
7 NA NA
8 4 NA
9 NA NA
10 3 1
11 NA NA
12 NA NA
13 NA NA
14 6 NA
15 1 5
16 4 NA
17 NA NA
18 2 2
but if you just want the difference then you can use:
rollapply(na.omit(data$data), by = 2, width = 2, diff)
beware that you'll get negative results: -2 -1 -5 -2
