Carry forward and backward conditionally in R

I have a set of variables that contain data about whether or not a person has ever had certain health conditions. For example, "have you ever had a heart attack?"
If they say "yes" at observation 2, then the answer is still yes at observations 3 and 4. But, it is not necessarily yes at observation 1. The heart attack could have occurred between observation 1 and 2.
If they say "no" at observation 2, then the answer is no at observation 1. But, it is not necessarily no at observations 3 or 4.
Here is a reproducible example:
library(tibble)

df <- tibble(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA)
)
df
id obs mi_ever
1 1 1 NA
2 1 2 0
3 1 3 1
4 1 4 NA
5 2 1 NA
6 2 2 0
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 1
11 3 3 NA
12 3 4 NA
It's trivial to carry my 0's (No's) backward or carry my 1's (Yes's) forward using zoo::na.locf. However, I'm not sure how to carry 0's backward and 1's forward. Ideally, I'd like the following result:
id obs mi_ever mi_ever_2
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
I've checked out the following posts, but none seem to cover exactly what I'm asking here.
Carry last Factor observation forward and backward in group of rows in R
Forward and backward fill data frame in R
making a "dropdown" function in R
Any help is appreciated.
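For reference, here is what zoo::na.locf alone does on person 3's responses (a quick sketch, assuming the zoo package is installed): forward fill handles carrying the 1 forward, but backward fill would wrongly carry the 1 backward as well.

```r
library(zoo)

x <- c(NA, 1, NA, NA)                       # person 3's mi_ever responses
na.locf(x, na.rm = FALSE)                   # NA 1 1 1  -- forward fill: correct for the 1's
na.locf(x, fromLast = TRUE, na.rm = FALSE)  # 1 1 NA NA -- backward fill: wrongly carries the 1 back
```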

Basically, I'm marking the items after the first 1 to become 1, and the ones before the last 0 to become 0.
ever <- function (x) min( which( x == 1))
NA_1 <- function(x) seq_along(x) > ever(x) #could have done in one function
# check to see if working
ave(df$mi_ever, df$id, FUN= function(x){ x[NA_1(x) ] <- 1; x})
[1] NA 0 1 1 NA 0 NA NA NA 1 1 1
not_yet <- function(x){ max( which( x==0)) }
NA_0 <- function(x) seq_along(x) < not_yet(x)
# make temporary version of 1-modified column
temp1 <- ave(df$mi_ever, df$id, FUN= function(x){ x[NA_1(x) ] <- 1; x})
df$ever2 <- ave(temp1, df$id, FUN= function(x){ x[NA_0(x) ] <- 0; x})
# then make final version; could have done it "in place" I suppose.
df
# A tibble: 12 x 4
id obs mi_ever ever2
<int> <int> <dbl> <dbl>
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
If you need to suppress the warnings (min and max warn when a group has no 1's or no 0's), that should be possible.
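One way to do that (my addition, not part of the original answer) is to wrap the ave() calls in suppressWarnings(), since the warnings come from min()/max() being applied to empty index vectors:

```r
# Continuing the snippet above (df, NA_1 and NA_0 as defined there)
temp1 <- suppressWarnings(
  ave(df$mi_ever, df$id, FUN = function(x){ x[NA_1(x)] <- 1; x })
)
df$ever2 <- suppressWarnings(
  ave(temp1, df$id, FUN = function(x){ x[NA_0(x)] <- 0; x })
)
```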

I took the answer from 42- above (thank you!), and tweaked it a little bit to further suit my needs. Specifically, I:
Took care of the warnings "no non-missing arguments to min; returning Inf" and "no non-missing arguments to max; returning -Inf".
Combined the separate functions into a single function (although the separate functions were extremely useful for learning).
Added an optional check_logic argument. When TRUE, the function will return 9's if a 0 comes after a 1. This represents a data error or logic flaw that warrants further investigation.
Added an example of using the function with data.table, and on multiple variables at once. This more accurately represents how I'm using the function in real life, and I thought it may be useful to others.
The function:
distribute_ever <- function(x, check_logic = TRUE, ...) {
if (check_logic) {
if (length(which(x == 1)) > 0 & length(which(x == 0)) > 0) {
if (min(which(x == 1)) < max(which(x == 0))) {
x <- 9 # Set x to 9 if zero comes after 1
}
}
}
ones <- which(x == 1) # Get indices for 1's
if (length(ones) > 0) { # Prevents warning
first_1_by_group <- min(ones) # Index of first 1 within group
x[seq_along(x) > first_1_by_group] <- 1 # Set x at subsequent indices to 1
}
zeros <- which(x == 0) # Get indices for 0's
if (length(zeros) > 0) { # Prevents warning
last_0_by_group <- max(zeros) # Index of last 0 within group
x[seq_along(x) < last_0_by_group] <- 0 # Set x at previous indices to 0
}
x
}
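Called on a plain vector (outside any grouping), the function behaves like this:

```r
distribute_ever(c(NA, 0, NA, 1, NA))  # 0 0 NA 1 1
distribute_ever(c(NA, 1, 0, NA))      # 9 -- a 0 after a 1 trips the logic check
```

Note that in the failing case the function returns a single 9, which gets recycled across the group when assigned.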
A new reproducible example with multiple "ever" variables and some cases with 0 after 1:
library(data.table)

dt <- data.table(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA),
diab_ever = c(0, NA, NA, 1, 1, NA, NA, 0, 0, NA, NA, NA)
)
Iterate over multiple variables quickly using data.table (with by group processing):
ever_vars <- c("mi_ever", "diab_ever")
dt[, paste0(ever_vars, "_2") := lapply(.SD, distribute_ever),
.SDcols = ever_vars,
by = id][]
Results:
id obs mi_ever diab_ever mi_ever_2 diab_ever_2
1: 1 1 NA 0 0 0
2: 1 2 0 NA 0 NA
3: 1 3 1 NA 1 NA
4: 1 4 NA 1 1 1
5: 2 1 NA 1 0 9
6: 2 2 0 NA 0 9
7: 2 3 NA NA NA 9
8: 2 4 NA 0 NA 9
9: 3 1 NA 0 NA 0
10: 3 2 1 NA 1 NA
11: 3 3 NA NA 1 NA
12: 3 4 NA NA 1 NA
For each input "ever" variable, we have:
Created a new variable with "_2" appended to the end of the input variable name. You could also edit "in place" as 42- pointed out, but I like being able to double check my data.
Zeroes are carried backward and ones are carried forward in time.
NA's after zeros and before ones (within id) are returned unchanged.
If there is a 0 (No, I've never had ...) after a 1 (Yes, I've had ...), as is the case with person 2's responses regarding diabetes, then the function returns 9's.
If we were to set check_logic to FALSE, then 1's would win out and replace 0's.
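For example (my own illustration, not from the original code):

```r
distribute_ever(c(NA, 1, 0, NA), check_logic = FALSE)  # NA 1 1 1 -- the 0 is overwritten
```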

Related

Ifelse across multiple columns matching on similar attributes

I need to create a binary variable called dum (perhaps using an ifelse statement), matching on the number in the column names:
ifelse f[number] %in% c(4:6) & l[number]==1, 1, else 0
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum is 1 if f1 contains a value among 4, 5, 6 AND l1 contains a value of 1, OR f2 contains a value among 4, 5, 6 AND l2 contains a value of 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mutate(dum = ifelse(f1 %in% c(4:6) & l1 == 1, 1,
       ifelse(f2 %in% c(4:6) & l2 == 1, 1,
       ifelse(f3 %in% c(4:6) & l3 == 1, 1,
       ifelse(f4 %in% c(4:6) & l4 == 1, 1, 0)))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
Here is one suggestion; again, the requirement is not entirely clear. Assuming you want a single dum column indicating whether, in any column, the value matches the number in that column's name:
library(dplyr)
library(readr)
mydata %>%
mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
TRUE ~ 0), .names = 'new_{col}')) %>%
mutate(sumNew = rowSums(.[9:16])) %>%
mutate(dum = ifelse(sumNew >=1, 1, 0)) %>%
select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across: loop across the 'f' columns applying the first condition; loop across the 'l' columns applying the second condition; join the two with & to return a logical matrix; get the row-wise sums (TRUE -> 1, FALSE -> 0); check whether that sum is greater than 0 (i.e. whether there are any TRUEs in the row); and coerce the logical to binary with + or as.integer.
library(dplyr)
mydata %>%
mutate(dum = +(rowSums(across(starts_with('f'), ~.x %in% 4:6) &
across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 &
y %in% 1, mydata[startsWith(names(mydata), "f")],
mydata[startsWith(names(mydata), "l")])))
Here's an approach that multiplies two mapply results together, with the f and l columns identified via grep, then checks rowSums > 0. If you set na.rm=FALSE you could get NAs in the respective rows.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
If the f* and l* columns aren't each consecutive, use sort(grep(., value = TRUE)) instead.

Lagging vector adding 1 while resetting to 0 when a condition is met

I have a sequence of treatments, one per day (binary), say:
trt <- c(0, 0, 1, 0, 0, 0, 1, 0, 0)
I want to create a vector, days_since, that:
Is NA up until the first treatment.
Is 0 where trt is 1
Counts the days since the last treatment
So, the output days_since should be:
days_since <- c(NA, NA, 0, 1, 2, 3, 0, 1, 2)
How would I do this in R? To get days_since, I basically need to lag by one element and add 1, but resetting every time the original vector (trt) is 1. If this is doable without a for-loop, that would be ideal, but not absolutely necessary.
Maybe you can try the code below
v <- cumsum(trt)
replace(ave(trt,v,FUN = seq_along)-1,v<1,NA)
which gives
[1] NA NA 0 1 2 3 0 1 2
Explanation
First, we apply cumsum over trt to group the treatments
> v <- cumsum(trt)
> v
[1] 0 0 1 1 1 1 2 2 2
Secondly, using ave helps to add sequential indices within each group
> ave(trt,v,FUN = seq_along)-1
[1] 0 1 0 1 2 3 0 1 2
Finally, since the value is NA before the first treatment, it means all the value before v == 1 appears should be replaced by NA. Thus we use replace, and the index logic follows v < 1
> replace(ave(trt,v,FUN = seq_along)-1,v<1,NA)
[1] NA NA 0 1 2 3 0 1 2
We can also use
(NA^!cummax(trt)) * sequence(table(cumsum(trt))) - 1
#[1] NA NA 0 1 2 3 0 1 2
Or with rowid from data.table
library(data.table)
(NA^!cummax(trt)) * rowid(cumsum(trt)) - 1
#[1] NA NA 0 1 2 3 0 1 2
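For comparison, here is the for-loop version the question hoped to avoid (my own sketch of the same logic):

```r
trt <- c(0, 0, 1, 0, 0, 0, 1, 0, 0)

days_since <- rep(NA_real_, length(trt))
counter <- NA                      # NA until the first treatment
for (i in seq_along(trt)) {
  if (trt[i] == 1) {
    counter <- 0                   # reset on a treatment day
  } else if (!is.na(counter)) {
    counter <- counter + 1         # count days since last treatment
  }
  days_since[i] <- counter
}
days_since
# [1] NA NA  0  1  2  3  0  1  2
```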

How to do group matching in R?

Suppose I have the data.frame below where treat == 1 means that the id received treatment and prob is the calculated probability that treat == 1.
set.seed(1)
df <- data.frame(id = 1:10, treat = sample(0:1, 10, replace = T))
df$prob <- ifelse(df$treat, rnorm(10, .8, .1), rnorm(10, .4, .4))
df
id treat prob
1 1 0 0.3820266
2 2 0 0.3935239
3 3 1 0.8738325
4 4 1 0.8575781
5 5 0 0.6375605
6 6 1 0.9511781
7 7 1 0.8389843
8 8 1 0.7378759
9 9 1 0.5785300
10 10 0 0.6479303
To minimize selection bias, I now wish to create pseudo treatment and control groups on the basis of the values of treat and prob:
When any id with treat == 1 is within 0.1 prob of any id with treat == 0, I want the value of group to be "treated".
When any id with treat == 0 is within 0.1 prob of any id with treat == 1, I want the value of group to be "control".
Below is an example of what I'd like the result to be.
df$group <- c(NA, NA, NA, NA, 'control', NA, NA, 'treated', 'treated', 'control')
df
id treat prob group
1 1 0 0.3820266 <NA>
2 2 0 0.3935239 <NA>
3 3 1 0.8738325 <NA>
4 4 1 0.8575781 <NA>
5 5 0 0.6375605 control
6 6 1 0.9511781 <NA>
7 7 1 0.8389843 <NA>
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
How would I go about doing this? In the example above, matching is done with replacement, but a solution without replacement would be welcome, too.
You can try
foo <- function(x){
TR <- range(x$prob[x$treat == 0])
CT <- range(x$prob[x$treat == 1])
tmp <- sapply(1:nrow(x), function(y, z){
if(z$treat[y] == 1){
ifelse(any(abs(z$prob[y] - TR) <= 0.1), "treated", "NA")
}else{
ifelse(any(abs(z$prob[y] - CT) <= 0.1), "control", "NA")
}}, x)
cbind(x, group = tmp)
}
foo(df)
id treat prob group
1 1 0 0.3820266 NA
2 2 0 0.3935239 NA
3 3 1 0.8738325 NA
4 4 1 0.8575781 NA
5 5 0 0.6375605 control
6 6 1 0.9511781 NA
7 7 1 0.8389843 NA
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
I think this problem is well suited for cut in base R. Here is how you can do it in a vectorized way:
f <- function(r) {
x <- cut(df[r,]$prob, breaks = c(df[!r,]$prob-0.1, df[!r,]$prob+0.1))
df[r,][!is.na(x),]$id
}
ones <- df$treat==1
df$group <- NA
df[df$id %in% f(ones),]$group <- "treated"
df[df$id %in% f(!ones),]$group <- "control"
> df
# id treat prob group
# 1 1 0 0.3820266 <NA>
# 2 2 0 0.3935239 <NA>
# 3 3 1 0.8738325 <NA>
# 4 4 1 0.8575781 <NA>
# 5 5 0 0.6375605 control
# 6 6 1 0.9511781 <NA>
# 7 7 1 0.8389843 <NA>
# 8 8 1 0.7378759 treated
# 9 9 1 0.5785300 treated
# 10 10 0 0.6479303 control
Perhaps not the most elegant but it seems to work for me:
df %>% group_by(id,treat) %>% mutate(group2 = ifelse(treat==1,
ifelse(any(abs(prob-df[df$treat==0,3])<0.1),"treated","NA"),
ifelse(any(abs(prob-df[df$treat==1,3])<0.1),"control","NA"))) # treat==0
Is this what you want?
#Base R:
apply(df[df$treat == 1, ],1, function(x){
ifelse(any(df[df$treat == 0, 'prob'] -.1 < x[3] & x[3] < df[df$treat == 0, 'prob'] +.1), 'treated', NA)
})
You can invert the $treat clause to get the control group and attach the variables to your df.

How to replace consecutive NAs with zero given a max gap parameter (in R)

I would like to replace all consecutive NA values per row with zero, but only if the number of consecutive NAs is less than a parameter maxgap.
This is very similar to the function zoo::na.locf:
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
zoo::na.locf(x, maxgap = 2, na.rm = FALSE)
gives
[1] NA 1 2 3 3 3 5 6 7 NA NA NA
There are two things different from my aim:
I would like to replace the leading NA too, and I would like to replace the 2 consecutive NAs with 0 rather than with the last non-NA value.
I would like to get
0 1 2 3 0 0 5 6 7 NA NA NA
How can I do this in R. Can I use functions from the tidyverse?
Let y be the result of the na.locf line. If y[i] is not NA but x[i] is NA, then it was replaced, so assign 0 to it. Also, if it is a leading NA (which occurs when the cumsum(...) term below is 0), replace it as well.
replace(y, (!is.na(y) & is.na(x)) | cumsum(!is.na(y)) == 0, 0)
## [1] 0 1 2 3 0 0 5 6 7 NA NA NA
We can use rle to do this
f1 <- function(vec){
rl <- rle(is.na(vec))
lst <- within.list(rl, {
i1 <- seq_along(values)==1
i2 <- seq_along(values) != length(values)
values[!((lengths==2 & values & i2)|
(values & i1))] <- FALSE
})
vec[inverse.rle(lst)] <- 0
vec
}
f1(x)
#[1] 0 1 2 3 0 0 5 6 7 NA NA NA
You could e.g. do this:
require(data.table)
require(dplyr)
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
my_replace <- function(x, n, maxgap){
if(is.na(x[1]) && n <= maxgap){
x <- 0
}
x
}
data.frame(x, y=x) %>%
group_by(data.table::rleid(x)) %>%
mutate(x = my_replace(x, n(), 2), y = my_replace(y, n(), 1)) %>%
ungroup() %>%
select(x,y)
This allows you to set the maxgap columnwise: for x 2 for y 1.
This results in:
# A tibble: 12 × 2
x y
<dbl> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 0 NA
6 0 NA
7 5 5
8 6 6
9 7 7
10 NA NA
11 NA NA
12 NA NA

counting particular number repetition in r

Here is small example:
A <- c(1,1,1,1, 0, 0, 0, 2,2,2)
B <- c(1,1,1,1, 0, 0, 0, 0,2,2)
C <- c(1,1,3,3, 0,0, 2,2,2, NA)
myd <- data.frame (A, B, C)
I need to apply a function, say prod (i.e. prod(myd$myvar, na.rm = TRUE)), but before applying it I need to count the number of 0's.
(1) If the number of zeros is equal to or less than 3, I need to replace them with NA:
myd$A[myd$A == 0] <- NA
(2) If the number of zeros is greater than 3, no replacement needs to be done:
myd$B[myd$B == 0] <- 0
How can I count the zeros and apply the conditions to get the result?
Edit:
In the above dataset, A and C meet condition 1, and B meets condition 2.
Are you looking for something like this?
f <- function(X) {
if(sum(X==0, na.rm=TRUE) <= 3) X[X==0] <- NA
X
}
data.frame(lapply(myd, f))
# A B C
# 1 1 1 1
# 2 1 1 1
# 3 1 1 3
# 4 1 1 3
# 5 NA 0 NA
# 6 NA 0 NA
# 7 NA 0 2
# 8 2 0 2
# 9 2 2 2
# 10 2 2 NA
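With the zeros handled, applying prod per column, as the question set out to do, becomes straightforward (my addition; na.rm = TRUE makes prod skip the NA's):

```r
out <- data.frame(lapply(myd, f))
sapply(out, prod, na.rm = TRUE)
#  A  B  C
#  8  0 72
```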