Replacing all values to 1 after a condition - r

My current data is like below,
df<-data.frame(id=c(1:5),t1=c(NA,1,0,0,0),t2=c(0,1,0,1,0),
t3=c(NA,0,0,0,1),t4=c(NA,NA,NA,0,0))
And the way I'm trying to restructure this is,
for each id, if there's a "1" in that row, all the 0s in the subsequent columns would change to 1. (but leaving the NA as an NA).
So for id#1, nothing would change since there's no 1 in that row, but for id#2, after 1 in the column t2, any 0s afterwards would be replaced by 1.
i.e., this is what I'm trying to get at the end:
final<-data.frame(id=c(1:5),t1=c(NA,1,0,0,0),t2=c(0,1,0,1,0),
t3=c(NA,1,0,1,1),t4=c(NA,NA,NA,1,1))
I've been trying different ways but nothing seems to work... I'd really appreciate any help!!!

In base R, we can apply cummax by row after temporarily replacing each NA with a value lower than anything in the data (here -999), and then set those positions back to NA:
df[-1] <- t(apply(replace(df[-1], is.na(df[-1]), -999), 1, cummax)) *
NA^(is.na(df[-1]))
df
# id t1 t2 t3 t4
#1 1 NA 0 NA NA
#2 2 1 1 1 NA
#3 3 0 0 0 NA
#4 4 0 1 1 1
#5 5 0 0 1 1
Or use rowCummaxs from matrixStats
library(matrixStats)
df[-1] <- rowCummaxs(as.matrix(replace(df[-1], is.na(df[-1]), -999))) *
NA^(is.na(df[-1]))
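To see why the NA values need a placeholder first, here is a minimal sketch on a single hypothetical row (the vector `row` is an illustration, not part of the question's data). cummax propagates an NA to every later position, so the NA is swapped for -Inf and restored afterwards through NA^is.na(row), which is NA where the value was missing and 1 elsewhere.

```r
# Hypothetical single row with a missing value in the middle
row <- c(1, 0, NA, 0)

# Plain cummax: everything from the NA onwards becomes NA
naive <- cummax(row)

# Swap NA for a value lower than any real entry, then take the running maximum
filled <- replace(row, is.na(row), -Inf)

# NA^TRUE is NA and NA^FALSE is 1, so this puts the NA back in its original place
fixed <- cummax(filled) * NA^is.na(row)

print(naive)  # 1 1 NA NA
print(fixed)  # 1 1 NA 1
```

Using -Inf instead of -999 is a design choice of this sketch; it avoids having to know the smallest value in the data.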

With tidyverse you can try:
library(tidyverse)
df %>%
pivot_longer(cols = starts_with("t"), names_to = "Time", values_to = "Value") %>%
group_by(id) %>%
mutate(Cummax = cummax(Value)) %>%
mutate(Value = replace(Value, Value == 0 & Cummax == 1, 1)) %>%
pivot_wider(id_cols = id, names_from = "Time", values_from = "Value")
Output
# A tibble: 5 x 5
# Groups: id [5]
id t1 t2 t3 t4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 NA 0 NA NA
2 2 1 1 1 NA
3 3 0 0 0 NA
4 4 0 1 1 1
5 5 0 0 1 1
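One caveat with a plain cummax: if an NA comes before the first 1 in a row, the running maximum is NA from that point on and the later 0s are left unchanged. That case does not occur in the question's data, so the sketch below uses a hypothetical vector `v`; it treats NA as 0 only when building the "has a 1 been seen yet" flag. The same expression could be used inside the mutate step in place of cummax(Value).

```r
# Hypothetical values for one id, with an NA before the first 1
v <- c(NA, 1, 0, 0)

# Plain cummax is NA everywhere here, so no 0 would be replaced
plain <- cummax(v)

# Treat NA as 0 only for the running maximum; v itself is left untouched
seen_one <- cummax(ifelse(is.na(v), 0, v)) == 1

# Replace a 0 only once a 1 has been seen; NA entries stay NA
result <- replace(v, v == 0 & seen_one, 1)

print(plain)   # NA NA NA NA
print(result)  # NA 1 1 1
```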

Another base R approach, applied row-wise with apply, is to find the column where the first 1 occurs and replace every 0 after it with 1.
df[-1] <- t(apply(df[-1], 1, function(x) {
  # position of the first 1 in the row; NA if the row has no 1
  a_id <- which(x == 1)[1]
  if (!is.na(a_id))
    replace(x, x == 0 & seq_along(x) > a_id, 1)
  else x
}))
df
# id t1 t2 t3 t4
#1 1 NA 0 NA NA
#2 2 1 1 1 NA
#3 3 0 0 0 NA
#4 4 0 1 1 1
#5 5 0 0 1 1

Related

Countdown dates in R

Suppose I have the following dataset:
id1 <- c(1,1,1,1,2,2,2,2,1,1,1,1)
dates <- c("a","a","a","a","b","b","b","b","c","c","c","c")
x <- c(NA,0,NA,NA,NA,NA,0,NA,NA,NA,NA,0)
df <- data.frame(id1,dates,x)
My objective is to have a new column that explicitly counts the sequence of observations around 0 for every combination of id1 and dates. This would yield the following outcome:
desired_result <- c(-1,0,1,2,-2,-1,0,1,-3,-2,-1,0)
Any help is appreciated.
library(dplyr)
df %>%
group_by(id1, dates) %>%
mutate(x = row_number() - which(x == 0))
id1 dates x
1 1 a -1
2 1 a 0
3 1 a 1
4 1 a 2
5 2 b -2
6 2 b -1
7 2 b 0
8 2 b 1
9 1 c -3
10 1 c -2
11 1 c -1
12 1 c 0
With dplyr 1.1.0 or later, the grouping can be given per operation with .by:
df %>%
  mutate(x = row_number() - which(x == 0), .by = c(id1, dates))
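For comparison, the same "row position minus position of the 0" idea can be sketched in base R with ave. This assumes each id1/dates group contains at least one 0; the `[1]` picks the first one if there are several. The column name `countdown` is only chosen for illustration.

```r
id1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1)
dates <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c")
x <- c(NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0)
df <- data.frame(id1, dates, x)

# Within each id1/dates group: position of each row minus position of the 0
df$countdown <- ave(df$x, df$id1, df$dates,
                    FUN = function(v) seq_along(v) - which(v == 0)[1])

print(df$countdown)  # -1 0 1 2 -2 -1 0 1 -3 -2 -1 0
```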

Find 2 out of 3 conditions per ID

I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column, "index", that is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
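You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)). Note that this counts every distinct code, so it answers the question only when all codes in the data are among the conditions.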
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
You can check the conditions with intersect() and test whether the resulting vector has at least the required length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))
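As a rough base R sketch of the same intersect-based count, here is a version using tapply. The data is hypothetical and deliberately includes a code ("D") that is not one of the conditions, so only id 2 reaches the threshold of 2.

```r
# Hypothetical data: id 1 has one listed condition plus an unrelated code "D"
df <- data.frame(id = c(1, 1, 2, 2), code = c("A", "D", "A", "B"))
conditions <- c("A", "B", "C")

# Number of distinct listed conditions present for each id
n_met <- tapply(df$code, df$id, function(v) length(intersect(v, conditions)))

# Map the per-id count back to every row and apply the threshold
df$index <- as.integer(n_met[as.character(df$id)] >= 2)

print(df$index)  # 0 0 1 1
```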

Replace dates of many columns with 1 and NA with 0

There are many columns here, and I need to replace the dates with 1 and NA with 0. I would like a dplyr solution. thank you.
df <- data.frame(
id = c(1,2,3),
diabetes = c("12-12-2007",NA,"2-12-2018"),
lipids = c(NA,NA,"12-12-2015"),
stringsAsFactors = FALSE
)
df %>% mutate(across(-id, ~ifelse(is.na(.), 0, 1)))
id diabetes lipids
1 1 1 0
2 2 0 0
3 3 1 1
In base R you can do:
df[-1] <- +(!is.na(df[-1]))
df
# id diabetes lipids
#1 1 1 0
#2 2 0 0
#3 3 1 1
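The unary plus in that one-liner is what turns TRUE/FALSE into 1/0. A small sketch on a single column's values, for illustration:

```r
# Values taken from the diabetes column in the question
diabetes <- c("12-12-2007", NA, "2-12-2018")

# TRUE where a date is present, FALSE where it is missing
flags <- !is.na(diabetes)

# Unary plus coerces the logical vector to integer, same as as.integer()
as_numbers <- +flags

print(as_numbers)  # 1 0 1
```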

Count occurrences of factors, comma separated, AND conditional? in R

I'm trying to do some complex calculations and part of the code requires that I parse a comma separated entry and count the number of values that are more than 0.
Example input data:
a <- c(0,0,3,0)
b <- c(4,4,0,1)
c <- c("3,4,3", "2,1", 0, "5,8")
x <- data.frame(a, b, c)
x
a b c
1 0 4 3,4,3
2 0 4 2,1
3 3 0 0
4 0 1 5,8
The column that I need to parse, c, is a factor; all other columns are numeric. The number of comma-separated values varies from row to row; in this example the count varies from 0 to 3.
The desired output would look like this:
x$c_occur <- c(3, 2, 0, 2)
x
a b c c_occur
1 0 4 3,4,3 3
2 0 4 2,1 2
3 3 0 0 0
4 0 1 5,8 2
Where c_occur lists the number of occurrences > 0 in the c column.
I was thinking something like this would work... but I can't figure it out.
library(dplyr)
x_desired <- x %>%
mutate(c_occur = count(strsplit(c, ","), > 0))
We can use str_count from stringr, with a pattern that matches each integer greater than 0:
library(stringr)
library(dplyr)
x %>%
mutate(c_occur = str_count(c, '[1-9]\\d*'))
# a b c c_occur
#1 0 4 3,4,3 3
#2 0 4 2,1 2
#3 3 0 0 0
#4 0 1 5,8 2
Alternatively, split 'c' with strsplit, loop over the resulting list, and sum the logical vector for each element to get the count:
library(purrr)
x %>%
mutate(c_occur = map_int(strsplit(as.character(c), ","),
~ sum(as.integer(.x) > 0)))
# a b c c_occur
#1 0 4 3,4,3 3
#2 0 4 2,1 2
#3 3 0 0 0
#4 0 1 5,8 2
Or we can separate the rows with separate_rows and do a group_by summarise
library(tidyr)
x %>%
mutate(rn = row_number()) %>%
separate_rows(c, convert = TRUE) %>%
group_by(rn) %>%
summarise(c_occur = sum(c >0)) %>%
select(-rn) %>%
bind_cols(x, .)
# A tibble: 4 x 4
# a b c c_occur
# <dbl> <dbl> <fct> <int>
#1 0 4 3,4,3 3
#2 0 4 2,1 2
#3 3 0 0 0
#4 0 1 5,8 2
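If you would rather avoid extra packages, the split-and-count idea can also be sketched in base R with strsplit and vapply. This assumes the column is character (wrap it in as.character() if it is a factor, as in the question) and that every piece parses as a number. The last element, "10,0", is a hypothetical extra value added to show a multi-digit number next to a zero.

```r
# Question's values plus one hypothetical extra entry ("10,0")
c_col <- c("3,4,3", "2,1", "0", "5,8", "10,0")

# Split each entry on commas, convert to numeric, count the values above 0
c_occur <- vapply(
  strsplit(c_col, ",", fixed = TRUE),
  function(parts) sum(as.numeric(parts) > 0),
  integer(1)
)

print(c_occur)  # 3 2 0 2 1
```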

Last observation carried forward conditional on value and columns

I have a longitudinal dataframe with a lot of missing values that looks like this.
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
cond = c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0)
var = c(1, NA , 2, 0,NA, NA, 3, NA,0, NA, 2, NA, 1,NA,NA)
df = data.frame(ID, date, cond,var)
I would like to carry forward the last observation based on two conditions:
1) when cond=0, it should carry forward the higher value of the variable of interest.
2) when cond=1, it should carry forward the lower value of the variable of interest.
Does anyone have an idea on how I could do this in an elegant way?
The final dataset should look like this
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
cond = c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0)
var = c(1, 1 , 2, 0, 0, NA, 3, 3, 0, 0,2,2,2,2,2)
final = data.frame(ID, date, cond,var)
So far I was able to carry forward the last observation, but I was unable to impose the conditions
library(zoo)
df <- df %>%
group_by(ID) %>%
mutate(var =
na.locf(var, na.rm = F))
any suggestion is welcomed
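This is a use case for accumulate2 from purrr:
library(dplyr)
library(purrr)
df %>%
  group_by(ID) %>%
  # z: value carried so far, x: next var, y: next cond
  mutate(d = unlist(accumulate2(var, cond[-1], function(z, x, y) if (y) min(z, x, na.rm = TRUE) else max(z, x, na.rm = TRUE))))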
# A tibble: 15 x 5
# Groups: ID [3]
ID date cond var d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 1 1
2 1 2 0 NA 1
3 1 3 0 2 2
4 1 4 1 0 0
5 1 5 0 NA 0
6 2 1 0 NA NA
7 2 2 0 3 3
8 2 3 0 NA 3
9 2 4 1 0 0
10 2 5 0 NA 0
11 3 1 0 2 2
12 3 2 0 NA 2
13 3 3 0 1 2
14 3 4 0 NA 2
15 3 5 0 NA 2
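To make the accumulation step explicit, here is a base R sketch of the same rule as a plain loop. The helper name carry_forward is hypothetical, and the handling of two consecutive missing values (leave NA) is an assumption of this sketch.

```r
ID   <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
cond <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
var  <- c(1, NA, 2, 0, NA, NA, 3, NA, 0, NA, 2, NA, 1, NA, NA)
df <- data.frame(ID, cond, var)

# Hypothetical helper: walk through one ID's rows, comparing the carried
# value with the current one (min when cond is 1, max when cond is 0)
carry_forward <- function(var, cond) {
  out <- var
  for (i in seq_along(var)[-1]) {
    pair <- c(out[i - 1], var[i])
    if (all(is.na(pair))) next  # nothing to carry yet, leave NA
    out[i] <- if (cond[i] == 1) min(pair, na.rm = TRUE) else max(pair, na.rm = TRUE)
  }
  out
}

# Apply the helper per ID and put the pieces back in the original row order
pieces <- lapply(split(df, df$ID), function(g) carry_forward(g$var, g$cond))
df$d <- unsplit(pieces, df$ID)

print(df$d)  # 1 1 2 0 0 NA 3 3 0 0 2 2 2 2 2
```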
If I understand correctly, this is what you are after:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
date = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
cond = c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0)
var = c(1, NA , 2, 0,NA, NA, 3, NA,0, NA, 2, NA, 1,NA,NA)
df = data.frame(ID, date, cond,var)
Using case_when you can do some conditional checks. I'm unsure whether you mean to return the minimum across the whole "ID" field, but this looks at the condition and then uses lag or lead to find a non-missing value. Note that it is not grouped by ID, so lag and lead can reach into the neighbouring ID's rows.
library(dplyr)
df %>%
mutate(var_imput = case_when(
cond == 0 & is.na(var)~lag(x = var, n = 1, default = NA),
cond == 1 & is.na(var)~lead(x = var, n = 1, default = NA),
TRUE~var
))
Which yields:
ID date cond var var_imput
1 1 1 0 1 1
2 1 2 0 NA 1
3 1 3 0 2 2
4 1 4 1 0 0
5 1 5 0 NA 0
6 2 1 0 NA NA
7 2 2 0 3 3
8 2 3 0 NA 3
9 2 4 1 0 0
10 2 5 0 NA 0
11 3 1 0 2 2
12 3 2 0 NA 2
13 3 3 0 1 1
14 3 4 0 NA 1
15 3 5 0 NA NA
If you want to group by ID then you could generate an impute table by ID, then join it with the original table like this:
# Generate imputation table
input_table <- df %>%
group_by(ID) %>%
summarise(min = min(var, na.rm = T),
max = max(var, na.rm = T)) %>%
gather(cond, value, -ID) %>%
  # per the question: cond = 1 uses the lower value, cond = 0 the higher one
  mutate(cond = ifelse(cond == "min", 1, 0))
# Join and impute missing
df %>%
left_join(input_table,by = c("ID", "cond")) %>%
mutate(var_imput = ifelse(is.na(var), value, var))
