Keep rows within a group until condition is met in R

I am trying to filter the data within a group until a condition is met (in this case, until Status is "Completed") and drop the rest of the rows within the group. I've managed to come up with the ranking solution below, but I've run into a few issues when applying it to my real data: the function would sometimes not keep the last row (the one with the max rank). Is there a more elegant solution to this?
The code I've used:
require(dplyr)

time <- seq(as.Date('2017/01/01'), as.Date('2017/01/15'), by = "day")
set.seed(42); status <- sample(c("Completed", "On hold", "Active"), 15, replace = TRUE)
ID <- c(rep(1, 5), rep(2, 5), rep(3, 5))
DF <- data.frame(Time = time,
                 Status = status,
                 ID = ID)

DF <- DF %>% group_by(ID) %>% mutate(ID_Rank = row_number())
DF$ID_Rank[DF$Status == "Completed"] <- max(DF$ID_Rank) + 1
DF2 <- DF %>% group_by(ID) %>% filter(row_number() <= which.max(ID_Rank))
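For what it's worth, a more direct approach (a sketch, not from the original post) keeps each group's rows up to and including the first "Completed" row by taking a cumulative sum over a lagged condition; groups with no "Completed" row are kept in full:
require(dplyr)

DF2 <- DF %>%
  group_by(ID) %>%
  # lag() shifts the condition down one row, so the first "Completed" row
  # itself is still kept; cumsum() > 0 marks everything after it
  filter(cumsum(lag(Status == "Completed", default = FALSE)) == 0) %>%
  ungroup()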

Related

Using a loop to create columns based on two data frames

I have a situation where I think a loop would be appropriate to avoid repeating chunks of code.
I have two data frames which look like the following:
patid <- seq(1, 10)
date_of_session <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
date_of_referral <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 10)
df1 <- data.frame(patid, date_of_session, date_of_referral)

patid1 <- sample(seq(1, 10), 50, replace = TRUE)
eventdate <- sample(seq(as.Date("2010-01-01"), as.Date("2020-01-01"), by = "day"), 50)
comorbidity <- sample(c("hypertension", "stroke", "AF"), 50, replace = TRUE)
df2 <- data.frame(patid1, eventdate, comorbidity)
I need to repeat the following code for each comorbidity in df2. For each patient, it generates a binary (1/0) column indicating whether the earliest "eventdate" (diagnosis) came before "date_of_session", or before "date_of_referral" if "date_of_session" is NA.
library(tidyr)  # replace_na() comes from tidyr

df_comorb <- df2 %>%
  filter(comorbidity == "hypertension") %>%
  group_by(patid1) %>%
  filter(eventdate == min(eventdate)) %>%
  ungroup()

df1 <- left_join(df1, df_comorb, by = c("patid" = "patid1"))
df1 <- df1 %>%
  mutate(hypertension_baseline = ifelse(eventdate < date_of_session |
                                          eventdate < date_of_referral, 1, 0)) %>%
  replace_na(list(hypertension_baseline = 0)) %>%
  select(-eventdate)
I'd like to avoid repeating the code for each of the 27 comorbid conditions in the full dataset. I figured a loop would be the best way to repeat this for each comorbidity but I don't know how to approach writing one for this problem.
Any help would be appreciated.
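One way to write that loop (a sketch, assuming the sample data above; the *_baseline column names are illustrative) is to iterate over the distinct comorbidity values, building the earliest diagnosis date per patient and the corresponding flag column on each pass:
library(dplyr)
library(tidyr)

for (cm in unique(df2$comorbidity)) {
  col <- paste0(cm, "_baseline")
  # earliest diagnosis date per patient for this comorbidity
  earliest <- df2 %>%
    filter(comorbidity == cm) %>%
    group_by(patid1) %>%
    summarise(eventdate = min(eventdate), .groups = "drop")
  df1 <- df1 %>%
    left_join(earliest, by = c("patid" = "patid1")) %>%
    # same OR logic as the single-comorbidity version above
    mutate(!!col := ifelse(eventdate < date_of_session |
                             eventdate < date_of_referral, 1, 0)) %>%
    replace_na(setNames(list(0), col)) %>%
    select(-eventdate)
}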

Is there a way to identify multiple index points in a dataset and isolate a time window around each one in R?

I have a dataset of patients who were under observation for 72 hours. Patients' oxygen levels were measured every 4 seconds, but some observations had to be removed due to issues with the accuracy of the data. As a result, patients have varying numbers of observations.
While patients were observed, they underwent various interventions. The objective of my analysis is to assess whether interventions affected the patient's oxygen levels or not. Therefore, I am doing a comparison of the oxygen levels pre-intervention, during intervention and post-intervention.
While the actual analysis isn't too difficult, I am having a hard time subsetting the data. For example, I only want the observations from 300 seconds before the start of an intervention to 300 seconds after it. I also have to take into account that an individual may have multiple interventions over the observation period, and that there are multiple subjects.
I have provided some sample code below to generate a dataset but please let me know if I'm missing anything.
id <- rep(c(1, 2, 3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
                  rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
                  rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
  group_by(id) %>%
  mutate(time = row_number() * 4)
So far I have tried the method below, but it only isolates the single observation 5 minutes before the start of an intervention and the single observation 5 minutes after it, not all of the observations in between those points:
data <- final %>%
  filter(intervention == 1) %>%
  mutate(five_mins_after = time + 300,    # 5 mins after intervention
         five_mins_before = time - 300) %>%  # 5 mins before intervention
  filter(id == 1)
data2 <- final %>%
  filter(intervention == 0,
         id == 1)
data_after <- data %>%
  dplyr::select(five_mins_after)
data_before <- data %>%
  dplyr::select(five_mins_before)
data3 <- merge(data2, data_after, by.x = "time", by.y = "five_mins_after")
data4 <- merge(data2, data_before, by.x = "time", by.y = "five_mins_before")
final <- final %>%
  dplyr::bind_rows(data3) %>%
  dplyr::bind_rows(data4)
Please let me know if you need any additional information and thanks for your time!
PS: Apologies if I missed anything, first time asking for help here
Here is the answer. Although long, it worked fine for gathering the observations from 300 seconds before to 300 seconds after the beginning of an intervention.
Let me know if you'd like further explanation or if I have misunderstood anything.
library(magrittr)
library(tidyverse)
### Sample code
id <- rep(c(1, 2, 3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
                  rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
                  rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
  group_by(id) %>%
  mutate(time = row_number() * 4)

### Start of data processing to get wanted observations
# Order it by id and time
final %<>% arrange(id, time)

# Loop over the unique ids
obs_to_keep <- list()
for (i in unique(final$id)) {
  # Get the times with no intervention
  time_zero_intervention <- final %>%
    filter(id == i & intervention == 0) %>%
    select(time)
  # All times 4 seconds after a zero-intervention observation
  time_plus_4 <- time_zero_intervention$time + 4
  # Keep those times at which intervention == 1, i.e. where an intervention starts
  starts_of_interventions <- final %>%
    filter(id == i & time %in% time_plus_4) %>%
    filter(intervention == 1)
  # Loop over each time where an intervention starts
  all_times <- list()
  for (n in seq_along(starts_of_interventions$time)) {
    # Gather 300 secs prior and post
    time_300_before <- starts_of_interventions$time[n] - 300
    time_300_after <- starts_of_interventions$time[n] + 300
    # Filter for observations in this interval
    all_times[[n]] <- final %>%
      filter(id == i) %>%
      filter(time >= time_300_before & time <= time_300_after)
  }
  if (length(all_times) == 1) {
    obs_to_keep[[i]] <- as.data.frame(all_times)
  } else {
    obs_to_keep[[i]] <- do.call(rbind, all_times)
  }
}

# Make a data frame from the list
df <- do.call(rbind, obs_to_keep)
# Order it by id and time
df %<>% arrange(id, time)
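A more compact alternative (a sketch, not part of the answer above; note it also treats an intervention at the very first observation of an id as a start, which the loop above does not):
library(dplyr)

# Identify intervention starts: a 1 whose previous observation
# (within the same id) was a 0
starts <- final %>%
  group_by(id) %>%
  filter(intervention == 1, lag(intervention, default = 0) == 0) %>%
  ungroup() %>%
  select(id, start_time = time)

# Keep every observation within 300 seconds of any start; the join is
# deliberately one-to-many, and distinct() drops rows covered twice by
# overlapping windows
windows <- final %>%
  ungroup() %>%
  inner_join(starts, by = "id") %>%
  filter(time >= start_time - 300, time <= start_time + 300) %>%
  distinct(id, time, .keep_all = TRUE) %>%
  select(-start_time)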

Counting the number of occurrences of the current pair of two IDs within a specific past time window in R

Purpose
I would like to count the number of past occurrences of each pair of two IDs within a time frame in R.
Specifically, in the example dataset below, I have 10 people who work in the same company. Two workers work as a pair, and a pair can work together zero or more times on a given date, in different rooms. For each observed date, I would like to count how many times the pair previously worked together in the same room within the past 6 months (i.e., with the data ordered from earliest to latest date, how many times did a pair appear in the room in the 6 months before the current date?).
Current Progress
Also, after some searching I tried two methods to calculate this, and found that method 1 (ddply) generates the desired output, while method 2 (data.table) generates output that is incorrect but similar to the desired one. Because method 2 is much faster on the large number of observations in my original dataset, I would also like to know how to correct my method 2 code.
Comparison code is provided at the end so that you can easily compare the two outputs.
I appreciate your help.
Dataset
library(tidyverse)
library(tibble)
library(lubridate)  # for %m-%
rename <- dplyr::rename
select <- dplyr::select

set.seed(10000)
room <- sample(1:5, 1000, replace = TRUE)
set.seed(10001)
agent <- sample(1:10, 1000, replace = TRUE)
set.seed(10000)
partner <- sample(1:10, 1000, replace = TRUE)
set.seed(10000)
date <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 1000, replace = TRUE)

df <-
  data.frame(room, agent, partner, date) %>%
  rowid_to_column %>%
  rename(
    aid = agent,
    pid = partner,
    o3.room = room,
    o4.in = date,
    oid = rowid
  ) %>%
  filter(aid != pid) %>%
  arrange(o3.room, aid, pid, o4.in) %>%
  mutate(cases = 1) %>%                      # for cumsum in t1
  mutate(o4.in_6mos = o4.in %m-% months(6))  # for t2
Method 1 - ddply
t1 <-
  df %>%
  plyr::ddply(c('aid', 'pid', 'o3.room'), function(i){
    i %>%
      arrange(aid, pid, o3.room, o4.in) %>%
      filter(o4.in > o4.in %m-% months(6)) %>%
      mutate(j1.room = cumsum(cases) - 1)
  }, .progress = 'text') %>%
  select(oid, o4.in, o3.room, aid, pid, j1.room) %>%
  arrange(o3.room, aid, pid, o4.in)
Method 2 - data.table
I modified an answer from a Stack Overflow post.
library(data.table)

t2 <-
  df %>%
  select(oid, o3.room, o4.in) %>%
  cbind(
    setDT(df)[df, .(j1.room = .N),
              on = .(o3.room, aid, pid, o4.in < o4.in, o4.in > o4.in_6mos),
              by = .EACHI] %>%
      select(aid, pid, j1.room)
  ) %>%
  arrange(o3.room, aid, pid, o4.in)
Comparison
t_compare <-
  t1 %>%
  select(-o4.in) %>%
  rename(j1.room1 = j1.room) %>%
  left_join(
    t2 %>% rename(j1.room2 = j1.room),
    by = c('o3.room', 'aid', 'pid', 'oid')
  ) %>%
  arrange(o3.room, aid, pid, o4.in) %>%
  mutate(j3.room = ifelse(j1.room1 != j1.room2, 'non-match', '-')) %>%
  mutate(j2.room = ifelse(j1.room1 != j1.room2, '0', '1'))
To do the same steps with data.table, you could do, for example:
# used a different seed for `partner` to generate `df`
library(data.table)
library(lubridate)

ks <- c('aid', 'pid', 'o3.room')
DT <- data.table(df, key = ks)[
  o4.in > o4.in %m-% months(6)][
  , j1.room := cumsum(cases) - 1, by = ks][
  , .(oid, o4.in, o3.room, aid, pid, j1.room)]
setorder(DT, o3.room, aid, pid, o4.in)[]

# check if you get the same result:
identical(DT, as.data.table(t1))

Is there a faster way than applying 'ddply' to aggregate columns by groups with a large dataset?

Purpose
I am trying to check whether a pair of values in two columns appears in the previous event, and to aggregate the resulting dummy variable by groups.
Specifically, I have an event id (i.e., oid) and dyad-level observations associated with each event: agent (i.e., aid) and partner (i.e., pid). The events are sorted by the time at which they occur (i.e., o4.in).
(1) I made a dummy variable indicating whether a pair of agent and partner appears together in the previous event.
(2) I then used ddply to aggregate the dummy variable by groups, as specified in the example below.
I find that ddply and the lag function take a very long time on a large dataset, and I am wondering if there is a faster way to achieve these tasks.
Dataset
library(tidyverse)
library(tibble)
rename <- dplyr::rename
select <- dplyr::select

set.seed(10001)
cases <- sample(1:5, 1000, replace = TRUE)
set.seed(10002)
agent <- sample(1:20, 1000, replace = TRUE)
set.seed(10003)
partner <- sample(1:20, 1000, replace = TRUE)
set.seed(123)
n <- 1000  # no of random datetimes needed
minDate <- as.POSIXct("1999-01-01")
maxDate <- as.POSIXct("2000-01-01")
epoch <- "1970-01-01"
timestamps <-
  as.POSIXct(pmax(runif(n, minDate, maxDate), runif(n, minDate, maxDate)), origin = epoch)

df <-
  data.frame(cases, agent, partner, timestamps) %>%
  rename(
    aid = agent,
    pid = partner,
    oid = cases,
    o4.in = timestamps
  ) %>%
  filter(aid != pid)
Current Methods
# creating dummy variable
d <-
  df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(
    oid.lag.a = lag(oid)
  ) %>%
  ungroup %>%
  group_by(pid) %>%
  mutate(
    oid.lag.p = lag(oid)
  ) %>%
  ungroup %>%
  mutate(
    j2.consecutive = ifelse(oid.lag.a == oid.lag.p, 1, 0),
    j2.consecutive = ifelse(is.na(j2.consecutive), 0, j2.consecutive)
  ) %>%
  select(-oid.lag.a, -oid.lag.p)

# aggregating the dummy variable by groups
t <-
  d %>%
  ungroup %>%
  plyr::ddply(c('oid', 'aid'), function(i){
    i %>%
      mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n() - 1))
  }, .progress = 'text') %>%
  arrange(oid, pid) %>%
  plyr::ddply(c('oid', 'pid'), function(i){
    i %>%
      mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive)/(n() - 1))
  }, .progress = 'text')
Update for Future Readers
Task (1) is achieved by the answer from @akrun below.
The solution to task (2) is answered by @akrun in a separate post: A faster way than applying 'ddply' to aggregate a variable by a function by groups
Special thanks to @akrun!!
We can use data.table methods to make it faster
library(data.table)

df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
    ][, oid.lag.p := shift(oid), by = pid]
df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]
Also, note that some steps in the OP's code are unnecessary, e.g. using ifelse to convert a logical to binary: that can be done with as.integer or coercion with +. The second ifelse can likewise be removed by using coalesce:
library(dplyr)

out <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(
    oid.lag.a = lag(oid)
  ) %>%
  group_by(pid) %>%
  mutate(
    oid.lag.p = lag(oid)
  ) %>%
  ungroup %>%
  mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0))
Checking that the dplyr and data.table outputs match:
all(out$j2.consecutive == df3$j2.consecutive)
[1] TRUE
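For task (2), the usual rewrite (a sketch of the general approach, an assumption rather than the content of the linked post) replaces each ddply call with a grouped mutate, which avoids the per-group splitting that makes ddply slow:
library(dplyr)

t_fast <- d %>%
  group_by(oid, aid) %>%
  mutate(aj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1)) %>%
  group_by(oid, pid) %>%
  mutate(apj1.consecutive = (sum(j2.consecutive) - j2.consecutive) / (n() - 1)) %>%
  ungroup()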

How to restrict full_join() duplicates in R?

I am a novice R programmer. Below is the data frame I am using.
I am currently running into a filtering problem with full_join() from the tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c('M', 'F'), 4),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10))
df
I am doing the following to perform the full_join():
df <- df %>% mutate(k = 1)
df <- df %>%
  full_join(df, by = "k")
I then restrict the results to obtain the combinations of points for the same participant across the two trials:
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on the columns you are considering, use the duplicated() function. The first line below weeds out duplicates based on the first 5 columns; the second weeds them out based on columns 7 through 11.
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
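An alternative that doesn't depend on column positions (a sketch, not from the answer above; the pair key is illustrative) builds an order-independent key from the two trial values, then keeps one row per key:
library(dplyr)

df2_dedup <- df2 %>%
  # pmin/pmax make the key identical for the (0, 1) and (1, 0) orderings
  mutate(pair = paste(participant.x,
                      pmin(trial.x, trial.y),
                      pmax(trial.x, trial.y), sep = "_")) %>%
  distinct(pair, .keep_all = TRUE) %>%
  select(-pair)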
