Extract data from table with condition - r

I have data. Here's an example:
A tibble: 1,296 x 4
id treatmentstart protocoltype PDL1_date
<dbl> <chr> <chr> <chr>
1 1111 05/11/2020 Chemoradiation 05/03/2020
2 22222 03/03/2021 Chemo plus PD-1 plus CTLA-4 01/03/2020
3 333333 08/04/2018 Anti-VEGF plus Chemo NA
4 444444 07/06/2019 Chemoradiation 03/08/2018
5 555555 09/12/2020 Chemo plus PDl-1 07/11/2020
6 666666 05/06/2018 PD-1 08/02/2017
7 666666 07/07/2018 Chemotherapy 08/02/2017
8 777777 07/05/2019 Chemotherapy 06/03/2020
9 999999 08/08/2018 Chemoradiation 08/05/2020
10 999999 12/07/2017 PDL-1 08/05/2020
As you can see, some of the IDs are repeated, but have different treatments (type of protocol)
I need to extract the IDs that meet the following conditions:
Test date, earlier treatment start date, and treatment type all that include "PD1" or "PDL1", if the ID has multiple treatments then I need to compare treatment dates and choose the earliest treatment date and compare with the test date, if the test earlier then it fits, if not then not.
In conclusion: only those who have a test date before a certain type of treatment ("PD1" or "PDL1") and have not received any other treatment before the test date should be selected. Here is an example of what should come up:
A tibble: 1,296 x 4
id treatmentstart protocoltype PDL1_date
<dbl> <chr> <chr> <chr>
1 22222 03/03/2021 Chemo plus PD-1 plus CTLA-4 01/03/2020
2 555555 09/12/2020 Chemo plus PDl-1 07/11/2020
6 666666 05/06/2018 PD-1 08/02/2017
So 1111,44444,77777 excluded by treatment condition(not received any PD1/PDL1), 333333 no PDL1_date, 99999 received PD-1 and it's before PDL1date, but received other treatment before PDL1date.
I have tried dplyr filter (PDL1_date<treatmentstart), but I am stuck on comparing ID with same ID.
Please help.

You have to share a reproducible data to help you, but maybe this could help you:
data %>%
filter(grepl("PD-1|PD[Ll]-1",protocoltype)) %>%
group_by(id) %>%
filter(treatmentstart == min(treatmentstart)) %>% ungroup()

Related

Extract tibble_df and text message from activitylog object

I have the code below
library(bupar)
library(daqapo)
hospital<-hospital
hospital %>%
rename(start = start_ts,
complete = complete_ts) -> hospital
hospital %>%
convert_timestamps(c("start","complete"), format = dmy_hms) -> hospital
hospital %>%
activitylog(case_id = "patient_visit_nr",
activity_id = "activity",
resource_id = "originator",
timestamps = c("start", "complete")) -> hospital
hospital %>%
detect_time_anomalies()
which gives
*** OUTPUT ***
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
# A tibble: 3 × 3
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
Anomalies are found in the following rows:
# Log of 10 events consisting of:
3 traces
3 cases
5 instances of 3 activities
5 resources
Events occurred from 2017-11-21 11:22:16 until 2017-11-21 19:00:00
# Variables were mapped as follows:
Case identifier: patient_visit_nr
Activity identifier: activity
Resource identifier: originator
Timestamps: start, complete
# A tibble: 5 × 10
patient_visit_nr activity originator start complete triagecode specialization .order durat…¹ type
<dbl> <chr> <chr> <dttm> <dttm> <dbl> <chr> <int> <dbl> <chr>
1 518 Registration Clerk 12 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 1 -23 nega…
2 518 Registration Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 2 -23 nega…
3 518 Registration Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 3 -23 nega…
4 520 Trage Nurse 17 2017-11-21 13:43:16 2017-11-21 13:39:00 5 URG 4 -4.27 nega…
5 528 Clinical exam Doctor 1 2017-11-21 19:00:00 2017-11-21 19:00:00 3 TRAU 5 0 zero…
# … with abbreviated variable name ¹​duration
from this output I would like to extract the text message
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
and also in another object the tibble
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
The string you are trying to obtain is a message, and though it is possible to capture a message, it's not that straightforward. You can easily generate it by emulating a couple of lines within the function though.
If you store the result of detect_time_anomalies:
anomalies <- hospital %>% detect_time_anomalies()
Then you can generate the message like this:
paste0("For ", nrow(anomalies), " rows in the activity log (",
round(nrow(anomalies)/nrow(hospital) * 100, 2),
"%), an anomaly is detected.")
#> [1] "For 5 rows in the activity log (9.43%), an anomaly is detected."
Similarly, you can obtain the output table like this:
anomalies %>%
group_by(activity, type) %>%
summarize(n = n()) %>%
arrange(desc(n))
#> # A tibble: 3 x 3
#> activity type n
#> <chr> <chr> <int>
#> 1 Registration negative duration 3
#> 2 Clinical exam zero duration 1
#> 3 Trage negative duration 1
Created on 2022-12-13 with reprex v2.0.2

Find treatment course breaks longer than a year based on dates

I have a dataframe with ID and treatment dates like this as below
ID Dates
1 01/2/2012
1 02/8/2012
1 03/8/2012
1 04/5/2013
1 05/5/2013
2 01/2/2012
2 03/5/2013
2 04/6/2013
I need to find for each ID, if there is a treatment date break for more than a year. If yes, then I need to break them into two courses, and list the start & end date. So after executing R codes, it will look like below:
ID Course1StarteDate Course1EndDate Break1to2(Yr) Course2StartDate Course2EndDate
1 01/2/2012 03/8/2012 1.075 04/5/2013 05/5/2013
2 01/2/2012 01/2/2012 1.173 03/5/2013 04/6/2013
The dataframe I have includes hundreds of IDs, and I don't know how many courses there will be. Is there an efficient way of using R to solve this? Thanks in advance!
If d is your data, you can identify when the difference between a row's date and the prior row's date exceeds 365 (or perhaps 365.25), and, then use cumsum to generate the distinct treatment courses. Finally add a column that estimates the duration of the "break" between courses.
as_tibble(d) %>%
group_by(ID) %>%
mutate(trt=as.numeric(Dates-lag(Dates)),
trt=cumsum(if_else(is.na(trt),0,trt)>365)+1) %>%
group_by(ID,trt) %>%
summarize(StartDate = min(Dates),
EndDate = max(Dates),.groups = "drop_last") %>%
mutate(Break:=as.numeric(lead(StartDate) - EndDate)/365)
Output:
ID trt StartDate EndDate Break
<dbl> <dbl> <date> <date> <dbl>
1 1 1 2012-01-02 2012-03-08 1.08
2 1 2 2013-04-05 2013-05-05 NA
3 2 1 2012-01-02 2012-01-02 1.17
4 2 2 2013-03-05 2013-04-06 NA
I would suggest keeping in this long format, rather than swinging to wide format as you have in your example, especially with hundreds of IDs, all with potentially different numbers of courses. The long format is almost always better.
However, if you really want this, you can continue the pipeline from above, like this:
ungroup %>%
pivot_wider(id_cols =ID,
names_from = trt,
values_from = c(StartDate:Break),
names_glue = "Course{trt}_{.value}",
names_vary = "slowest")
to produce this "wide" format:
ID Course1_StartDate Course1_EndDate Course1_Break Course2_StartDate Course2_EndDate Course2_Break
<dbl> <date> <date> <dbl> <date> <date> <dbl>
1 1 2012-01-02 2012-03-08 1.08 2013-04-05 2013-05-05 NA
2 2 2012-01-02 2012-01-02 1.17 2013-03-05 2013-04-06 NA

R Filter by sequence of events

I have an event log data. For reproducible example, let's use the data from eventdataR
eventdataR::patients
## look at patient 1 sequence
eventdataR::patients %>% dplyr::filter(patient == '1')
# A tibble: 12 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 1 r1 1 start 2017-01-02 11:41:53 1
2 Triage and A~ 1 r2 501 start 2017-01-02 12:40:20 2
3 Blood test 1 r3 1001 start 2017-01-05 08:59:04 3
4 MRI SCAN 1 r4 1238 start 2017-01-05 21:37:12 4
5 Discuss Resu~ 1 r6 1735 start 2017-01-07 07:57:49 5
6 Check-out 1 r7 2230 start 2017-01-09 17:09:43 6
7 Registration 1 r1 1 complete 2017-01-02 12:40:20 7
8 Triage and A~ 1 r2 501 complete 2017-01-02 22:32:25 8
9 Blood test 1 r3 1001 complete 2017-01-05 14:34:27 9
10 MRI SCAN 1 r4 1238 complete 2017-01-06 01:54:23 10
11 Discuss Resu~ 1 r6 1735 complete 2017-01-07 10:18:08 11
12 Check-out 1 r7 2230 complete 2017-01-09 19:45:45 12
In the above example, we can see the sequence of handling for patient 1 over a period of time. We can imagine that different patients would have different sequences or went through different number of sequences.
Now let's say I'm interested in a specific sequence and want to know which patients had gone through this specific sequence. How can I filter this dataset by this specific sequence so that I can get to know who these patients are?
The filter_activity_presence from edeaR library can help me with identifying the unique sequences and its frequency
patients %>% traces
# A tibble: 7 x 3
trace absolute_frequen~ relative_frequen~
<chr> <int> <dbl>
1 Registration,Triage and Assessment,X-Ray,Discuss R~ 258 0.516
2 Registration,Triage and Assessment,Blood test,MRI ~ 234 0.468
3 Registration,Triage and Assessment,Blood test,MRI ~ 2 0.004
4 Registration,Triage and Assessment,X-Ray 2 0.004
5 Registration,Triage and Assessment 2 0.004
6 Registration,Triage and Assessment,X-Ray,Discuss R~ 1 0.002
7 Registration,Triage and Assessment,Blood test 1 0.002
Let's say I'm interested in sequence from row 5, that is patients who had exclusively this sequence Registration -> Triage -> Assessment, there are 2 patients who had this sequence.
It seems the library that doesn't provide ready made function to extract this. At least from this doc page, https://www.bupar.net/subsetting.html#trace_length, it's not available.
Basically, given an exhaustive list of sequence, return all the patients who had gone through exactly this sequence.
In fact, if I can rebuild the trace and map it back to the original dataset, that should allow for a simple dplyr::filter. But this may not be ideal as well in the case if I'm interested in open ended sequence, for example, find all patients who started with Registration -> Triage and can be followed by any sequence.
Here's my long-winded attempt
# get trace for each patient
patient_trace <- as_tibble(patients) %>% group_by(patient) %>% dplyr::filter(registration_type == 'complete') %>%
summarise(trace = paste(handling, collapse = ","), n = n())
# identify the sequence trace of interest
trace_summary <- patients %>% traces
# here we want to see patients who had the sequence from row 5
res <- patients %>%
dplyr::filter(patient %in% c(patient_trace %>% dplyr::filter(trace %in% trace_summary$trace[5]) %>% .$patient)) %>%
dplyr::filter(registration_type == 'complete') %>%
arrange(patient, time)
# A tibble: 4 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 499 r1 499 complete 2018-05-01 22:57:38 1
2 Triage and As~ 499 r2 999 complete 2018-05-04 23:53:27 3
3 Registration 500 r1 500 complete 2018-05-02 01:28:23 2
4 Triage and As~ 500 r2 1000 complete 2018-05-05 07:16:02 4
You can filter them with dplyr :
library(dplyr)
req_sequence <- c('Registration', 'Triage and Assessment')
eventdataR::patients %>%
group_by(patient) %>%
filter(all(handling == req_sequence)) %>%
filter(registration_type == 'complete') %>%
ungroup
# handling patient employee handling_id registration_type time .order
# <fct> <chr> <fct> <chr> <fct> <dttm> <int>
#1 Registration 499 r1 499 complete 2018-05-01 22:57:38 3220
#2 Registration 500 r1 500 complete 2018-05-02 01:28:23 3221
#3 Triage and Assessment 499 r2 999 complete 2018-05-04 23:53:27 3720
#4 Triage and Assessment 500 r2 1000 complete 2018-05-05 07:16:02 3721
For this case to be sure of the output and to avoid any recycling effect we can filter registration_type == 'complete' first and also add another check of length(req_sequence) equal to number of rows for the patient id.
eventdataR::patients %>%
filter(registration_type == 'complete') %>%
group_by(patient) %>%
filter(length(req_sequence) == n() && all(handling == req_sequence)) %>%
ungroup

Is there a way to filter that does not include duplicates/repeated entries by particular groups?

Some context first:
I'm working with a data set which includes health related data. It includes questionnaire scores pre and post treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution on dplyr as this is package I'm most familiar with, but I didn't achieve what I've wanted.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1)
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
df$ClientNumber<-as.character(df$ClientNumber)
df$Pre_Post<-as.factor(df$Pre_Post)
View(df)
#tried solution
df2<-df%>%
group_by(ClientNumber)%>%
filter( Pre_Post==1|Pre_Post==2)
#this doesn't work, or needs more code to it
As you can see, the first four client numbers both have a pre and post treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post treatment score.
I only want to analyse clients that have a pre and post score, therefore I need to filter clients which have completed treatment, while excluding ones that do not have a post treatment score if they have appeared in the data again. In relation to the example I've provided, I want to include the first 8 for analysis while excluding the last two, as they do not have a post treatment score.
If these cases are to be kept in order, you could try:
library(dplyr)
df %>%
group_by(ClientNumber) %>%
filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 4355 1 62
2 2231 1 76
3 8894 1 88
4 9002 1 56
5 4355 2 22
6 2231 2 30
7 8894 2 35
8 9002 2 40
I don't know if you actually need to use n_distinct() but it won't hurt to keep it. This will remove cases who have a pre score but no post score if they exist in the data.
First arrange ClientNumbers then group_by and finally filter using dplyr::lead and dplyr::lag
library(dplyr)
df %>% arrange(ClientNumber) %>% group_by(ClientNumber) %>%
filter(Pre_Post==1 & lead(Pre_Post)==2 | Pre_Post==2 & lag(Pre_Post)==1)
# A tibble: 8 x 3
# Groups: ClientNumber [4]
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 2231 1 76
2 2231 2 30
3 4355 1 62
4 4355 2 22
5 8894 1 88
6 8894 2 35
7 9002 1 56
8 9002 2 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
library(dplyr)
df %>%
arrange(ClientNumber) %>%
group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
filter(n() == 2) %>%
ungroup() %>%
select(-group)
# ClientNumber Pre_Post QuestionnaireScore
# <chr> <fct> <dbl>
#1 2231 1 76
#2 2231 2 30
#3 4355 1 62
#4 4355 2 22
#5 8894 1 88
#6 8894 2 35
#7 9002 1 56
#8 9002 2 40
The same can be translated in base R using ave
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(Pre_Post,ClientNumber,cumsum(Pre_Post == 1),FUN = length) == 2)

time differences for multiple events for same ID in R

I'm new to Stackoverflow and looked at similar posts but couldn't find a solution that can capture time differences from multiple events from the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
Value<-c(0,1,0,1,0,0,0,1,0,1,1,0)
StoreID<-c('a','a','b','b','c','c','d','d','a','a','d','c')
Unit<-c(1,1,2,2,5,5,6,6,1,1,6,5)
Helper<-c('a1','a1','b2','b2','c5','c5','d6','d6','a1','a1','d6','c5')
The helper column is the StoreID and Unit combined because I couldn't figure out how to group by both Store ID and the Unit. I want to sort the data to show when the unit was disabled (value =0) and enabled again (value =1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0) Time(v=1) Time2(v=0) Time 2(v=1)
a 1 a1 2016-10-04 2016-10-18 2016-10-21 2016-10-22
b 2 b2 2016-10-04 2016-10-18
c 5 c5 2016-10-19 2016-10-28 2017-03-04
d 6 d6 2016-10-04 2017-10-19
Any thoughts?
I'm thinking something in dplyr but am stumped about where to go further.
Create a Header column that combines the Value column and the row number that distinguishes duplicates, then spread to wide format:
Didn't use the helper column, grouped by StoredID and Unit instead.
df <- data.frame(StoreID, Unit, Time, Value)
df %>%
group_by(StoreID, Unit, Value) %>%
mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
ungroup() %>% select(-Value) %>%
spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA

Resources