I have event log data. For a reproducible example, let's use the data from eventdataR:
## look at patient 1 sequence
eventdataR::patients %>% dplyr::filter(patient == '1')
# A tibble: 12 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 1 r1 1 start 2017-01-02 11:41:53 1
2 Triage and A~ 1 r2 501 start 2017-01-02 12:40:20 2
3 Blood test 1 r3 1001 start 2017-01-05 08:59:04 3
4 MRI SCAN 1 r4 1238 start 2017-01-05 21:37:12 4
5 Discuss Resu~ 1 r6 1735 start 2017-01-07 07:57:49 5
6 Check-out 1 r7 2230 start 2017-01-09 17:09:43 6
7 Registration 1 r1 1 complete 2017-01-02 12:40:20 7
8 Triage and A~ 1 r2 501 complete 2017-01-02 22:32:25 8
9 Blood test 1 r3 1001 complete 2017-01-05 14:34:27 9
10 MRI SCAN 1 r4 1238 complete 2017-01-06 01:54:23 10
11 Discuss Resu~ 1 r6 1735 complete 2017-01-07 10:18:08 11
12 Check-out 1 r7 2230 complete 2017-01-09 19:45:45 12
In the above example, we can see the sequence of handlings for patient 1 over a period of time. Different patients may go through different sequences, or through sequences of different lengths.
Now let's say I'm interested in a specific sequence and want to know which patients went through it. How can I filter this dataset by that sequence to find out who these patients are?
The traces function from bupaR can help me identify the unique sequences and their frequencies:
patients %>% traces
# A tibble: 7 x 3
trace absolute_frequen~ relative_frequen~
<chr> <int> <dbl>
1 Registration,Triage and Assessment,X-Ray,Discuss R~ 258 0.516
2 Registration,Triage and Assessment,Blood test,MRI ~ 234 0.468
3 Registration,Triage and Assessment,Blood test,MRI ~ 2 0.004
4 Registration,Triage and Assessment,X-Ray 2 0.004
5 Registration,Triage and Assessment 2 0.004
6 Registration,Triage and Assessment,X-Ray,Discuss R~ 1 0.002
7 Registration,Triage and Assessment,Blood test 1 0.002
Let's say I'm interested in the sequence from row 5, that is, patients who had exclusively the sequence Registration -> Triage and Assessment. There are 2 patients who had this sequence.
It seems the library doesn't provide a ready-made function to extract this. At least judging from this doc page, https://www.bupar.net/subsetting.html#trace_length, it's not available.
Basically: given an exact sequence, return all the patients who went through exactly that sequence.
In fact, if I can rebuild the trace and map it back to the original dataset, a simple dplyr::filter should do. But this may not be ideal if I'm interested in an open-ended sequence, for example finding all patients who started with Registration -> Triage and Assessment, followed by any sequence.
Here's my long-winded attempt
# get trace for each patient
patient_trace <- as_tibble(patients) %>% group_by(patient) %>% dplyr::filter(registration_type == 'complete') %>%
summarise(trace = paste(handling, collapse = ","), n = n())
# identify the sequence trace of interest
trace_summary <- patients %>% traces
# here we want to see patients who had the sequence from row 5
res <- patients %>%
dplyr::filter(patient %in% c(patient_trace %>% dplyr::filter(trace %in% trace_summary$trace[5]) %>% .$patient)) %>%
dplyr::filter(registration_type == 'complete') %>%
arrange(patient, time)
# A tibble: 4 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 499 r1 499 complete 2018-05-01 22:57:38 1
2 Triage and As~ 499 r2 999 complete 2018-05-04 23:53:27 3
3 Registration 500 r1 500 complete 2018-05-02 01:28:23 2
4 Triage and As~ 500 r2 1000 complete 2018-05-05 07:16:02 4

You can filter them with dplyr:
library(dplyr)
req_sequence <- c('Registration', 'Triage and Assessment')
eventdataR::patients %>%
  group_by(patient) %>%
  filter(all(handling == req_sequence)) %>%
  filter(registration_type == 'complete')
# handling patient employee handling_id registration_type time .order
# <fct> <chr> <fct> <chr> <fct> <dttm> <int>
#1 Registration 499 r1 499 complete 2018-05-01 22:57:38 3220
#2 Registration 500 r1 500 complete 2018-05-02 01:28:23 3221
#3 Triage and Assessment 499 r2 999 complete 2018-05-04 23:53:27 3720
#4 Triage and Assessment 500 r2 1000 complete 2018-05-05 07:16:02 3721
To be sure of the output in this case and to avoid any recycling effect, we can filter on registration_type == 'complete' first and also check that length(req_sequence) equals the number of rows for the patient:
eventdataR::patients %>%
  filter(registration_type == 'complete') %>%
  group_by(patient) %>%
  filter(length(req_sequence) == n() && all(handling == req_sequence))
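The question also asks about open-ended sequences (patients who start with a given sequence and may continue with anything). The following is a minimal sketch of both exact and prefix matching, assuming a small hypothetical log called toy_log with one row per completed activity; the helper starts_with_seq is an illustrative name, not part of dplyr or bupaR.

```r
library(dplyr)

# Hypothetical toy log (not the eventdataR data), one row per completed activity
toy_log <- data.frame(
  patient  = c("p1", "p1", "p2", "p2", "p2", "p3"),
  handling = c("Registration", "Triage and Assessment",
               "Registration", "Triage and Assessment", "X-Ray",
               "Registration"),
  time     = as.POSIXct("2017-01-02 08:00:00", tz = "UTC") + 3600 * c(0, 1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

req_sequence <- c("Registration", "Triage and Assessment")

# Illustrative helper: TRUE when the first length(prefix) activities equal prefix
starts_with_seq <- function(x, prefix) {
  length(x) >= length(prefix) &&
    identical(as.character(x)[seq_along(prefix)], prefix)
}

# Patients whose whole trace is exactly req_sequence
exact_match <- toy_log %>%
  arrange(patient, time) %>%
  group_by(patient) %>%
  filter(identical(as.character(handling), req_sequence)) %>%
  ungroup()

# Patients whose trace begins with req_sequence, whatever follows
prefix_match <- toy_log %>%
  arrange(patient, time) %>%
  group_by(patient) %>%
  filter(starts_with_seq(handling, req_sequence)) %>%
  ungroup()

unique(exact_match$patient)   # "p1"
unique(prefix_match$patient)  # "p1" "p2"
```

Using identical() on the ordered activities avoids the recycling issue mentioned above, because vectors of different lengths never compare as equal.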


Extract tibble_df and text message from activitylog object

I have the code below
hospital %>%
rename(start = start_ts,
complete = complete_ts) -> hospital
hospital %>%
convert_timestamps(c("start","complete"), format = dmy_hms) -> hospital
hospital %>%
activitylog(case_id = "patient_visit_nr",
activity_id = "activity",
resource_id = "originator",
timestamps = c("start", "complete")) -> hospital
hospital %>% detect_time_anomalies()
which gives
*** OUTPUT ***
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
# A tibble: 3 × 3
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
Anomalies are found in the following rows:
# Log of 10 events consisting of:
3 traces
3 cases
5 instances of 3 activities
5 resources
Events occurred from 2017-11-21 11:22:16 until 2017-11-21 19:00:00
# Variables were mapped as follows:
Case identifier: patient_visit_nr
Activity identifier: activity
Resource identifier: originator
Timestamps: start, complete
# A tibble: 5 × 10
patient_visit_nr activity originator start complete triagecode specialization .order durat…¹ type
<dbl> <chr> <chr> <dttm> <dttm> <dbl> <chr> <int> <dbl> <chr>
1 518 Registration Clerk 12 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 1 -23 nega…
2 518 Registration Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 2 -23 nega…
3 518 Registration Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16 4 PED 3 -23 nega…
4 520 Trage Nurse 17 2017-11-21 13:43:16 2017-11-21 13:39:00 5 URG 4 -4.27 nega…
5 528 Clinical exam Doctor 1 2017-11-21 19:00:00 2017-11-21 19:00:00 3 TRAU 5 0 zero…
# … with abbreviated variable name ¹​duration
From this output, I would like to extract the text message
For 5 rows in the activity log (9.43%), an anomaly is detected.
The anomalies are spread over the activities as follows:
and, in a separate object, the tibble
activity type n
<chr> <chr> <int>
1 Registration negative duration 3
2 Clinical exam zero duration 1
3 Trage negative duration 1
The string you are trying to obtain is a message, and though it is possible to capture a message, it's not that straightforward. You can easily regenerate it, though, by emulating a couple of lines from inside the function.
If you store the result of detect_time_anomalies:
anomalies <- hospital %>% detect_time_anomalies()
Then you can generate the message like this:
paste0("For ", nrow(anomalies), " rows in the activity log (",
round(nrow(anomalies)/nrow(hospital) * 100, 2),
"%), an anomaly is detected.")
#> [1] "For 5 rows in the activity log (9.43%), an anomaly is detected."
Similarly, you can obtain the output table like this:
anomalies %>%
  group_by(activity, type) %>%
  summarize(n = n()) %>%
  arrange(desc(n))
#> # A tibble: 3 x 3
#> activity type n
#> <chr> <chr> <int>
#> 1 Registration negative duration 3
#> 2 Clinical exam zero duration 1
#> 3 Trage negative duration 1
Created on 2022-12-13 with reprex v2.0.2
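If you do want to capture the message itself rather than rebuild it, base R's withCallingHandlers can collect it. This is only a sketch: noisy_summary is a hypothetical stand-in for a function that reports through message(), capture_messages is an illustrative helper name, and the numbers 5 and 53 are made up for the example.

```r
# Hypothetical stand-in for a function that reports via message()
noisy_summary <- function(n_bad, n_total) {
  message("For ", n_bad, " rows in the activity log (",
          round(n_bad / n_total * 100, 2), "%), an anomaly is detected.")
  invisible(n_bad)
}

# Illustrative helper: evaluate expr, collect every message, keep the return value
capture_messages <- function(expr) {
  msgs <- character(0)
  value <- withCallingHandlers(
    expr,
    message = function(m) {
      msgs <<- c(msgs, conditionMessage(m))  # store the text
      invokeRestart("muffleMessage")         # stop it from printing
    }
  )
  list(value = value, messages = trimws(msgs))
}

res <- capture_messages(noisy_summary(5, 53))
res$messages
#> [1] "For 5 rows in the activity log (9.43%), an anomaly is detected."
```

This only catches text emitted as a message condition; anything the function prints with print() or cat() would need capture.output() instead.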

Finding time difference that meets a conditional statement

I have an R question about data wrangling. A sample data set can be downloaded online:
x<- read.csv("http://mgimond.github.io/ES218/Data/CO2.csv")
The datatable is shown in the attached image.
I want to create a new column, let's say "time_since". This column would look at the "Average" column and calculate the time (in this case, months) since "Average" dropped below 300. In this screenshot all values are >300, so the value would be "0", but the first month with a value below 300 would be "1" (one month under 300). If the following months are still under 300, the count keeps increasing with each month that goes by, but as soon as the value becomes >300 again it resets.
Basically, it would be a function that calculates the time elapsed since a condition was first met and restarts when the condition is broken.
I apologize if my wording is a bit confusing, but hopefully the message comes across.
Maybe you can try:
x %>%
group_by(grp = cumsum(as.integer(Average > 300))) %>%
mutate(time_since = row_number()) %>%
ungroup -> result
Just to show you one excerpt of output where time_since > 1.
result %>% filter(grp == 61)
# Year Month Average Interpolated Trend Daily_mean grp time_since
# <int> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#1 1964 1 320. 320. 320. -1 61 1
#2 1964 2 -100. 320. 320. -1 61 2
#3 1964 3 -100. 321. 320. -1 61 3
#4 1964 4 -100. 322. 319. -1 61 4
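In the output above, the row that opens each group is itself above 300 and still gets time_since = 1. If rows above the threshold should show 0 and the count should start at 1 on the first month below 300, as the question describes, a small variation may help. This is a sketch on a hypothetical toy data frame (the Average values are made up), and it does not treat the negative placeholder values in the real file specially.

```r
library(dplyr)

# Hypothetical monthly values, in chronological order
toy <- data.frame(Average = c(320, 310, 290, 280, 305, 295, 299, 298))

result <- toy %>%
  mutate(below = Average < 300,
         run   = cumsum(!below)) %>%       # new run id on every row that is not below
  group_by(run) %>%
  mutate(time_since = cumsum(below)) %>%   # 0 on the resetting row, then 1, 2, ...
  ungroup() %>%
  select(-run)

result$time_since
#> [1] 0 0 1 2 0 1 2 3
```

Each row at or above the threshold starts a new run and contributes 0, so the counter resets there and counts only the consecutive months below 300 that follow.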
Here is a data.table approach. In this example, time_since counts the rows within each run: the counter restarts at 1 on every row where Average is below 315 and keeps increasing while Average stays at or above 315.
library(data.table)
x <- read.csv("http://mgimond.github.io/ES218/Data/CO2.csv")
setDT(x)
x[, time_since := seq_len(.N), by = .(cumsum(Average < 315))][1:10, ]
#> Year Month Average Interpolated Trend Daily_mean time_since
#> 1: 1959 1 315.62 315.62 315.70 -1 1
#> 2: 1959 2 316.38 316.38 315.88 -1 2
#> 3: 1959 3 316.71 316.71 315.62 -1 3
#> 4: 1959 4 317.72 317.72 315.56 -1 4
#> 5: 1959 5 318.29 318.29 315.50 -1 5
#> 6: 1959 6 318.15 318.15 315.92 -1 6
#> 7: 1959 7 316.54 316.54 315.66 -1 7
#> 8: 1959 8 314.80 314.80 315.81 -1 1
#> 9: 1959 9 313.84 313.84 316.55 -1 1
#> 10: 1959 10 313.26 313.26 316.19 -1 1
Created on 2021-03-17 by the reprex package (v0.3.0)

Is there a way to filter that does not include duplicates/repeated entries by particular groups?

Some context first:
I'm working with a data set which includes health related data. It includes questionnaire scores pre and post treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution in dplyr, as this is the package I'm most familiar with, but I didn't achieve what I wanted.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1) #1 = pre treatment, 2 = post treatment
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
#tried solution
df %>%
  filter(Pre_Post==1|Pre_Post==2)
#this doesn't work, or needs more code to it
As you can see, the first four client numbers each have a pre and a post treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post treatment score for that second round.
I only want to analyse clients that have a pre and post score, therefore I need to filter clients which have completed treatment, while excluding ones that do not have a post treatment score if they have appeared in the data again. In relation to the example I've provided, I want to include the first 8 for analysis while excluding the last two, as they do not have a post treatment score.
If these cases are to be kept in order, you could try:
df %>%
group_by(ClientNumber) %>%
filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 4355 1 62
2 2231 1 76
3 8894 1 88
4 9002 1 56
5 4355 2 22
6 2231 2 30
7 8894 2 35
8 9002 2 40
I don't know if you actually need n_distinct(), but it won't hurt to keep it. It removes clients who have a pre score but no post score at all, if any exist in the data.
First arrange by ClientNumber, then group_by, and finally filter using dplyr::lead and dplyr::lag:
df %>% arrange(ClientNumber) %>% group_by(ClientNumber) %>%
filter(Pre_Post==1 & lead(Pre_Post)==2 | Pre_Post==2 & lag(Pre_Post)==1)
# A tibble: 8 x 3
# Groups: ClientNumber [4]
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 2231 1 76
2 2231 2 30
3 4355 1 62
4 4355 2 22
5 8894 1 88
6 8894 2 35
7 9002 1 56
8 9002 2 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
df %>%
  arrange(ClientNumber) %>%
  group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
  filter(n() == 2) %>%
  ungroup() %>%
  select(-group)
# ClientNumber Pre_Post QuestionnaireScore
# <chr> <fct> <dbl>
#1 2231 1 76
#2 2231 2 30
#3 4355 1 62
#4 4355 2 22
#5 8894 1 88
#6 8894 2 35
#7 9002 1 56
#8 9002 2 40
The same can be translated to base R using ave:
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(Pre_Post,ClientNumber,cumsum(Pre_Post == 1),FUN = length) == 2)
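The approaches above agree on the mock data, but they can differ when a client completes more than one full treatment episode. As a rough illustration, assuming a hypothetical client "1000" with two complete pre/post pairs, the !duplicated() version keeps only the first pair while the cumsum() grouping keeps both:

```r
library(dplyr)

# Hypothetical client who completes two full episodes (made-up scores)
df2 <- data.frame(
  ClientNumber       = c("1000", "1000", "1000", "1000"),
  Pre_Post           = c(1, 2, 1, 2),
  QuestionnaireScore = c(60, 30, 55, 25)
)

# Keeps the first pre and the first post row per client
first_pair_only <- df2 %>%
  group_by(ClientNumber) %>%
  filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2) %>%
  ungroup()

# Keeps every episode that has exactly two rows (one pre, one post)
all_pairs <- df2 %>%
  group_by(ClientNumber, episode = cumsum(Pre_Post == 1)) %>%
  filter(n() == 2) %>%
  ungroup()

nrow(first_pair_only)  # 2
nrow(all_pairs)        # 4
```

Which behaviour is right depends on whether repeat episodes should count as separate observations in the analysis.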

update overall mean (review score) per day and subject

I have a dataset of several game reviews and I want to calculate the overall score each game had up to each day, so basically the overall score a user saw on any given day.
The reviews are binary so it's just a vote up/down system where each 1 in the column positive marks an upvote:
game_id created positive
123 2018-07-18 1
123 2018-07-18 0
123 2018-07-18 1
123 2018-07-19 1
456 2018-06-23 1
456 2018-06-25 1
456 2018-06-25 0
456 2018-06-26 1
789 2018-07-18 1
Calculating the overall mean per day is easy with
df %>%
  group_by(game_id, created) %>%
  summarise(total_score = mean(positive))
but I'm struggling with how to include the reviews from the days before.
I want it to look like this:
game_id created total_score
123 2018-07-18 0.66
123 2018-07-19 0.75
456 2018-06-23 1.0
456 2018-06-25 0.66
456 2018-06-26 0.75
789 2018-07-18 1
I thought about using a combination of a loop and an if statement, but I'm not really able to formulate it (and I'm doubtful about its efficiency for larger datasets...).
Here's a way to achieve it using dplyr. The key here is to create an intermediate computation of cumulative sums and then use those for the ratio:
df %>%
group_by(game_id, created) %>%
summarise(pos=sum(positive), tot=n()) %>%
group_by(game_id) %>%
mutate(pct = cumsum(pos) / cumsum(tot))
# A tibble: 6 x 5
# Groups: game_id [3]
game_id created pos tot pct
<int> <fct> <int> <int> <dbl>
1 123 2018-07-18 2 3 0.667
2 123 2018-07-19 1 1 0.75
3 456 2018-06-23 1 1 1
4 456 2018-06-25 1 2 0.667
5 456 2018-06-26 1 1 0.75
6 789 2018-07-18 1 1 1
Assuming your data frame is named df, you can run
df <- arrange(df, game_id, created) ## sort dataset
df$csum <- ave(df$positive, df$game_id, FUN=cumsum) ## create cumulative sum
to create the cumulative sum of upvotes for each game_id. Make sure your data frame is sorted by game_id and created.
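The cumulative sum is only half of the ratio; a running review count is needed as well. A base R sketch of the remaining steps, using the sample rows from the question (the column names cnt and total_score and the object daily are illustrative choices):

```r
# Sample data from the question
df <- data.frame(
  game_id  = c(123, 123, 123, 123, 456, 456, 456, 456, 789),
  created  = as.Date(c("2018-07-18", "2018-07-18", "2018-07-18", "2018-07-19",
                       "2018-06-23", "2018-06-25", "2018-06-25", "2018-06-26",
                       "2018-07-18")),
  positive = c(1, 0, 1, 1, 1, 1, 0, 1, 1)
)

df <- df[order(df$game_id, df$created), ]                  # sort by game and day
df$csum <- ave(df$positive, df$game_id, FUN = cumsum)      # running upvotes per game
df$cnt  <- ave(df$positive, df$game_id, FUN = seq_along)   # running review count per game
df$total_score <- df$csum / df$cnt

# Keep the last review of each game/day, i.e. the score at the end of that day
daily <- df[!duplicated(df[c("game_id", "created")], fromLast = TRUE),
            c("game_id", "created", "total_score")]
round(daily$total_score, 3)
#> [1] 0.667 0.750 1.000 0.667 0.750 1.000
```

The end-of-day values match the pct column produced by the dplyr answer above.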

time differences for multiple events for same ID in R

I'm new to Stack Overflow and looked at similar posts, but couldn't find a solution that captures time differences between multiple events for the same ID.
What I've got:
Time<-c('2016-10-04','2016-10-18', '2016-10-04','2016-10-18','2016-10-19','2016-10-28','2016-10-04','2016-10-19','2016-10-21','2016-10-22', '2017-01-02', '2017-03-04')
The helper column is the Store ID and Unit combined, because I couldn't figure out how to group by both Store ID and Unit. I want to sort the data to show when the unit was disabled (value = 0) and enabled again (value = 1).
Ultimately, I'd want:
Store_ID Unit Helper Time(v=0) Time(v=1) Time2(v=0) Time 2(v=1)
a 1 a1 2016-10-04 2016-10-18 2016-10-21 2016-10-22
b 2 b2 2016-10-04 2016-10-18
c 5 c5 2016-10-19 2016-10-28 2017-03-04
d 6 d6 2016-10-04 2016-10-19
Any thoughts?
I'm thinking something in dplyr but am stumped about where to go further.
Create a Headers column that combines the Value column with a row number that distinguishes duplicates, then spread to wide format.
I didn't use the helper column; I grouped by StoreID and Unit instead.
library(dplyr)
library(tidyr)
df <- data.frame(StoreID, Unit, Time, Value)
df %>%
group_by(StoreID, Unit, Value) %>%
mutate(Headers = sprintf('Time %s (v=%s)', row_number(), Value)) %>%
ungroup() %>% select(-Value) %>%
spread(Headers, Time)
# A tibble: 4 x 7
# StoreID Unit `Time 1 (v=0)` `Time 1 (v=1)` `Time 2 (v=0)` `Time 2 (v=1)` `Time 3 (v=0)`
#* <fctr> <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 a 1 2016-10-04 2016-10-18 2016-10-21 2016-10-22 NA
#2 b 2 2016-10-04 2016-10-18 NA NA NA
#3 c 5 2016-10-19 NA 2016-10-28 NA 2017-03-04
#4 d 6 2016-10-04 2016-10-19 NA 2017-01-02 NA
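In current tidyr, spread() is superseded by pivot_wider(). Below is a sketch of the same reshaping with pivot_wider(), followed by one time difference. The StoreID, Unit and Value vectors are an assumption: they are reconstructed from the output above, since only Time appears in the question. first_gap is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Assumed inputs, reconstructed from the answer's output (only Time is given in the question)
StoreID <- c("a", "a", "b", "b", "c", "c", "d", "d", "a", "a", "d", "c")
Unit    <- c(1, 1, 2, 2, 5, 5, 6, 6, 1, 1, 6, 5)
Value   <- c(0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0)
Time    <- as.Date(c("2016-10-04", "2016-10-18", "2016-10-04", "2016-10-18",
                     "2016-10-19", "2016-10-28", "2016-10-04", "2016-10-19",
                     "2016-10-21", "2016-10-22", "2017-01-02", "2017-03-04"))
df <- data.frame(StoreID, Unit, Time, Value)

wide <- df %>%
  arrange(StoreID, Unit, Time) %>%
  group_by(StoreID, Unit, Value) %>%
  mutate(Headers = sprintf("Time %s (v=%s)", row_number(), Value)) %>%
  ungroup() %>%
  select(-Value) %>%
  pivot_wider(names_from = Headers, values_from = Time)

# Days between the first disable (v=0) and the first enable (v=1) per unit
first_gap <- as.numeric(wide[["Time 1 (v=1)"]] - wide[["Time 1 (v=0)"]])
first_gap
#> [1] 14 14 NA 15
```

Storing Time as Date rather than as a factor is what makes the subtraction possible; unit c 5 has no enable event, so its gap is NA.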
