I'm looking to find >=4 unique events that all occur within a group inside a 90-day period, and then flag the ID.
Just a test example:
library(dplyr)
set.seed(1)
test <- data.frame(
PATID = sample(1:1e4, 1e5, replace = TRUE),
PROV = sample(1:50, 1e5, replace = TRUE),
GROUP = sample(0:1, 1e5, replace = TRUE),
DATE = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")
)
If we look at PATID==5 we can see there are 4 unique PROVs with dates falling within 90 days of each other and within our group of interest, so this ID should be flagged.
> test %>% filter(PATID==5) %>% arrange(GROUP,DATE)
PATID PROV GROUP DATE
1 5 2 0 2020-05-07
2 5 3 0 2020-05-20
3 5 3 0 2020-11-15
4 5 49 0 2020-12-14
5 5 45 1 2020-02-16
6 5 50 1 2020-03-19
7 5 38 1 2020-03-25
8 5 27 1 2020-03-29
9 5 42 1 2020-08-30
10 5 46 1 2020-11-03
11 5 25 1 2020-11-13
12 5 29 1 2020-12-26
> as.Date("2020-03-29")-as.Date("2020-02-16")<=90
[1] TRUE
Ultimately, I'm looking for the proportion of GROUP==1 vs GROUP==0 with >=4 unique PROVs within 90 days. Ideally I'd prefer using data.table simply due to the scale of data.
Trying out some code:
test %>%
filter(PATID %in% 1:5) %>%
group_by(PATID,GROUP) %>%
arrange(GROUP, DATE) %>%
mutate(lag = DATE - lag(DATE),
day_count = case_when(lag <= 90 ~ TRUE,
is.na(lag) ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(crit = cumsum_reset(day_count)) %>% # cumsum_reset() is a custom helper; see the sketch below
ungroup() %>%
group_by(PATID) %>%
mutate(flag = case_when(max(crit) >= 4 ~ 1,
TRUE ~ 0)) %>%
arrange(PATID)
Getting closer; I just need to sort out a proper rolling 90-day window rather than crudely testing whether each date is within 90 days of the previous one.
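For reference, cumsum_reset() above isn't a base R or dplyr function; it's a small custom helper. A minimal sketch of the kind of helper assumed here (a cumulative count of consecutive TRUEs that restarts at each FALSE):
# cumulative count of consecutive TRUEs, resetting to 0 at each FALSE
cumsum_reset <- function(x) {
  # cumsum(!x) starts a new group id at every FALSE;
  # within each group, ave() then takes a running sum of the TRUEs
  ave(as.numeric(x), cumsum(!x), FUN = cumsum)
}
cumsum_reset(c(TRUE, TRUE, FALSE, TRUE)) # 1 2 0 1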
Maybe the following is what you are after. Please check whether the logic matches what you meant; I left it more explicit than necessary so the idea is easier to follow. The main idea is that, after sorting, if an observation from the same PATID & GROUP is within 90 days of the 3rd lag, diff_3 := DATE - shift(DATE, 3), then it should be flagged; this is checked with diff_check = diff_3 <= 90. If any observation for a PATID/GROUP passes, the whole ID is then flagged by keep = max(diff_check, na.rm = TRUE, pmin = 0) after grouping by PATID only.
Using the third lag accounts for "4 or more" observations rather than "strictly more than 4".
Does it, all in all, make any sense?
library(data.table)
set.seed(1)
test <- data.frame(
PATID = sample(1:1e4, 1e5, replace = TRUE),
PROV = sample(1:50, 1e5, replace = TRUE),
GROUP = sample(0:1, 1e5, replace = TRUE),
DATE = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")
)
dt <- as.data.table(test)
dt <- dt[order(PATID, GROUP, DATE)]
dt[, diff_3 := DATE - shift(DATE, 3), by = c("PATID", "GROUP")]
# count unique PROV values in each rolling window of 4 observations (current + previous 3)
dt[, unique_last_4 := frollapply(x = PROV, n = 4, FUN = uniqueN), by = c("PATID", "GROUP")]
# check whether the window spans <= 90 days and contains 4 unique PROVs
dt[, diff_check := diff_3 <= 90 & unique_last_4 == 4, by = c("PATID", "GROUP")]
# flag all observations of any ID that satisfied the above checks at least once
dt[, to_keep := max(diff_check, na.rm = TRUE, pmin = 0), by = "PATID"]
# NOTE: unsure if you mean to group only by PATID here or by PATID & GROUP.
head(dt[to_keep==1], 20)
#> PATID PROV GROUP DATE diff_3 unique_last_4 diff_check to_keep
#> 1: 5 2 0 2020-05-07 NA days NA NA 1
#> 2: 5 3 0 2020-05-20 NA days NA NA 1
#> 3: 5 3 0 2020-11-15 NA days NA NA 1
#> 4: 5 49 0 2020-12-14 221 days 3 FALSE 1
#> 5: 5 45 1 2020-02-16 NA days NA NA 1
#> 6: 5 50 1 2020-03-19 NA days NA NA 1
#> 7: 5 38 1 2020-03-25 NA days NA NA 1
#> 8: 5 27 1 2020-03-29 42 days 4 TRUE 1
#> 9: 5 42 1 2020-08-30 164 days 4 FALSE 1
#> 10: 5 46 1 2020-11-03 223 days 4 FALSE 1
#> 11: 5 25 1 2020-11-13 229 days 4 FALSE 1
#> 12: 5 29 1 2020-12-26 118 days 4 FALSE 1
#> 13: 7 1 0 2020-04-10 NA days NA NA 1
#> 14: 7 44 0 2020-04-29 NA days NA NA 1
#> 15: 7 27 0 2020-05-05 NA days NA NA 1
#> 16: 7 41 0 2020-06-11 62 days 4 TRUE 1
#> 17: 7 35 0 2020-06-30 62 days 4 TRUE 1
#> 18: 7 11 0 2020-12-18 227 days 4 FALSE 1
#> 19: 7 24 1 2020-12-24 NA days NA NA 1
#> 20: 7 13 1 2020-12-29 NA days NA NA 1
Created on 2021-06-22 by the reprex package (v2.0.0)
dplyr version
test_keep <- test %>%
  arrange(PATID, GROUP, DATE) %>%
  head(1000) %>% # otherwise it takes too long on my PC, which shows data.table's efficiency!
  group_by(PATID, GROUP) %>%
  mutate(diff_3 = DATE - lag(DATE, 3),
         diff_check = diff_3 <= 90,
         unique_last_4 = frollapply(x = PROV, n = 4, FUN = uniqueN)) %>%
  group_by(PATID) %>%
  mutate(keep = max(diff_check, na.rm = TRUE, pmin = 0)) %>%
  arrange(PATID, GROUP)
test_keep %>% filter(keep == 1) %>% head(20)
Based on "I'm looking for the annual group proportion of patients that visit >=4 providers within 90 days", you can try this:
library(data.table) #data.table 1.13.2
setDT(test)[, c("d90ago", "d90aft") := .(DATE - 90L, DATE + 90L)]
setkey(test, PATID, DATE)
test[, grp :=
  # for each visit, flag whether >= 4 distinct PROVs occur for the same PATID within +/- 90 days
  .SD[.SD, on=.(PATID, DATE>=d90ago, DATE<=d90aft), by=.EACHI, +(length(unique(x.PROV))>=4L)]$V1
]
The above allows PROV within overlapping windows of 90 days to be re-used.
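To get from the row-level grp flag to the proportion asked for, one possible aggregation (assuming a patient counts as flagged within a GROUP as soon as any of their visits is flagged; that reading is my own) would be:
# per-patient/per-group flag, then share of flagged patients in each GROUP
flagged <- test[, .(flag = max(grp)), by = .(PATID, GROUP)]
flagged[, .(prop_flagged = mean(flag)), by = GROUP]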
There are some ambiguities in the question, so this may not be quite right. I tried doing this using dplyr and local data frames, but the self-join causes an overflow (100,000 times 100,000).
It seems to work using data.table and using PostgreSQL, which has an OVERLAPS function.
(Note that I used lower-case variable names to make working with SQL easier.)
In the answer below, I start with a patient visit ((patid, prov, group, date) combination) and look forward 90 days to capture all visits by that patient (patid) to other providers (prov != prov_other). I then count the number of distinct providers in that lookahead period (this will be NA when there are no visits, as when looking at a patient's last visit in the sample). I then count the number of visits where the number of additional distinct providers in the subsequent 90 days is 3 or more.
Finally, I group by (group, year) and count the proportion of visits that are followed by visits to at least three other providers during the subsequent 90 days. Given the way the data are generated, it is no surprise that the two groups look similar on this metric.
Note that each patient visit forms a unit of observation here. In practice, it may make sense to aggregate by (say) (patid, year) before calculating statistics, or to do some other kind of aggregation (see the sketch after the output below).
library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
test <- tibble(
patid = sample(1:1e4, 1e5, replace = TRUE),
prov = sample(1:50, 1e5, replace = TRUE),
group = sample(0:1, 1e5, replace = TRUE),
date = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")) %>%
as.data.table()
test
#> patid prov group date
#> 1: 1017 6 1 2020-08-03
#> 2: 8004 34 0 2020-12-15
#> 3: 4775 32 0 2020-06-21
#> 4: 9725 47 1 2020-09-25
#> 5: 8462 15 0 2020-03-05
#> ---
#> 99996: 949 47 0 2020-07-05
#> 99997: 2723 37 0 2020-08-18
#> 99998: 201 27 1 2020-01-06
#> 99999: 163 9 0 2020-03-06
#> 100000: 3204 48 1 2020-11-17
df_overlap <-
test %>%
inner_join(test, by = "patid", suffix = c("", "_other")) %>%
filter(prov != prov_other) %>%
filter(date_other >= date & date_other <= date + 90L)
mt_4_provs_df <-
df_overlap %>%
group_by(patid, prov, group, date) %>%
summarize(n_providers = n_distinct(prov_other), .groups = "drop")
results <-
test %>%
left_join(mt_4_provs_df, by = c("patid", "prov", "group", "date")) %>%
mutate(mt_4_provs = n_providers >= 3,
year = year(date)) %>%
group_by(group, year) %>%
summarize(prop_mt_4_provs = mean(mt_4_provs, na.rm = TRUE),
.groups = "drop")
results
#> # A tibble: 2 x 3
#> group year prop_mt_4_provs
#> <int> <int> <dbl>
#> 1 0 2020 0.426
#> 2 1 2020 0.423
Created on 2021-06-22 by the reprex package (v2.0.0)
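As a sketch of the patient-level aggregation mentioned above (my assumption: a patient-year counts as flagged if any of its visits is followed by three or more other distinct providers within 90 days):
test %>%
  left_join(mt_4_provs_df, by = c("patid", "prov", "group", "date")) %>%
  mutate(year = year(date)) %>%
  group_by(patid, group, year) %>%
  # flagged if any visit that year has >= 3 additional distinct providers in its window
  summarize(flagged = any(n_providers >= 3, na.rm = TRUE), .groups = "drop") %>%
  group_by(group, year) %>%
  summarize(prop_flagged_patients = mean(flagged), .groups = "drop")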
Related
I have a system which records sanctions against clients' names.
There should only ever be one active sanction, yet there are some cases where there are multiple active sanctions.
I would like to know how I can count how many people had two or more simultaneously-active sanctions over the past three years (sample data ranges from 2019-2022, so this won't need to be filtered in the solution).
The way I would work this out is to detect those cases where start_date2 occurs before end_date1.
Sample data (note that the end_date values are random, so there may be several cases of them occurring before their respective start_date values, but bear in mind that this is just sample data, so take it with a pinch of salt):
set.seed(147)
sanc <-
data.frame(
client = rep(1:200, each = 5),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
)
sanc$start_month_year = format(as.Date(sanc$start_date, "%Y-%m-%d"), "%Y-%m")
The algorithm in my mind goes like this:
for each client
check if there was more than one active sanction at any one time
look for cases where start_date2/start_date3/start_dateY occurs before end_date1/end_date2/end_dateX
group by month-year (using month_year column)
The output I am looking for is a monthly breakdown, indicating how many simultaneous sanctions occurred per month. Something like this:
01-2020: 10
02-2020: 35
03-2020: 29
...
01-2022: 5
I believe that I have covered everything, but am happy to clarify anything where required/requested.
Updated, given clarifications in comment section
If we do this without regard to client, then we have something like this:
sanc %>%
  arrange(start_date) %>%
  mutate(same_as_prev = start_date < lag(end_date) |
           row_number() == 1 & end_date > lead(start_date)) %>%
  group_by(start_month_year) %>%
  summarize(simActive = sum(same_as_prev))
Output:
# A tibble: 37 x 2
start_month_year simActive
<chr> <int>
1 2019-01 29
2 2019-02 26
3 2019-03 30
4 2019-04 26
5 2019-05 25
6 2019-06 19
7 2019-07 19
8 2019-08 26
9 2019-09 21
10 2019-10 23
# ... with 27 more rows
It seems that in your sample data, all the clients have only one row, so I've adjusted it so that each of 200 clients has 5 rows. I then do something rather simple:
sanc %>% as_tibble() %>%
group_by(client, active = cumsum(start_date>lag(end_date) & row_number()>1)) %>%
filter(n()>1) %>%
ungroup() %>%
distinct(client, active) %>%
count(client, name="simActive")
This returns a list of clients, along with the number of times the client had simultaneous active sanctions.
Output:
# A tibble: 193 x 2
client simActive
<int> <int>
1 1 1
2 2 1
3 3 2
4 4 1
5 5 2
6 6 2
7 7 1
8 8 1
9 9 1
10 10 1
# ... with 183 more rows
So for client 1, there was one time when there were 2 or more active sanctions. The data for client 1 (see input below) looks like this; this client had rows 3 and 4 active at the same time.
client start_date end_date start_month_year
1 1 2019-03-18 2019-09-25 2019-03
2 1 2020-10-19 2019-12-03 2020-10
3 1 2021-03-11 2019-11-26 2021-03
4 1 2020-07-06 2021-09-03 2020-07
5 1 2021-05-11 2019-09-06 2021-05
Input:
set.seed(147)
sanc <-
data.frame(
client = rep(1:200, each = 5),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
)
sanc$start_month_year = format(as.Date(sanc$start_date, "%Y-%m-%d"), "%Y-%m")
Here is another way to do it. It might not be very performant, but the approach should yield the correct results. See my inline comments for how it works. Further note that I adjusted your sample data: you just sampled random start and end dates without making sure that start_date < end_date. I changed this so that each start_date is earlier than its end_date.
set.seed(147)
library(dplyr)
library(lubridate) # for interval() and int_overlaps()
sanc <-
tibble(
client = sample(1:500, 1000, replace = TRUE),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-06-01"), by = "day"), 1000),
end_date = round(runif(1000, min = 1, max = 150), 0 ) + start_date
)
sanc %>%
# make each sanction an `lubridate::interval`
mutate(int = interval(start_date, end_date)) %>%
# group_by month and client
group_by(month = format(start_date, "%Y-%m"), client) %>%
# use `lubridate::int_overlaps` to compare all intervals
summarise(overlap = list(outer(int, int, int_overlaps))) %>%
# apply to each row ...
rowwise() %>%
# to get only the lower triangle of each matrix and sum it up
mutate(overlap = sum(overlap[lower.tri(overlap)])) %>%
# now group by month
group_by(month) %>%
# and sum the overlapping sanction pairs in each month
summarise(overlap = sum(overlap))
#> `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.
#> # A tibble: 42 x 2
#> month overlap
#> <chr> <int>
#> 1 2019-01 0
#> 2 2019-02 0
#> 3 2019-03 0
#> 4 2019-04 0
#> 5 2019-05 0
#> 6 2019-06 1
#> 7 2019-07 1
#> 8 2019-08 2
#> 9 2019-09 1
#> 10 2019-10 3
#> # ... with 32 more rows
Created on 2022-03-09 by the reprex package (v2.0.1)
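If the goal is the number of clients with simultaneous sanctions per month, rather than the number of overlapping pairs, the same chain can end with a per-client check instead (a sketch based on my reading of the question; clients_with_overlap is my own name):
sanc %>%
  mutate(int = interval(start_date, end_date)) %>%
  group_by(month = format(start_date, "%Y-%m"), client) %>%
  summarise(overlap = list(outer(int, int, int_overlaps)), .groups = "drop") %>%
  rowwise() %>%
  mutate(n_overlap = sum(overlap[lower.tri(overlap)])) %>%
  group_by(month) %>%
  # count clients with at least one overlapping pair, not the pairs themselves
  summarise(clients_with_overlap = sum(n_overlap > 0))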
I have the following data frame in R:
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
And another data frame:
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
So, if we take a look at df sorted by target and comb, we notice that there are basically "groups": for example, for target=1 and comb=0 there are four entries (p1_start, p1_end, p2_start, p2_end), and the same holds for every other target/comb combination.
On the other side data contains entries with time being a timestamp.
Goal: I want to map the values from both data frames based on time.
Example: The first entry of data has time=2, which falls between p1_start and p1_end (for target=1, comb=0), so it should get the values target=1 and comb=0 mapped onto the data data frame.
Example 2: The entries of data with time=14 fall between p2_start and p2_end (for target=1, comb=1), so they should get the values target=1 and comb=1 mapped onto the data data frame.
Idea: I thought I'd iterate over df by target and comb and, for each combination, check whether there are rows in data whose time lies in between. The check itself could be done with the following command:
data[which(data$time > p1_start & data$time < p2_end),]
Once I get the rows, it is easy to append the values.
Problem: how could I do the iteration? I tried with the following:
df %>%
group_by(target, comb) %>%
print(data[which(data$time > df$p1_start & data$time < df$p2_end),])
But I am getting an error that time has not been initialized
Your problem is best known as a non-equi join: for each value in one or more given vectors, we need to find the corresponding range in a given data frame. This is well handled by the data.table package.
We would first transform your df into a format suitable for performing the join and then join data with df by time <= end while time >= start. Here is the code
library(data.table)
# split e.g. "p1_start" into type = "p1" and name = "start"
setDT(df)[, c("type", "name") := tstrsplit(name, "_", fixed = TRUE)]
# reshape so that start and end become separate columns
df <- dcast(df, ... ~ name, value.var = "time")
cols <- c("target", "comb", "type")
# non-equi update join: pull df's columns into data where start <= time <= end
setDT(data)[df, (cols) := mget(paste0("i.", cols)), on = .(time<=end, time>=start)]
After dcast, df looks like this
target comb type end start
1: 1 0 p1 3 1
2: 1 0 p2 7 5
3: 1 1 p1 11 9
4: 1 1 p2 15 13
5: 2 0 p1 19 17
6: 2 0 p2 23 21
7: 2 1 p1 27 25
8: 2 1 p2 31 29
And the output is
> data
time name target comb type
1: 2 a 1 0 p1
2: 5 b 1 0 p2
3: 8 c NA NA <NA>
4: 14 d 1 1 p2
5: 14 e 1 1 p2
6: 20 f NA NA <NA>
7: 21 g 2 0 p2
8: 26 h 2 1 p1
9: 28 i NA NA <NA>
10: 28 j NA NA <NA>
Here is a tidyverse solution:
library(tidyr)
library(dplyr)
df %>%
rename(name_df=name) %>%
mutate(x = time +1) %>%
pivot_longer(
cols = c(time, x),
names_to = "helper",
values_to = "time"
) %>%
right_join(data, by="time") %>%
select(time, name, target, comb)
time name target comb
<dbl> <chr> <dbl> <dbl>
1 2 a 1 0
2 5 b 1 0
3 8 c 1 0
4 14 d 1 1
5 14 e 1 1
6 20 f 2 0
7 21 g 2 0
8 26 h 2 1
9 28 i 2 1
10 28 j 2 1
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
library(fuzzyjoin)
library(tidyverse)
tmp <- df %>%
separate(name,
into = c("p", "period"),
sep = "_",
remove = TRUE) %>%
pivot_wider(
id_cols = c(p, target, comb),
names_from = period,
values_from = time
) %>%
select(-p)
fuzzy_left_join(
  x = data,
  y = tmp,
  by = c("time" = "start",
         "time" = "end"),
  match_fun = list(`>=`, `<=`)) # i.e. time >= start and time <= end
#> time name target comb start end
#> 1 2 a 1 0 1 3
#> 2 5 b 1 0 5 7
#> 3 8 c NA NA NA NA
#> 4 14 d 1 1 13 15
#> 5 14 e 1 1 13 15
#> 6 20 f NA NA NA NA
#> 7 21 g 2 0 21 23
#> 8 26 h 2 1 25 27
#> 9 28 i NA NA NA NA
#> 10 28 j NA NA NA NA
Created on 2022-01-11 by the reprex package (v2.0.1)
I have to create a function (or loop) in R to detect hyper-frequent patients.
The requirement for being hyper-frequent is visiting 3 times within 180 days; once a patient meets it, they are hyper-frequent not only on future visits, but also retroactively on the past visits where the requirement was not yet met.
pacient <- c(10,10,10,10,10,11,11,12,12,12,13, 13, 15, 14); pacient
date <- as.Date(c("01/01/2018","02/05/2018", "04/06/2018", "10/11/2018", "05/12/2018", "02/01/2018", "06/08/2018", "01/01/2018", "03/01/2018", "06/03/2018", "05/08/2018", "05/08/2019", "05/07/2019", "08/07/2017"), format = "%d/%m/%Y"); date
DF <- data.frame(pacient, date); DF
library(data.table)
count_visit <- function(x){
  DT <- data.table(x)
  DTord <- DT[order(date)]                              # sort by visit date
  DTord[, num_visit := seq_len(.N), by = pacient]       # running visit number per patient
  DTordID <- DTord[order(pacient)]
  DTordID[, max_visit := max(num_visit), by = pacient]  # total visits per patient
  as.data.frame(DTordID)
}
REUP_visit <- count_visit(DF); head(REUP_visit)
pacient date num_visit max_visit
10 01/01/2018 1 5
10 02/05/2018 2 5
10 04/06/2018 3 5
10 10/11/2018 4 5
10 05/12/2018 5 5
11 02/01/2018 1 2
11 06/08/2018 2 2
12 01/01/2018 1 3
12 03/01/2018 2 3
12 06/03/2018 3 3
13 05/08/2018 1 2
13 05/08/2019 2 2
14 08/07/2017 1 1
15 05/07/2019 1 1
So far I have only managed to create a function that gives me the number of visits per patient and the maximum number of visits a patient has had (which I need for something else). The output I am after looks like this:
pacient date num_visit max_visit days_visit <180 future_hyperf past_hyperf
10 01/01/2018 1 5 0 1 no yes
10 02/05/2018 2 5 121 2 no yes
10 04/06/2018 3 5 33 3 yes yes
10 10/11/2018 4 5 159 4 yes yes
10 05/12/2018 5 5 25 5 yes yes
11 02/01/2018 1 2 0 1 no no
11 06/08/2018 2 2 216 1 no no
12 01/01/2018 1 3 0 1 no yes
12 03/01/2018 2 3 2 2 no yes
12 06/03/2018 3 3 62 3 yes yes
13 05/08/2018 1 2 0 1 no no
13 05/08/2019 2 2 365 1 no no
14 08/07/2017 1 1 0 1 no no
15 05/07/2019 1 1 0 1 no no
The output I need is one that has: "days_visit", "<180", "future_hyperf" and "past_hyperf".
The objective of the variable "days_visit" is to set the patient's first visit to the emergency room to 0 and then count the days between consecutive visits.
DF <- DF %>%
group_by(pacient) %>%
arrange(date) %>%
mutate(days_visit= date - lag(date, default = first(date)))
The variable "<180" would be the variable that number 1 the first time it comes, 2 the second (if it is <180 days with the previous visit), 3 (if it is <180 days with the previous visit) and so on . If, for example, the patient reaches 2 and the third visit does not meet <180 days, it would be necessary to put 1 again (the loop would be restarted).
The variable "future_hyperf" says yes or no. It is marked as if it made the future once the patient reaches 3 in the variable <180, it does not matter if the visits are later than 180 days and does not comply. Once the criterion is met, it is forever.
The variable "past_hyperf" converts all the patients that have if in the variable "future_hyperf" in itself also to the past.
Thank you!
SOLUTION
library(data.table) # for rowid() and shift()
DF3 <- DF %>%
arrange(pacient, date) %>%
group_by(pacient) %>%
mutate(days_visit = as.integer(date - lag(date, default = first(date))) ,
less_180 = days_visit < 180) %>%
mutate(counter = rowid(pacient, cumsum(date - shift(date, fill=first(date)) > 180)),
future_hyperf = case_when(counter >= 3 ~ "yes",
TRUE ~ "no"),
past_hyperf = case_when(max(counter, na.rm = T) >= 3 ~ "yes",
TRUE ~ "no"))
DF3 <- DF3[with(DF3,order(pacient,date)),]
Try this:
pacient <- c(10, 10, 10, 10, 10, 11, 11, 12, 12, 12, 13, 13, 15, 14)
pacient
date <-
as.Date(
c(
"01/01/2018",
"02/05/2018",
"04/06/2018",
"10/11/2018",
"05/12/2018",
"02/01/2018",
"06/08/2018",
"01/01/2018",
"03/01/2018",
"06/03/2018",
"05/08/2018",
"05/08/2019",
"05/07/2019",
"08/07/2017"
),
format = "%d/%m/%Y"
)
date
DF <- data.frame(pacient, date)
DF
#packages
library(dplyr)
library(lubridate)
# time locale (for consistent date handling)
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
DF <- DF %>%
group_by(pacient) %>%
mutate(num_visit = cumsum(pacient) / pacient) %>% # number of visits (equivalent to row_number())
mutate(max_visit = max(num_visit)) %>% # max visit
mutate(days_visit = as.Date(date, "%d/%m/%Y") - lag(as.Date(date, "%d/%m/%Y"))) %>% # days between visits
mutate(minus_180_days = case_when(days_visit < 180 &
!is.na(days_visit) ~ num_visit,
TRUE ~ 1)) %>% # is days between visits < 180
mutate(future_hyperf = case_when(minus_180_days >= 3 ~ "yes", # >= 3 to match the 3-visits-in-180-days rule
TRUE ~ "no")) %>% # future hyperf
mutate(past_hyperf = case_when(max(minus_180_days, na.rm = T) >= 3 ~ "yes",
TRUE ~ "no")) # past hyperf
Hope it helps
Here is how I would do it. The explanation is in the annotations.
library(tidyverse)
DF %>%
group_by(pacient) %>% # group the data by "pacient"
mutate(lag_date = lag(date, n = 2)) %>% # create the variable of lag dates by 2 visits
mutate(date_diff = as.integer(date - lag_date)) %>% # Calculate the difference in dates
mutate(date_diff = case_when(is.na(date_diff) ~ 9999L, # replace NAs with 9999 (cummin does not allow na.rm)
TRUE ~ date_diff)) %>% #
mutate(min_period = cummin(date_diff)) %>% # calculate the cumulative minimum of the differencce
mutate(future_hyperf = min_period < 180) %>% # check the cumulative min is less than 180
mutate(past_hyperf = min(min_period) < 180) %>%
ungroup()
## # A tibble: 14 x 7
## pacient date lag_date date_diff min_period future_hyperf past_hyperf
## <dbl> <date> <date> <int> <int> <lgl> <lgl>
## 1 10 2018-01-01 NA 9999 9999 FALSE TRUE
## 2 10 2018-05-02 NA 9999 9999 FALSE TRUE
## 3 10 2018-06-04 2018-01-01 154 154 TRUE TRUE
## 4 10 2018-11-10 2018-05-02 192 154 TRUE TRUE
## 5 10 2018-12-05 2018-06-04 184 154 TRUE TRUE
## 6 11 2018-01-02 NA 9999 9999 FALSE FALSE
## 7 11 2018-08-06 NA 9999 9999 FALSE FALSE
## 8 12 2018-01-01 NA 9999 9999 FALSE TRUE
## 9 12 2018-01-03 NA 9999 9999 FALSE TRUE
## 10 12 2018-03-06 2018-01-01 64 64 TRUE TRUE
## 11 13 2018-08-05 NA 9999 9999 FALSE FALSE
## 12 13 2019-08-05 NA 9999 9999 FALSE FALSE
## 13 15 2019-07-05 NA 9999 9999 FALSE FALSE
## 14 14 2017-07-08 NA 9999 9999 FALSE FALSE
I have a data frame as follows:
df <- data.frame(
Item=c("A","A","A","A","A","B","B","B","B","B"),
Date=c("2018-1-1","2018-2-1","2018-3-1","2018-4-1","2018-5-1","2018-1-1","2018-2-1",
"2018-3-1","2018-4-1","2018-5-1"),
Value=rnorm(10))
I want to mutate a new column, grouped by Item, that counts the number of values higher than 0 within a window of 3 (or any other integer I specify).
I am familiar with tidyverse, therefore, a dplyr solution would be most welcome.
Think zoo:: package if you want to roll anything.
df$new<-
zoo::rollsum( df$Value > 0, 3, fill = NA )
# Item Date Value new
#1 A 2018-1-1 0.5852699 NA
#2 A 2018-2-1 -0.7383377 1
#3 A 2018-3-1 -0.3157693 1
#4 A 2018-4-1 1.2475237 1
#5 A 2018-5-1 -1.5479757 1
#6 B 2018-1-1 -0.6913331 0
#7 B 2018-2-1 -0.2423809 0
#8 B 2018-3-1 -1.6363024 0
#9 B 2018-4-1 -0.3256263 1
#10 B 2018-5-1 0.3563144 NA
You also have a choice of window position: have a closer look at the argument align = c("center", "left", "right").
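For instance, a quick sketch of the difference on a toy vector:
x <- c(1, -1, 2, 3, -2)
zoo::rollsum(x > 0, 3, fill = NA, align = "center") # NA 2 2 2 NA, window centred on each row
zoo::rollsum(x > 0, 3, fill = NA, align = "right")  # NA NA 2 2 2, window ends at each row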
So as a dplyr chain:
df %>% group_by(Item) %>% dplyr::mutate(new = zoo::rollsum(Value > 0, 3, fill = NA))
You could use the RcppRoll package.
require(RcppRoll)
df$new <- RcppRoll::roll_sum(df$Value > 0, 3, fill = NA)
Using Tidyverse:
df %>%
group_by(Item) %>%
dplyr::mutate(new = RcppRoll::roll_sum(Value > 0, 3, fill = NA))
Speed-wise, this is faster than the zoo package:
n <- 10000
df <- data.frame(
Item = sample(LETTERS, n, replace = TRUE),
Value = rnorm(n))
df_grouped <- df %>%
group_by(Item)
microbenchmark::microbenchmark(
RcppRoll = df_grouped <- df_grouped %>% dplyr::mutate(new_RcppRoll = RcppRoll::roll_sum(Value > 0, 3, fill = NA)),
zoo = df_grouped <- df_grouped %>% dplyr::mutate(new_zoo = zoo::rollsum( Value > 0, 3, fill = NA ))
)
Results in:
Unit: milliseconds
expr min lq mean median uq max neval
RcppRoll 2.509003 2.741993 2.929227 2.83913 2.983726 5.832962 100
zoo 11.172920 11.785113 13.288970 12.43320 13.607826 25.879754 100
And
all.equal(df_grouped$new_RcppRoll, df_grouped$new_zoo)
TRUE
Changed the rnorm example for clarity, using sample(-5:5):
   Item  Date       Value
   <fct> <date>     <int>
 1 A     2018-01-01     3
 2 B     2018-01-01     2
 3 B     2018-02-01    -5
 4 A     2018-02-01    -3
 5 A     2018-03-01     4
 6 B     2018-03-01    -2
 7 A     2018-04-01     5
 8 B     2018-04-01     0
 9 A     2018-05-01     1
10 B     2018-05-01    -4
df <- df %>%
  mutate(greater_than = (Value > 0) * Value) %>%
  group_by(Item) %>%
  arrange(Date) %>%
  mutate(greater_than = zoo::rollapplyr(greater_than, 3, sum, partial = TRUE))
df %>% arrange(Item) %>% head(10)
Should look like this:
1 A 2018-01-01 3 3
2 A 2018-02-01 -3 3
3 A 2018-03-01 4 7
4 A 2018-04-01 5 9
5 A 2018-05-01 1 10
6 B 2018-01-01 2 2
7 B 2018-02-01 -5 2
8 B 2018-03-01 -2 2
9 B 2018-04-01 0 0
10 B 2018-05-01 -4 0
I have a table with dates stored as numbers and a value for each date. Now I'd like to add another column, weekSum, which contains the sum of value over the last week. However, some dates are missing (so I can't always use the current and previous 6 rows). My table looks like this:
df <- data.frame('date' = c(20160309, 20160310, 20160311, 20160312, 20160313, 20160314, 20160315, 20160316, 20160318, 20160319, 20160321), 'value' = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
date value
20160309 1
20160310 2
20160311 3
20160312 4
20160313 5
20160314 6
20160315 7
20160316 8
20160318 9 #17th skipped
20160319 10
20160321 11 #20th skipped
I'd like to get the following as output:
date value weekSum
20160309 1 NA
20160310 2 NA
20160311 3 NA
20160312 4 NA
20160313 5 NA
20160314 6 NA
20160315 7 28 # 1+2+3+4+5+6+7
20160316 8 35 # 2+3+4+5+6+7+8
20160318 9 39 # 4+5+6+7+8+9
20160319 10 45 # 5+6+7+8+9+10
20160321 11 45 # 7+8+9+10+11
How can this be done?
1) Convert the data frame to zoo and define a weekSum function which subsets its input to the last week and sums that. Then use rollapplyr with coredata = FALSE so that it passes a zoo object with times, not just the core data, to the weekSum function.
library(zoo)
z <- read.zoo(df, format = "%Y%m%d")
weekSum <- function(z) sum(z[time(z) > tail(time(z), 1) - 7])
transform(df, weekSum = rollapplyr(z, 7, weekSum, fill = NA, coredata = FALSE))
giving:
date value weekSum
2016-03-09 20160309 1 NA
2016-03-10 20160310 2 NA
2016-03-11 20160311 3 NA
2016-03-12 20160312 4 NA
2016-03-13 20160313 5 NA
2016-03-14 20160314 6 NA
2016-03-15 20160315 7 28
2016-03-16 20160316 8 35
2016-03-18 20160318 9 39
2016-03-19 20160319 10 45
2016-03-21 20160321 11 45
2) An alternative is to fill in the value at the missing dates with zero and then just use rollsumr with width of 7. z is from (1).
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0)
transform(df, weekSum = rollsumr(z0, 7, fill = NA)[z0 != 0]) # [z0 != 0] drops the filled-in dates again
With base R it can be done like this:
df$date <- as.Date(as.character(df$date), format = "%Y%m%d") # convert the numeric dates first
res <- merge(df, data.frame(date = seq(df$date[1], to = df$date[nrow(df)], by = "days")), all.y = TRUE)
res$weekSum <- NA
for(i in seq_len(nrow(res))[-seq_len(6)]){
  res$weekSum[i] <- sum(res$value[(i - 6):i], na.rm = TRUE)
}
res <- res[!is.na(res$value), ]
res
#          date value weekSum
#1  2016-03-09     1      NA
#2  2016-03-10     2      NA
#3  2016-03-11     3      NA
#4  2016-03-12     4      NA
#5  2016-03-13     5      NA
#6  2016-03-14     6      NA
#7  2016-03-15     7      28
#8  2016-03-16     8      35
#10 2016-03-18     9      39
#11 2016-03-19    10      45
#13 2016-03-21    11      45
Here is an approach using tidyverse tools. This method uses tidyr::complete to construct the full date sequence, making it easy to take the current row and the previous 6 as suggested. Be careful here if there are NA values in value to begin with, as currently those rows will be filtered out at the end. Tweaks are possible to avoid this case if necessary.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df <- data.frame('date' = c(20160309, 20160310, 20160311, 20160312, 20160313, 20160314, 20160315, 20160316, 20160318, 20160319, 20160321), 'value' = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
df %>%
mutate(date = ymd(date)) %>%
complete(date = seq.Date(min(date), max(date), by = 1)) %>%
arrange(date) %>%
mutate(
newval = replace_na(value, 0),
weekSum = newval + lag(newval) + lag(newval, 2) + lag(newval, 3) +
lag(newval, 4) + lag(newval, 5) + lag(newval, 6)
) %>%
select(-newval) %>%
filter(!is.na(value))
#> # A tibble: 11 x 3
#> date value weekSum
#> <date> <dbl> <dbl>
#> 1 2016-03-09 1. NA
#> 2 2016-03-10 2. NA
#> 3 2016-03-11 3. NA
#> 4 2016-03-12 4. NA
#> 5 2016-03-13 5. NA
#> 6 2016-03-14 6. NA
#> 7 2016-03-15 7. 28.
#> 8 2016-03-16 8. 35.
#> 9 2016-03-18 9. 39.
#> 10 2016-03-19 10. 45.
#> 11 2016-03-21 11. 45.
Created on 2018-05-07 by the reprex package (v0.2.0).