I have to create a function (or loop) in R to detect hyper-frequent patients.
The requirement for hyper-frequent is coming 3 times within 180 days. Once a person meets that requirement, they become hyper-frequent not only on future visits but also retroactively, on the past visits where the requirement was not yet met.
pacient <- c(10,10,10,10,10,11,11,12,12,12,13, 13, 15, 14); pacient
date <- as.Date(c("01/01/2018","02/05/2018", "04/06/2018", "10/11/2018", "05/12/2018", "02/01/2018", "06/08/2018", "01/01/2018", "03/01/2018", "06/03/2018", "05/08/2018", "05/08/2019", "05/07/2019", "08/07/2017"), format = "%d/%m/%Y"); date
DF <- data.frame(pacient, date); DF
library(data.table)

count_visit <- function(x){
  DT <- data.table(x)
  DTord <- DT[order(date)]                         # sort by visit date
  DTord[, num_visit := seq_len(.N), by = pacient]  # visit number per patient
  DTordID <- DTord[order(pacient)]
  DTordID[, max_visit := max(num_visit), by = pacient]
  framedatos <- as.data.frame(DTordID)
  return(framedatos)
}
REUP_visit <- count_visit(DF); head(REUP_visit)
pacient date num_visit max_visit
10 01/01/2018 1 5
10 02/05/2018 2 5
10 04/06/2018 3 5
10 10/11/2018 4 5
10 05/12/2018 5 5
11 02/01/2018 1 2
11 06/08/2018 2 2
12 01/01/2018 1 3
12 03/01/2018 2 3
12 06/03/2018 3 3
13 05/08/2018 1 2
13 05/08/2019 2 2
14 08/07/2017 1 1
15 05/07/2019 1 1
So far I have only managed to create the function above, which gives me the number of visits per patient and each patient's maximum number of visits (that maximum I need for something else). The output I need would look like this:
pacient date num_visit max_visit days_visit <180 future_hyperf past_hyperf
10 01/01/2018 1 5 0 1 no yes
10 02/05/2018 2 5 121 2 no yes
10 04/06/2018 3 5 33 3 yes yes
10 10/11/2018 4 5 159 4 yes yes
10 05/12/2018 5 5 25 5 yes yes
11 02/01/2018 1 2 0 1 no no
11 06/08/2018 2 2 216 1 no no
12 01/01/2018 1 3 0 1 no yes
12 03/01/2018 2 3 2 2 no yes
12 06/03/2018 3 3 62 3 yes yes
13 05/08/2018 1 2 0 1 no no
13 05/08/2019 2 2 365 1 no no
14 08/07/2017 1 1 0 1 no no
15 05/07/2019 1 1 0 1 no no
The output I need adds the variables "days_visit", "<180", "future_hyperf" and "past_hyperf".
The purpose of "days_visit" is to set the patient's first emergency-room visit to 0 and then count the days between consecutive visits. My attempt:
library(dplyr)

DF <- DF %>%
  group_by(pacient) %>%
  arrange(date) %>%
  mutate(days_visit = date - lag(date, default = first(date)))
The variable "<180" would be the variable that number 1 the first time it comes, 2 the second (if it is <180 days with the previous visit), 3 (if it is <180 days with the previous visit) and so on . If, for example, the patient reaches 2 and the third visit does not meet <180 days, it would be necessary to put 1 again (the loop would be restarted).
The variable "future_hyperf" says yes or no. It is marked as if it made the future once the patient reaches 3 in the variable <180, it does not matter if the visits are later than 180 days and does not comply. Once the criterion is met, it is forever.
The variable "past_hyperf" converts all the patients that have if in the variable "future_hyperf" in itself also to the past.
Thank you!
SOLUTION
library(dplyr)
library(data.table) # for rowid() and shift()

DF3 <- DF %>%
  arrange(pacient, date) %>%
  group_by(pacient) %>%
  mutate(days_visit = as.integer(date - lag(date, default = first(date))),
         less_180 = days_visit < 180) %>%
  mutate(counter = rowid(pacient, cumsum(date - shift(date, fill = first(date)) > 180)),
         future_hyperf = case_when(counter >= 3 ~ "yes",
                                   TRUE ~ "no"),
         past_hyperf = case_when(max(counter, na.rm = TRUE) >= 3 ~ "yes",
                                 TRUE ~ "no"))
DF3 <- DF3[with(DF3,order(pacient,date)),]
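One caveat, hedged: if a patient's counter ever resets after reaching 3, case_when(counter >= 3 ~ "yes") flips back to "no" on later visits, while the stated requirement makes the flag permanent. A sticky variant (a sketch; cummax() carries the flag forward within each patient):

DF3 <- DF3 %>%
  group_by(pacient) %>%
  mutate(future_hyperf = ifelse(cummax(counter >= 3) == 1, "yes", "no")) %>%
  ungroup()

This changes nothing on the sample data, where no counter resets after reaching 3.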
Try this:
pacient <- c(10, 10, 10, 10, 10, 11, 11, 12, 12, 12, 13, 13, 15, 14)
pacient
date <-
as.Date(
c(
"01/01/2018",
"02/05/2018",
"04/06/2018",
"10/11/2018",
"05/12/2018",
"02/01/2018",
"06/08/2018",
"01/01/2018",
"03/01/2018",
"06/03/2018",
"05/08/2018",
"05/08/2019",
"05/07/2019",
"08/07/2017"
),
format = "%d/%m/%Y"
)
date
DF <- data.frame(pacient, date)
DF
# packages
library(dplyr)
library(lubridate)
# locale: force the C LC_TIME locale so date parsing behaves consistently
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
DF <- DF %>%
  group_by(pacient) %>%
  mutate(num_visit = row_number()) %>% # number of visits
  mutate(max_visit = max(num_visit)) %>% # max visit
  mutate(days_visit = date - lag(date)) %>% # days between visits (date is already a Date)
  mutate(minus_180_days = case_when(days_visit < 180 &
                                      !is.na(days_visit) ~ num_visit,
                                    TRUE ~ 1)) %>% # carry num_visit while gaps stay < 180, else reset to 1
  mutate(future_hyperf = case_when(minus_180_days >= 3 ~ "yes",
                                   TRUE ~ "no")) %>% # future hyperf: >= 3, so the third qualifying visit counts
  mutate(past_hyperf = case_when(max(minus_180_days, na.rm = TRUE) >= 3 ~ "yes",
                                 TRUE ~ "no")) # past hyperf
Hope it helps
Here is how I would do it. The explanation is in the annotations.
library(tidyverse)

DF %>%
  group_by(pacient) %>% # group the data by "pacient"
  mutate(lag_date = lag(date, n = 2)) %>% # date of the visit two visits earlier
  mutate(date_diff = as.integer(date - lag_date)) %>% # difference in days
  mutate(date_diff = case_when(is.na(date_diff) ~ 9999L, # replace NAs with 9999 (cummin has no na.rm)
                               TRUE ~ date_diff)) %>%
  mutate(min_period = cummin(date_diff)) %>% # cumulative minimum of the difference
  mutate(future_hyperf = min_period < 180) %>% # check whether the cumulative min is under 180
  mutate(past_hyperf = min(min_period) < 180) %>%
  ungroup()
## # A tibble: 14 x 7
## pacient date lag_date date_diff min_period future_hyperf past_hyperf
## <dbl> <date> <date> <int> <int> <lgl> <lgl>
## 1 10 2018-01-01 NA 9999 9999 FALSE TRUE
## 2 10 2018-05-02 NA 9999 9999 FALSE TRUE
## 3 10 2018-06-04 2018-01-01 154 154 TRUE TRUE
## 4 10 2018-11-10 2018-05-02 192 154 TRUE TRUE
## 5 10 2018-12-05 2018-06-04 184 154 TRUE TRUE
## 6 11 2018-01-02 NA 9999 9999 FALSE FALSE
## 7 11 2018-08-06 NA 9999 9999 FALSE FALSE
## 8 12 2018-01-01 NA 9999 9999 FALSE TRUE
## 9 12 2018-01-03 NA 9999 9999 FALSE TRUE
## 10 12 2018-03-06 2018-01-01 64 64 TRUE TRUE
## 11 13 2018-08-05 NA 9999 9999 FALSE FALSE
## 12 13 2019-08-05 NA 9999 9999 FALSE FALSE
## 13 15 2019-07-05 NA 9999 9999 FALSE FALSE
## 14 14 2017-07-08 NA 9999 9999 FALSE FALSE
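The heart of this approach is an identity worth stating: on date-sorted visits, three visits fall inside 180 days exactly when the current date minus the date two visits back is under 180. A quick check on patient 10, using the DF built above:

d10 <- sort(DF$date[DF$pacient == 10])
as.integer(d10 - dplyr::lag(d10, 2))
#> [1]  NA  NA 154 192 184   # 154 < 180 flags the first three-visit window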
Related
I'm looking to find >=4 unique events within a group that all occur inside a 90-day period, and then flag the ID.
Just a test example:
library(dplyr)
set.seed(1)
test <- data.frame(
PATID = sample(1:1e4, 1e5, replace = TRUE),
PROV = sample(1:50, 1e5, replace = TRUE),
GROUP = sample(0:1, 1e5, replace = TRUE),
DATE = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")
)
If we look at PATID == 5 we can see there are 4 unique PROVs with dates within 90 days of each other inside our group of interest, so this ID should be flagged.
> test %>% filter(PATID==5) %>% arrange(GROUP,DATE)
PATID PROV GROUP DATE
1 5 2 0 2020-05-07
2 5 3 0 2020-05-20
3 5 3 0 2020-11-15
4 5 49 0 2020-12-14
5 5 45 1 2020-02-16
6 5 50 1 2020-03-19
7 5 38 1 2020-03-25
8 5 27 1 2020-03-29
9 5 42 1 2020-08-30
10 5 46 1 2020-11-03
11 5 25 1 2020-11-13
12 5 29 1 2020-12-26
> as.Date("2020-03-29")-as.Date("2020-02-16")<=90
[1] TRUE
Ultimately, I'm looking for the proportion of GROUP==1 vs GROUP==0 with >=4 unique PROVs within 90 days. Ideally I'd prefer using data.table simply due to the scale of data.
Trying out some code:
test %>%
filter(PATID %in% 1:5) %>%
group_by(PATID,GROUP) %>%
arrange(GROUP, DATE) %>%
mutate(lag = DATE - lag(DATE),
day_count = case_when(lag <= 90 ~ TRUE,
is.na(lag) ~ TRUE,
TRUE ~ FALSE)) %>%
mutate(crit = cumsum_reset(day_count)) %>%
ungroup() %>%
group_by(PATID) %>%
mutate(flag = case_when(max(crit) >= 4 ~ 1,
TRUE ~ 0)) %>%
arrange(PATID)
Getting closer; I just need to sort out a proper rolling 90-day window rather than crudely testing whether each date is within 90 days of the previous one. Note that cumsum_reset() above is my own helper, not a built-in; see the sketch right after this.
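For reference, cumsum_reset() is not a base R or dplyr function; one plausible definition (an assumption about the intent: a running count of TRUEs that restarts at every FALSE) is:

cumsum_reset <- function(x) {
  # restart the running sum of TRUEs whenever a FALSE appears
  ave(as.numeric(x), cumsum(!x), FUN = cumsum)
}
cumsum_reset(c(TRUE, TRUE, FALSE, TRUE)) # 1 2 0 1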
Maybe the following is what you are after; please check whether the logic matches what you meant. I left it more explicit than necessary so that the idea can be more easily understood. The main idea: after sorting, if an observation from the same PATID & GROUP lies within 90 days of its 3rd lag (diff_3 := DATE - shift(DATE, 3)), it should be flagged. This is done by checking diff_check = diff_3 <= 90. If any observation for a PATID/GROUP is flagged, the whole ID gets flagged by keep = max(diff_check, na.rm = TRUE, pmin = 0) after grouping by PATID only (the pmin = 0 is absorbed into max()'s ..., so groups that are all NA return 0 rather than -Inf).
Using the third lag accounts for "4 or more", not strictly "more than 4", observations.
Does it, all in all, make sense?
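To see why the third lag captures "4 within 90 days": on sorted dates, if visit i and visit i-3 are at most 90 days apart, then visits i-3 through i are four visits inside one window. Checking on PATID 5's GROUP 1 dates from above:

d <- as.Date(c("2020-02-16", "2020-03-19", "2020-03-25", "2020-03-29"))
d[4] - d[1] <= 90 # TRUE: all four dates fit in a single 90-day window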
library(data.table)
set.seed(1)
test <- data.frame(
PATID = sample(1:1e4, 1e5, replace = TRUE),
PROV = sample(1:50, 1e5, replace = TRUE),
GROUP = sample(0:1, 1e5, replace = TRUE),
DATE = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")
)
dt <- as.data.table(test)
dt <- dt[order(PATID, GROUP, DATE)]
dt[, diff_3 := DATE - shift(DATE, 3), by = c("PATID", "GROUP")]
# check amount of unique values of PROV in previous 4 observations
dt[, unique_last_4 := frollapply(x = PROV, n = 4, FUN = uniqueN), by = c("PATID", "GROUP")]
# check if within 90 days and unique PROVs
dt[, diff_check := diff_3<=90 & unique_last_4==4, by = c("PATID", "GROUP")]
# final check to flag all observations of ID that satisfied at least once the above checks
dt[, to_keep := max(diff_check, na.rm = TRUE, pmin = 0), by = "PATID"]
# NOTE: unsure if you mean to group only by PATID here or by PATID & GROUP.
head(dt[to_keep==1], 20)
#> PATID PROV GROUP DATE diff_3 unique_last_4 diff_check to_keep
#> 1: 5 2 0 2020-05-07 NA days NA NA 1
#> 2: 5 3 0 2020-05-20 NA days NA NA 1
#> 3: 5 3 0 2020-11-15 NA days NA NA 1
#> 4: 5 49 0 2020-12-14 221 days 3 FALSE 1
#> 5: 5 45 1 2020-02-16 NA days NA NA 1
#> 6: 5 50 1 2020-03-19 NA days NA NA 1
#> 7: 5 38 1 2020-03-25 NA days NA NA 1
#> 8: 5 27 1 2020-03-29 42 days 4 TRUE 1
#> 9: 5 42 1 2020-08-30 164 days 4 FALSE 1
#> 10: 5 46 1 2020-11-03 223 days 4 FALSE 1
#> 11: 5 25 1 2020-11-13 229 days 4 FALSE 1
#> 12: 5 29 1 2020-12-26 118 days 4 FALSE 1
#> 13: 7 1 0 2020-04-10 NA days NA NA 1
#> 14: 7 44 0 2020-04-29 NA days NA NA 1
#> 15: 7 27 0 2020-05-05 NA days NA NA 1
#> 16: 7 41 0 2020-06-11 62 days 4 TRUE 1
#> 17: 7 35 0 2020-06-30 62 days 4 TRUE 1
#> 18: 7 11 0 2020-12-18 227 days 4 FALSE 1
#> 19: 7 24 1 2020-12-24 NA days NA NA 1
#> 20: 7 13 1 2020-12-29 NA days NA NA 1
Created on 2021-06-22 by the reprex package (v2.0.0)
dplyr version
test_keep <- test %>% arrange(PATID, GROUP, DATE) %>%
head(1000) %>% # otherwise it takes too long in my pc, which shows data.table's efficiency!
group_by(PATID, GROUP) %>%
mutate(diff_3 = DATE - lag(DATE, 3),
diff_check = diff_3<=90,
unique_last_4 = frollapply(x = PROV, n = 4, FUN = uniqueN)
) %>% group_by(PATID) %>%
mutate(keep = max(diff_check, na.rm = TRUE, pmin = 0)) %>%
arrange(PATID, GROUP)
test_keep %>% filter(keep==1) %>% head(20)
Based on "I'm looking for the annual group proportion of patients that visit >=4 providers within 90 days", you can try this:
library(data.table) #data.table 1.13.2
setDT(test)[, c("d90ago", "d90aft") := .(DATE - 90L, DATE + 90L)]
setkey(test, PATID, DATE)
test[, grp :=
.SD[.SD, on=.(PATID, DATE>=d90ago, DATE<=d90aft), by=.EACHI, +(length(unique(x.PROV))>=4L)]$V1
]
The above allows PROV within overlapping windows of 90 days to be re-used.
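From there, the proportion the asker describes can be sketched (a guess at the aggregation: reduce to one flag per patient and group, then take the share of flagged patients per GROUP):

test[, .(flagged = max(grp)), by = .(PATID, GROUP)][, .(prop = mean(flagged)), by = GROUP]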
There are some ambiguities in the question, so this may not be quite right. I tried doing this using dplyr and local data frames, but the self-join causes an overflow (100,000 times 100,000).
It seems to work using data.table and using PostgreSQL, which has an OVERLAPS function.
(Note that I used lower-case variable names to make working with SQL easier.)
In the answer below, I start with a patient visit ((patid, prov, group, date) combination) and look forward 90 days to capture all visits by that patient (patid) to other providers (prov != prov_other). I then count the number of distinct providers in that lookahead period (this will be NA when there are no visits, as when looking at a patient's last visit in the sample). I then count the number of visits where the number of additional distinct providers in the subsequent 90 days is 3 or more.
Finally, I group by (group, year) and count the proportion of visits that are followed by visits to at least three other providers during the subsequent 90 days. Given the way the data are generated, it is no surprise that the two groups look similar on this metric.
Note that each patient visit forms a unit of observation here. In practice, it may make sense to aggregate by (say) (patid, year) before calculating statistics or do some other kind of aggregation.
library(data.table)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
test <- tibble(
patid = sample(1:1e4, 1e5, replace = TRUE),
prov = sample(1:50, 1e5, replace = TRUE),
group = sample(0:1, 1e5, replace = TRUE),
date = as.Date(sample(
as.Date("2020-01-01"):as.Date("2020-12-31"),
1e5,
replace = TRUE
), origin = "1970-01-01")) %>%
as.data.table()
test
#> patid prov group date
#> 1: 1017 6 1 2020-08-03
#> 2: 8004 34 0 2020-12-15
#> 3: 4775 32 0 2020-06-21
#> 4: 9725 47 1 2020-09-25
#> 5: 8462 15 0 2020-03-05
#> ---
#> 99996: 949 47 0 2020-07-05
#> 99997: 2723 37 0 2020-08-18
#> 99998: 201 27 1 2020-01-06
#> 99999: 163 9 0 2020-03-06
#> 100000: 3204 48 1 2020-11-17
df_overlap <-
test %>%
inner_join(test, by = "patid", suffix = c("", "_other")) %>%
filter(prov != prov_other) %>%
filter(date_other >= date & date_other <= date + 90L)
mt_4_provs_df <-
df_overlap %>%
group_by(patid, prov, group, date) %>%
summarize(n_providers = n_distinct(prov_other), .groups = "drop")
results <-
test %>%
left_join(mt_4_provs_df, by = c("patid", "prov", "group", "date")) %>%
mutate(mt_4_provs = n_providers >= 3,
year = year(date)) %>%
group_by(group, year) %>%
summarize(prop_mt_4_provs = mean(mt_4_provs, na.rm = TRUE),
.groups = "drop")
results
#> # A tibble: 2 x 3
#> group year prop_mt_4_provs
#> <int> <int> <dbl>
#> 1 0 2020 0.426
#> 2 1 2020 0.423
Created on 2021-06-22 by the reprex package (v2.0.0)
I am trying to use the apply function to rows within a grouped dataframe to check for the existence of other rows within that group that match certain conditions dependent on each row. I am able to get this to work for one group but not for all.
For example, with no grouping:
library(dplyr)
id <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
station <- c(1, 2, 3, 3, 2, 2, 1, 1, 3, 2, 2)
timeslot <- c(13, 14, 20, 21, 24, 23, 8, 9, 10, 15, 16)
df <- data.frame(id, station, timeslot)
s <- 2
df <-
df %>%
filter(id == 1) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
In the above code, for each station 2 row, I am trying to check all other rows to see if there exists a timeslot with a value of one greater (for any station). This works as expected.
Then, I go on to apply this to a grouped dataframe:
df <-
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = ifelse(station == s, apply(., 1, function(x) (any(as.numeric(x[3] + 1) == .$timeslot))), FALSE))
id station timeslot match
<int> <int> <int> <lgl>
1 1 1 13 FALSE
2 1 2 14 TRUE
3 1 3 20 FALSE
4 1 3 21 FALSE
5 1 2 23 TRUE
6 1 2 24 FALSE
7 2 1 8 FALSE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 FALSE
11 2 2 16 TRUE
and get some unwanted results. It seems the function is not applied by group, and I can't figure out how to fix this. How can I apply this function so that only the other rows within a group are checked? In reality, my dataset is much bigger and the conditions are more complex, so it is not running quickly either.
Thanks in advance
Edit: I should add that I have also tried a solution using arrange() and lead(), but since some timeslot values are shared by many stations in my larger dataset, I could not get this to work.
This seems to work:
df %>%
group_by(id) %>%
arrange(id, timeslot) %>%
mutate(match = station == s & ((timeslot + 1) %in% timeslot))
# # A tibble: 11 x 4
# # Groups: id [2]
# id station timeslot match
# <dbl> <dbl> <dbl> <lgl>
# 1 1 1 13 FALSE
# 2 1 2 14 FALSE
# 3 1 3 20 FALSE
# 4 1 3 21 FALSE
# 5 1 2 23 TRUE
# 6 1 2 24 FALSE
# 7 2 1 8 FALSE
# 8 2 1 9 FALSE
# 9 2 3 10 FALSE
# 10 2 2 15 TRUE
# 11 2 2 16 FALSE
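A note on why the apply() version misbehaved: inside mutate(), the pronoun `.` refers to the entire data frame that was piped in, not to the current group, so the any() check scanned every id. Bare column names, by contrast, are evaluated per group. A quick way to see the difference (illustrative only, assuming the full 11-row df from the top of the question):

df %>%
  group_by(id) %>%
  mutate(whole_df_rows = nrow(.), # 11 on every row: `.` is the whole input
         group_rows = n())        # 6 for id 1, 5 for id 2: n() is group-aware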
My sincere apologies if I understood the question wrong. This does what I understand from the question:
df$match = apply(df, 1, function(line) any(df$id == line[1] &
df$station == line[2] &
df$timeslot == line[3] + 1))
The result then is
id station timeslot match
1 1 1 13 FALSE
2 1 2 14 FALSE
3 1 3 20 TRUE
4 1 3 21 FALSE
5 1 2 24 FALSE
6 1 2 23 TRUE
7 2 1 8 TRUE
8 2 1 9 FALSE
9 2 3 10 FALSE
10 2 2 15 TRUE
11 2 2 16 FALSE
I have these two toy example tables:
Table 1:
attendance_events <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456","RC456","RA123","RB123","RC123","RA456","RB456","RC456"),
dates = c("2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02","2020-02-02"),
attendance = c(1,1,1,0,1,1,0,0,1,0,0,1),
stringsAsFactors = F)
attendance_events
student_id dates attendance
1 RA123 2020-02-01 1
2 RB123 2020-02-01 1
3 RC123 2020-02-01 1
4 RA456 2020-02-01 0
5 RB456 2020-02-01 1
6 RC456 2020-02-01 1
7 RA123 2020-02-02 0
8 RB123 2020-02-02 0
9 RC123 2020-02-02 1
10 RA456 2020-02-02 0
11 RB456 2020-02-02 0
12 RC456 2020-02-02 1
Table 2:
all_students <- data.frame(student_id = c("RA123","RB123","RC123","RA456","RB456",'RC456'),
school_id = c(1,1,1,1,1,2),
grade_level = c(10,10,9,9,11,11),
date_of_birth = c("1990-02-02","1990-02-02","1991-01-01","1991-02-01","1989-02-02","1989-02-02"),
hometown = c("farm","farm","farm","farm","farm","city"),
stringsAsFactors = F)
> all_students
student_id school_id grade_level date_of_birth hometown
1 RA123 1 10 1990-02-02 farm
2 RB123 1 10 1990-02-02 farm
3 RC123 1 9 1991-01-01 farm
4 RA456 1 9 1991-02-01 farm
5 RB456 1 11 1989-02-02 farm
6 RC456 2 11 1989-02-02 city
attendance in attendance_events is 0 if the student was absent that day.
My question: what is the most efficient way in R to find the grade_level with the largest drop-off in attendance between "2020-02-01" and "2020-02-02"?
My code is:
library(dplyr)
library(reshape2) # for dcast(); data.table::dcast would also work

# Only include absences because it will be a smaller dataset
att_ws_alt <- inner_join(attendance_events, all_students[, c("student_id", "grade_level")], by = "student_id") %>%
  filter(attendance == 0)
#Set days to check between
date_from <- "2020-02-01"
date_to <- "2020-02-02"
# Continuously pipe to avoid having to store and reference intermediates (?)
att_drop_alt <- att_ws_alt %>%
filter(dates %in% c(date_from, date_to)) %>%
group_by(grade_level,dates) %>%
summarize(absence_bydate = n()) %>%
dcast(grade_level ~ dates) %>%
sapply(FUN = function(x) { x[is.na(x)] <- 0; x}) %>%
as.data.frame() %>%
mutate("absence_change" = .[,3] - .[,2]) %>%
select(grade_level, absence_change) %>%
arrange(desc(absence_change))
>att_drop_alt
grade_level absence_change
1 10 2
2 11 1
3 9 0
However, this feels a bit complex for what seems like a reasonably simple question. I want to see other ways R programmers could answer this question, ideally for better performance but even readability would be good to see.
Thanks community!
With data.table
library(data.table)
setDT(attendance_events)[all_students, .SD[, .(sum(attendance)),
.(grade_level, dates)], on = .(student_id)][,
.(attendanace_change = diff(rev(V1))), .(grade_level)]
# grade_level attendanace_change
#1: 10 2
#2: 9 0
#3: 11 1
I guess this is a little more concise:
left_join(attendance_events, all_students, by = "student_id") %>%
group_by(grade_level, dates) %>%
summarise(attendance = sum(attendance)) %>%
group_by(grade_level) %>%
summarize(attendance_change = diff(attendance))
#> # A tibble: 3 x 2
#> grade_level attendance_change
#> <dbl> <dbl>
#> 1 9 0
#> 2 10 -2
#> 3 11 -1
Of course, if you want to count absences instead of attendances, just put a minus sign in front of the diff on the last line.
Sorry if this doesn't exactly answer your question, but I wouldn't want to unfairly accuse the students of being more absent than they were ;)
library(dplyr)
all_students %>%
left_join(attendance_events) %>%
mutate(dates = as.Date(dates)) %>%
group_by(grade_level, dates) %>%
summarise(NAbs = sum(ifelse(attendance == 0, 1, 0)),
N = n(),
pctAbs = NAbs / n() * 100) %>%
arrange(dates) %>%
mutate(change = pctAbs - lag(pctAbs)) %>%
ungroup() %>%
arrange(change)
# A tibble: 6 x 6
dates grade_level NAbs N pctAbs change
<date> <dbl> <dbl> <int> <dbl> <dbl>
1 2020-02-02 9 1 2 50 0
2 2020-02-02 11 1 2 50 50
3 2020-02-02 10 2 2 100 100
4 2020-02-01 9 1 2 50 NA
5 2020-02-01 10 0 2 0 NA
6 2020-02-01 11 0 2 0 NA
My dataset is large, containing many observations (Dependent variable = DV) on individuals (Name) across set periods (Period) of a testing session. A small example of my dataset is as follows:
ExampleData <- data.frame(Name = c("Tom","Tom","Tom","Tom","Tom","Tom","Tom","Tom", "Tom", "Tom",
"Ben","Ben","Ben","Ben","Ben","Ben","Ben","Ben", "Ben", "Ben"),
Period = c(0,0,1,1,1,0,0,0,1,1,
0,0,0,1,1,1,0,0,1,1),
DV = runif(20, 1.5, 2.8))
When ExampleData$Period==1 an individual is undergoing an exercise test, which varies in time/length. Breaks between tests are represented by ExampleData$Period==0. To avoid manually entering when a person is undergoing a test and adding the sequential periods in, I wish to include a column that declares when a group of 1's, separated by a group of 0's, is a new period, across each person's data. How do I go about doing this?
My anticipated output would be:
ExampleData$Descriptor <- c(NA,NA,"Period One", "Period One","Period One",NA,NA,NA,"Period Two","Period Two",
NA,NA,NA,"Period One","Period One","Period One",NA,NA,"Period Two","Period Two")
My question is similar to another of mine, located here, although I now have multiple entries for each individual. I have tried the dplyr syntax of:
Test_df <- ExampleData %>%
mutate(
Descriptor = case_when(
Period > 0 ~ "Period",
Period == 0 ~ "Rest"),
rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")),
Descriptor = case_when(
Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
TRUE ~ "Rest"),
rleid = NULL
)
Although, how do I account for each different Name/ individual in my dataset?
Thank you.
Here's an alternative approach with dplyr
library(dplyr)
ExampleData %>%
group_by(Name) %>%
mutate(Descriptor = with(rle(Period == 1),
rep(replace(paste("Period", cumsum(values)), !values, NA), lengths)))
# # A tibble: 20 x 4
# # Groups: Name [2]
# Name Period DV Descriptor
# <fctr> <dbl> <dbl> <chr>
# 1 Tom 0 2.641044 <NA>
# 2 Tom 0 2.692745 <NA>
# 3 Tom 1 1.515797 Period 1
# 4 Tom 1 2.601471 Period 1
# 5 Tom 1 1.669399 Period 1
# 6 Tom 0 2.700371 <NA>
# 7 Tom 0 1.993971 <NA>
# 8 Tom 0 2.203379 <NA>
# 9 Tom 1 2.488742 Period 2
# 10 Tom 1 1.596458 Period 2
# 11 Ben 0 2.578924 <NA>
# 12 Ben 0 1.916804 <NA>
# 13 Ben 0 2.676466 <NA>
# 14 Ben 1 2.508759 Period 1
# 15 Ben 1 2.447217 Period 1
# 16 Ben 1 2.728756 Period 1
# 17 Ben 0 2.326854 <NA>
# 18 Ben 0 1.748016 <NA>
# 19 Ben 1 1.703044 Period 2
# 20 Ben 1 1.783434 Period 2
Here is an option using data.table
library(data.table)
setDT(ExampleData)[ , grp := rleid(Period == 1), .(Name)][Period == 1,
Descriptor := paste("Period", match(grp, unique(grp))), Name][, grp := NULL][]
# Name Period DV Descriptor
# 1: Tom 0 2.764916 NA
# 2: Tom 0 1.537837 NA
# 3: Tom 1 1.848110 Period 1
# 4: Tom 1 2.621724 Period 1
# 5: Tom 1 2.206875 Period 1
# 6: Tom 0 1.715299 NA
# 7: Tom 0 1.882378 NA
# 8: Tom 0 2.244155 NA
# 9: Tom 1 2.094944 Period 2
#10: Tom 1 1.713493 Period 2
#11: Ben 0 1.794261 NA
#12: Ben 0 1.608199 NA
#13: Ben 0 2.053490 NA
#14: Ben 1 1.791563 Period 1
#15: Ben 1 1.652090 Period 1
#16: Ben 1 2.510483 Period 1
#17: Ben 0 2.345984 NA
#18: Ben 0 2.754110 NA
#19: Ben 1 1.675527 Period 2
#20: Ben 1 1.709622 Period 2
Base R option:
unlist(with(ExampleData, tapply(Period, Name, function(x) c(0, cumsum(ifelse(diff(x) < 0, 0, diff(x)))) * x)))
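Unpacking that one-liner on a single person's Period vector (a worked example of the diff()/cumsum() trick):

x <- c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1)  # Tom's Period column
d <- diff(x)                          # +1 where a test block starts, -1 where it ends
c(0, cumsum(ifelse(d < 0, 0, d)))     # running count of blocks started so far
c(0, cumsum(ifelse(d < 0, 0, d))) * x # 0 0 1 1 1 0 0 0 2 2: block number, zeroed during rests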
I was able to successfully complete this by running the following:
Test_df <- ExampleData %>%
group_by(Name) %>%
mutate(
Descriptor = case_when(
Period > 0 ~ "Period",
Period == 0 ~ "Rest"),
rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")),
Descriptor = case_when(
Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
TRUE ~ "Rest"),
rleid = NULL
)
I also used "Rest" instead of NA as this more accurately depicts what transpired.
I have the following dataset:
df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,0,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))
library(lubridate)
df$order_date = as_date(df$order_date)
user_id complete_order order_date
1 1 2015-01-28
1 0 2015-01-31
1 0 2015-02-08
1 1 2015-02-23
2 1 2015-01-25
2 0 2015-01-28
2 0 2015-02-06
2 1 2015-02-21
I'm trying to calculate the difference in days between only completed orders for each user. The desirable outcome would look like this:
user_id complete_order order_date complete_order_time_diff
<fctr> <fctr> <date> <time>
1 1 2015-01-28 NA days
1 0 2015-01-31 3 days
1 0 2015-02-08 11 days
1 1 2015-02-23 26 days
2 1 2015-01-25 NA days
2 0 2015-01-28 3 days
2 0 2015-02-06 12 days
2 1 2015-02-21 27 days
when I try this solution:
library(dplyr)
df %>%
group_by(user_id) %>%
mutate(complete_order_time_diff = order_date[complete_order == 1] - lag(order_date[complete_order == 1]))
it returns the error:
Error: incompatible size (3), expecting 4 (the group size) or 1
Any help with this will be great, thank you!
Try this:
library(dplyr)
df %>% group_by(user_id, complete_order) %>%
mutate(c1 = order_date - lag(order_date)) %>%
group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% ungroup %>%
mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>%
select(-c(c1, c2))
Update
For multiple cancelled orders:
df %>% mutate(c3=cumsum( complete_order != "0")) %>% group_by(user_id, complete_order) %>%
mutate(c1 = order_date - lag(order_date)) %>%
group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>%
mutate(c2=as.numeric(c2)) %>% group_by(user_id, c3) %>%
mutate(c2=cumsum(ifelse(complete_order==1, 0, c2))) %>% ungroup %>%
mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>%
select(-c(c1, c2, c3))
Logic:
c3 is an id that increments by 1 every time there is a completed order (i.e. complete_order is not 0).
c1 calculates the day difference by user_id (but for non-completed orders the result is wrong).
c2 fixes this inconsistency of c1 with respect to non-completed orders.
hope this clears things.
I would suggest you work with combinations of group_by() and mutate(cumsum()) to better understand the results of having more than one grouped variable.
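As a minimal illustration of that suggestion (using the df from the question; complete_order compares as the string "1" because of the cbind() construction):

library(dplyr)
df %>%
  group_by(user_id) %>%
  mutate(order_seq = cumsum(complete_order == "1")) # running count of completed orders per user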
It seems that you're looking for the distance of each order from the last completed one. Given a binary vector x, c(NA, cummax(x * seq_along(x))[-length(x)]) gives the index of the last "1" seen before each element. Then, subtracting the "order_date" at that respective index from each element of "order_date" gives the desired output. E.g.
set.seed(1453); x = sample(0:1, 10, TRUE)
set.seed(1821); y = sample(5, 10, TRUE)
cbind(x, y,
last_x = c(NA, cummax(x * seq_along(x))[-length(x)]),
y_diff = y - y[c(NA, cummax(x * seq_along(x))[-length(x)])])
# x y last_x y_diff
# [1,] 1 3 NA NA
# [2,] 0 3 1 0
# [3,] 1 5 1 2
# [4,] 0 1 3 -4
# [5,] 0 3 3 -2
# [6,] 1 5 3 0
# [7,] 1 1 6 -4
# [8,] 0 3 7 2
# [9,] 0 4 7 3
#[10,] 1 5 7 4
On your data, first format df for convenience:
df$order_date = as.Date(df$order_date)
df$complete_order = df$complete_order == "1" # lose the 'factor'
And, then, either apply the above approach after a group_by:
library(dplyr)
df %>% group_by(user_id) %>%
mutate(time_diff = order_date -
order_date[c(NA, cummax(complete_order * seq_along(complete_order))[-length(complete_order)])])
Or perhaps try operations that avoid grouping altogether (assuming "user_id" is ordered), after accounting for the indices where "user_id" changes:
# save variables to vectors and keep a "logical" of when "id" changes
id = df$user_id
id_change = c(TRUE, id[-1] != id[-length(id)])
compl = df$complete_order
dord = df$order_date
# accounting for changes in "id", locate last completed order
i = c(NA, cummax((compl | id_change) * seq_along(compl))[-length(compl)])
is.na(i) = id_change
dord - dord[i]
#Time differences in days
#[1] NA 3 11 26 NA 3 12 27
I think you can use filter() in place of the subsetting with order_date[complete_order == 1], and make sure order_date (and the other variables) have the correct data types by adding stringsAsFactors = F to data.frame():
df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,1,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')),
stringsAsFactors = F)
df$order_date <- lubridate::ymd(df$order_date)
df %>%
group_by(user_id) %>%
filter(complete_order == 1) %>%
mutate(complete_order_time_diff = order_date - lag(order_date))
This returns, for each completed order, the days since the user's previous completed order (and NA when there is none):
user_id complete_order order_date complete_order_time_diff
<chr> <chr> <date> <time>
1 1 1 2015-01-28 NA days
2 1 1 2015-01-31 3 days
3 1 1 2015-02-23 23 days
4 2 1 2015-01-25 NA days
5 2 1 2015-01-28 3 days
6 2 1 2015-02-21 24 days
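If the incomplete orders should stay in the result (with NA for the gap, rather than the running gaps shown in the desired output), one sketch is to compute on the completed rows and join back:

completed <- df %>%
  filter(complete_order == "1") %>%
  group_by(user_id) %>%
  mutate(complete_order_time_diff = order_date - lag(order_date)) %>%
  ungroup()

df %>% left_join(completed, by = c("user_id", "complete_order", "order_date"))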