I have
household person start time end time
1 1 07:45:00 21:45:00
1 2 09:45:00 17:45:00
1 3 22:45:00 23:45:00
1 4 08:45:00 01:45:00
1 1 23:50:00 24:00:00
2 1 07:45:00 21:45:00
2 2 16:45:00 22:45:00
I want to add a column indicating overlapping time between family members.
The indicator is 1 if a person's start-end interval intersects another household member's interval, and 0 otherwise.
In the first household above, the intervals of the first, second and fourth persons intersect, so their indicator is 1, while the third and fifth rows do not intersect with any other member of the household, so theirs is 0.
output:
household person start time end time overlap
1 1 07:45:00 21:45:00 1
1 2 09:45:00 17:45:00 1
1 3 22:45:00 23:45:00 0
1 4 08:45:00 01:45:00 1
1 1 23:50:00 24:00:00 0
2 1 07:45:00 21:45:00 1
2 2 16:45:00 22:45:00 1
data with dput format:
structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L), PERNO = c(1,
1, 1, 1, 1, 1), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400
), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
I have tried a tidyverse solution:
library(tidyverse)
df = structure(list(SAMPN = c(1L, 1L, 1L, 2L, 2L, 2L),
PERNO = c(1:3, 1:3), arr = structure(c(30300, 35280, 37200, 32400,
34200, 39600), class = c("hms", "difftime"), units = "secs"),
dep = structure(c(34200, 36300, 61800, 33600, 37800, 50400), class = c("hms", "difftime"), units = "secs")), class = c("grouped_df","tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(SAMPN = 1:2, PERNO = c(1, 1), .rows = list(1:3, 4:6)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
Then, I added:
df %>% group_by(SAMPN) %>%
  mutate(
    arr_min = mapply(function(x) min(arr[-x]), 1:n()),  # earliest arrival among the other household members
    dep_max = mapply(function(x) max(dep[-x]), 1:n()),  # latest departure among the other household members
    overlap = ifelse(arr < arr_min | dep > dep_max, 0, 1)
  )
You will get:
SAMPN PERNO arr dep arr_min dep_max overlap
<int> <int> <time> <time> <dbl> <dbl> <dbl>
1 1 1 08:25 09:30 35280 61800 0
2 1 2 09:48 10:05 30300 61800 1
3 1 3 10:20 17:10 30300 36300 0
4 2 1 09:00 09:20 34200 50400 0
5 2 2 09:30 10:30 32400 50400 1
6 2 3 11:00 14:00 32400 37800 0
You basically compare the current arr and dep with arr_min (the minimum arr excluding the current row) and dep_max (the maximum dep excluding the current row).
Tidyverse solution
Here's a solution in tidyverse syntax. The basic idea is the same: we perform a many-to-many merge matching on household (SAMPN in your current example data) and remove the cases where a person is compared to themselves (same PERNO). We check for overlaps, then collapse to a single record per household and person. Note that this approach breaks down if all records share the same PERNO (as in the dput above, where every PERNO is 1), because every comparison gets filtered out.
compare <-
  df %>%
  left_join(df %>%
              rename(compare_PERNO = PERNO,
                     compare_arr = arr,
                     compare_dep = dep),
            by = "SAMPN") %>%                                     # match every person to every other row in the household
  filter(PERNO != compare_PERNO) %>%                              # drop self-comparisons
  mutate(overlap = arr <= compare_dep & dep >= compare_arr) %>%   # interval intersection test
  group_by(SAMPN, PERNO) %>%
  summarize(overlap = max(overlap))                               # 1 if any other member overlaps, else 0
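If you also want the flag back on every original row (as in the desired output above), a minimal sketch, assuming the compare table just built, is to left join it back onto df; persons that were filtered out because they had nobody to compare against come back as NA and can be set to 0:
df %>%
  left_join(compare, by = c("SAMPN", "PERNO")) %>%
  mutate(overlap = coalesce(overlap, 0L))  # rows with no comparison partner get 0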
SQL Solution with household grouping
Grouping the data by household actually makes this problem slightly easier. Again, I'm using SQL to solve it. In the inner SQL statement I do a many-to-many merge matching all members of a household to all other members and remove any cases of matching a person to themselves. Then, in the outer SQL statement, we reduce to one record per household and person, which indicates whether they ever overlapped.
df = data.frame(
household = c(rep(1,5), rep(2,2)),
person = c(1:5, 1:2),
start_time=as.Date(c("2017-05-31","2018-01-14", "2019-02-03", "2018-01-19", "2019-04-17",
"2018-02-03", "2018-03-03"),
format="%Y-%m-%d"),
end_time=as.Date(c("2018-01-17", "2018-01-20", "2019-04-15", "2018-02-20", "2019-05-17",
"2019-03-03", "2019-03-03"),
format="%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
"
SELECT * FROM (
SELECT L.* ,
CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
ELSE 0 END AS overlap
FROM df as L
LEFT JOIN df as R ON L.household = R.household
WHERE L.person != R.person
)
GROUP BY household, person
HAVING overlap = MAX(overlap)
"
)
SQL Solution without household grouping
This is an SQL solution to your problem. I do a keyless many-to-many merge to compare each row to every other row (but don't compare a row to itself), then I pare the big data frame down to a single record per ID that records whether any matches were found. Your data isn't quite a reprex (use the dput function in R), so I used an example dataset I had lying around. If you have trouble adapting this to your exact data, post reproducible data and I can help you out.
df = data.frame(
id = 1:3,
start_time=as.Date(c("2017-05-31","2018-01-14", "2018-02-03"), format="%Y-%m-%d"),
end_time=as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format="%Y-%m-%d"))
library(sqldf)
compare <- sqldf(
"
SELECT * FROM (
SELECT L.* ,
CASE WHEN L.start_time <= R.end_time AND L.end_time >= R.start_time THEN 1
ELSE 0 END AS overlap
FROM df as L
CROSS JOIN df as R
WHERE L.id != R.id
)
GROUP BY id
HAVING overlap = MAX(overlap)
"
)
Related
id drug_name med_start med_end
<dbl> <chr> <date> <date>
1 pembrolizumab 2018-02-07 2018-02-07
1 pembrolizumab 2018-02-28 2018-02-28
2 pembrolizumab 2018-01-05 2018-01-05
2 nivolumab 2018-09-20 2018-09-20
2 nivolumab 2018-10-03 2018-10-03
2 nivolumab 2018-11-01 2018-11-01
I am trying to get ids that have both pembrolizumab and nivolumab in drug_name. Can I do a group_by over id and then filter with both conditions?
For the above table, id 2 has both drug_names. I might have situations where I will be filtering on more than 2 drug_names.
I am also trying to check whether the gap between two med_start dates is greater than x days, say 30 days. Basically, filter ids that have a gap of 30 days between med_start dates.
Here is the code for above data
data <- structure(list(id = structure(c(1, 1, 2, 2, 2, 2), class = "int"),
drug_name = c("pembrolizumab", "pembrolizumab", "pembrolizumab",
"nivolumab", "nivolumab", "nivolumab"), med_start = structure(c(17569,
17590, 17536, 17794, 17807, 17836), class = "Date"), med_end = structure(c(17569,
17590, 17536, 17794, 17807, 17836), class = "Date")), row.names = c(NA,
-6L), groups = structure(list(patient_id = structure(c(1.49283861796358e-314,
1.6423825257779e-313), class = "integer64"), .rows = structure(list(
1:2, 3:6), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
We group by 'id', filter for groups where all the drugs of interest are %in% the 'drug_name' column, and extract the unique 'id' values.
library(dplyr)
data %>%
group_by(id) %>%
filter(all(c("pembrolizumab", "nivolumab") %in% drug_name)) %>%
ungroup %>%
pull(id)%>%
unique
-output
[1] 2
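For the second part (keeping ids whose consecutive med_start dates are no more than 30 days apart), a dplyr sketch along the same lines might look like this; it keeps ids whose largest gap between consecutive med_start dates is at most 30 days, and assumes each id has at least two rows:
library(dplyr)
data %>%
  group_by(id) %>%
  arrange(med_start, .by_group = TRUE) %>%
  filter(max(diff(med_start)) <= 30) %>%  # largest gap between consecutive starts is at most 30 days
  ungroup()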
Here are some base R options
for the first question
> unique(
+ subset(
+ data,
+ ave(match(drug_name, c("pembrolizumab", "nivolumab")), id, FUN = var) > 0,
+ select = id
+ )
+ )
# A tibble: 1 x 1
id
<int>
1 2
for the second question
> subset(
+ data,
+ ave(as.integer(med_start), id, FUN = function(x) max(diff(x))) <= 30
+ )
# A tibble: 2 x 4
id drug_name med_start med_end
<int> <chr> <date> <date>
1 1 pembrolizumab 2018-02-07 2018-02-07
2 1 pembrolizumab 2018-02-28 2018-02-28
Background
Here's a df with some data in it from a Costco-like members-only big-box store:
d <- data.frame(ID = c("a","a","b","c","c","d"),
purchase_type = c("grocery","grocery",NA,"auto","grocery",NA),
date_joined = as.Date(c("2014-01-01","2014-01-01","2013-04-30","2009-03-08","2009-03-08","2015-03-04")),
date_purchase = as.Date(c("2014-04-30","2016-07-08","2013-06-29","2015-04-07","2017-09-10","2017-03-10")),
stringsAsFactors=T)
library(dplyr)
d <- d %>%
  mutate(date_diff = date_purchase - date_joined)
This yields the following table:
As you can see, it's got a member ID, purchase types based on the broad category of what people bought, and two dates: the date the member originally became a member, and the date of a given purchase. I've also made a variable date_diff to tally the time between a given purchase and the beginning of membership.
The Problem
I'd like to make a new variable early_shopper that's marked 1 on all of a member's purchases if
That member's first purchase was made within a year of joining (so date_diff <= 365 days).
This first purchase doesn't have an NA in purchase_type.
If these criteria aren't met, give a 0.
What I'm looking for is a table that looks like this:
Note that Member a is the only "true" early_shopper: their first purchase is non-NA in purchase_type, and only 119 days passed between their joining the store and making a purchase there. Member b looks like they could qualify based on my date_diff criterion, but since they don't have a non-NA value in purchase_type, they don't count as an early_shopper.
What I've Tried
So far, I've tried using the mutate and first functions like this:
d <- d %>%
  mutate(early_shopper = if_else(!is.na(first(purchase_type, order_by = date_joined)) & date_diff < 365,
                                 1, 0))
Which gives me this:
Something's kinda working here, but not fully. As you can see, I get the correct early_shopper = 1 in Member a's first purchase, but not their second. I also get a false positive with member b, who's marked as an early_shopper when I don't want them to be (because their purchase_type is NA).
Any ideas? I can further clarify if need be. Thanks!
You could use
library(dplyr)
d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID, purchase_type) %>%
arrange(ID, date_joined) %>%
mutate(
early_shopper = +(!is.na(first(purchase_type)) & date_diff <= 365)
) %>%
group_by(ID) %>%
mutate(early_shopper = max(early_shopper)) %>%
ungroup()
which returns
# A tibble: 6 x 6
ID purchase_type date_joined date_purchase date_diff early_shopper
<fct> <fct> <date> <date> <drtn> <int>
1 a grocery 2014-01-01 2014-04-30 119 days 1
2 a grocery 2014-01-01 2016-07-08 919 days 1
3 b NA 2013-04-30 2013-06-29 60 days 0
4 c auto 2009-03-08 2015-04-07 2221 days 0
5 c grocery 2009-03-08 2017-09-10 3108 days 0
6 d NA 2015-03-04 2017-03-10 737 days 0
If you want the early_shopper column to be boolean/logical, just remove the +.
Data
I used this data; here the date_joined for b is 2013-04-30, as shown in your images, not as in the data you actually posted.
structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), purchase_type = structure(c(2L,
2L, NA, 1L, 2L, NA), .Label = c("auto", "grocery"), class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311,
16498), class = "Date"), date_purchase = structure(c(16190,
16990, 15885, 16532, 17419, 17235), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
Here is my approach using a join to get the early_shopper value to be the same for all rows of the same ID.
library(dplyr)
d <- structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L),
.Label = c("a","b", "c", "d"),
class = "factor"),
purchase_type = structure(c(2L, 2L, NA, 1L, 2L, NA),
.Label = c("auto", "grocery"),
class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311, 16498),
class = "Date"),
date_purchase = structure(c(16190, 16990, 15885, 16532, 17419, 17235),
class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
d %>%
inner_join(d %>%
mutate(date_diff = d$date_purchase - d$date_joined) %>%
group_by(ID) %>%
slice_min(date_diff) %>%
transmute(early_shopper = if_else(!is.na(first(purchase_type,
order_by = date_joined)) &
date_diff < 365, 1, 0)) %>%
ungroup()
)
ID purchase_type date_joined date_purchase early_shopper
1 a grocery 2014-01-01 2014-04-30 1
2 a grocery 2014-01-01 2016-07-08 1
3 b <NA> 2013-04-30 2013-06-29 0
4 c auto 2009-03-08 2015-04-07 0
5 c grocery 2009-03-08 2017-09-10 0
6 d <NA> 2015-03-04 2017-03-10 0
So I have a table of customers with the respective date as below:
ID  Date
1   2019-04-17
4   2019-05-12
1   2019-04-25
2   2019-05-19
I just want to count how many customers there are for each month-year, like below:
Month-Year  Count of Customer
Apr-19      2
May-19      2
EDIT:
Sorry, I think my question should be clearer.
The same customer can appear more than once in a month and should be counted as 2 customers for that month. Basically, I would like to find the number of transactions per month based on customer id.
My assumed approach would be to first change the date into a month-year format, and then count each customer grouped by month, but I am not sure how to do this in R. Thank you!
You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
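If at some point you do want unique customers per month rather than transactions, a small variant of the dplyr approach (a sketch using n_distinct on the same df; the column names are just illustrative) could be:
library(dplyr)
df %>%
  group_by(Month_Year = format(as.Date(Date), '%b-%y')) %>%
  summarise(customers = n_distinct(ID))  # distinct IDs per month-year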
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
Date n
1 Apr 2019 2
2 May 2019 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
I have 3 data frames: df1 = time intervals, df2 = a list of IDs, and df3 = a list of IDs with an associated date-time.
df1 <- structure(list(season = structure(c(2L, 1L), .Label = c("summer",
"winter"), class = "factor"), mindate = structure(c(1420088400,
1433131200), class = c("POSIXct", "POSIXt")), maxdate = structure(c(1433131140,
1448945940), class = c("POSIXct", "POSIXt")), diff = structure(c(150.957638888889,
183.040972222222), units = "days", class = "difftime")), .Names = c("season",
"mindate", "maxdate", "diff"), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(ID = c(23796, 23796, 23796)), .Names = "ID", row.names = c(NA,
-3L), class = "data.frame")
df3 <- structure(list(ID = c("23796", "123456", "12134"), time = structure(c(1420909920,
1444504500, 1444504500), class = c("POSIXct", "POSIXt"), tzone = "US/Eastern")), .Names = c("ID",
"time"), row.names = c(NA, -3L), class = "data.frame")
The code should check whether df2$ID == df3$ID. If true, and if df3$time >= df1$mindate and df3$time <= df1$maxdate, then compute df1$maxdate - df3$time, else df1$maxdate - df1$mindate. I tried using the ifelse function. This works when I manually specify specific cells, but this is not what I want, as I have many more rows (and uneven numbers of them) in each of the dfs.
df1$result <- ifelse(df2[1,1] == df3[1,1] & df3[1,2] >= df1$mindate & df3[1,2] <= df1$maxdate,
                     difftime(df1$maxdate, df3[1,2], units = "days"),
                     difftime(df1$maxdate, df1$mindate, units = "days"))
EDIT: The desired output is (when removing last row of df2):
season mindate maxdate diff result
1 winter 2015-01-01 2015-05-31 23:59:00 150.9576 days 141.9576
2 summer 2015-06-01 2015-11-30 23:59:00 183.0410 days 183.0410
Any ideas? I don't see how I could merge the dfs to make them the same length. Note that df2 can be of any row length without affecting the code. Issues arise when df1 and df3 differ in the number of rows.
The > and < are vectorized:
transform(df1, result = ifelse(df3$ID %in% df2$ID & df3$time > mindate & df3$time < maxdate,
                               difftime(maxdate, df3$time),
                               difftime(maxdate, mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
You can also use the %between% operator from the data.table library:
library(data.table)
transform(df1, result = ifelse(df3$ID %in% df2$ID & df3$time %between% df1[2:3],
                               difftime(maxdate, df3$time),
                               difftime(maxdate, mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
So I've been trying to get my head around this but I can't figure out how to do it.
This is an example:
ID Hosp. date Discharge date
1 2006-02-02 2006-02-04
1 2006-02-04 2006-02-18
1 2006-02-22 2006-03-24
1 2008-08-09 2008-09-14
2 2004-01-03 2004-01-08
2 2004-01-13 2004-01-15
2 2004-06-08 2004-06-28
What I want is a way to combine rows by ID if the Hosp. date in the next row is the same as the discharge date (or within +/- 7 days of it). So it would look like this:
ID Hosp. date Discharge date
1 2006-02-02 2006-03-24
1 2008-08-09 2008-09-14
2 2004-01-03 2004-01-15
2 2004-06-08 2004-06-28
Using the data.table-package:
# load the package
library(data.table)
# convert to a 'data.table'
setDT(d)
# make sure you have the correct order
setorder(d, ID, Hosp.date)
# summarise
d[, grp := cumsum(Hosp.date > (shift(Discharge.date, fill = Discharge.date[1]) + 7))
, by = ID
][, .(Hosp.date = min(Hosp.date), Discharge.date = max(Discharge.date))
, by = .(ID,grp)]
you get:
ID grp Hosp.date Discharge.date
1: 1 0 2006-02-02 2006-03-24
2: 1 1 2008-08-09 2008-09-14
3: 2 0 2004-01-03 2004-01-15
4: 2 1 2004-06-08 2004-06-28
The same logic with dplyr:
library(dplyr)
d %>%
arrange(ID, Hosp.date) %>%
group_by(ID) %>%
mutate(grp = cumsum(Hosp.date > (lag(Discharge.date, default = Discharge.date[1]) + 7))) %>%
group_by(grp, add = TRUE) %>%
summarise(Hosp.date = min(Hosp.date), Discharge.date = max(Discharge.date))
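As a side note, group_by(grp, add = TRUE) has since been superseded; with current dplyr (version 1.0 or later) the same pipeline would typically use .add and an explicit .groups, roughly like this sketch:
library(dplyr)
d %>%
  arrange(ID, Hosp.date) %>%
  group_by(ID) %>%
  mutate(grp = cumsum(Hosp.date > (lag(Discharge.date, default = Discharge.date[1]) + 7))) %>%
  group_by(grp, .add = TRUE) %>%
  summarise(Hosp.date = min(Hosp.date), Discharge.date = max(Discharge.date), .groups = "drop")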
Used data:
d <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
Hosp.date = structure(c(13181, 13183, 13201, 14100, 12420, 12430, 12577), class = "Date"),
Discharge.date = structure(c(13183, 13197, 13231, 14136, 12425, 12432, 12597), class = "Date")),
.Names = c("ID", "Hosp.date", "Discharge.date"), class = "data.frame", row.names = c(NA, -7L))